We are going to take a Markdown document, render it to a PDF, convert the PDF to grayscale PNG images, and then apply OCR to get the text back. This is almost a loop: we start with text and end with text.

Setup

brew install pandoc basictex imagemagick tesseract
eval "$(/usr/libexec/path_helper)"
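
Before moving on, it can help to confirm that the commands the later steps rely on are actually on the PATH. The following is a minimal sketch rather than a required step. `missing_tools` is a hypothetical helper name, and `pdflatex` is listed on the assumption that pandoc uses its default PDF engine.

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, named here for illustration only.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The commands used in the steps below; no output means all were found.
missing_tools pandoc pdflatex convert tesseract
```

If a name is printed, opening a new terminal or re-running the path_helper line above is a reasonable first thing to try.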

Rendering a PDF

cd ocr
pandoc ../_posts/2023-03-05-ways-of-working.md \
    -o ways-of-working.pdf \
    --include-in-header=<(printf '%s\n' '\pagenumbering{gobble}')

Convert it to a high quality grayscale image

convert -density 300 -type Grayscale ways-of-working.pdf ways-of-working-%d.png

This produces one image per page: ways-of-working-0.png and ways-of-working-1.png.

Read the image

cat \
    <(tesseract -l eng ways-of-working-0.png -) \
    <(tesseract -l eng ways-of-working-1.png -) \
> ways-of-working.txt
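
The command above hard-codes two pages. For a document of any length, a loop over the generated files could do the same job. This is only a sketch under stated assumptions: `ocr_pages` is a hypothetical helper name, and the page images are assumed to be numbered consecutively from 0, as the %d pattern in the convert step produces.

```shell
# OCR every page image named PREFIX-0.png, PREFIX-1.png, ... in page order
# and write the recognized text to stdout.
# ocr_pages is a hypothetical helper name; it assumes pages are numbered
# consecutively from 0 and stops at the first missing number.
ocr_pages() {
    prefix=$1
    page=0
    while [ -f "$prefix-$page.png" ]; do
        # "-" as the output base tells tesseract to print to stdout
        tesseract -l eng "$prefix-$page.png" -
        page=$((page + 1))
    done
}

# Same result as the two-page command above, for any number of pages.
ocr_pages ways-of-working > ways-of-working.txt
```

Counting upward instead of globbing keeps the pages in numeric order, so page 10 lands after page 9 rather than after page 1.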