Local OCR WiP
We are going to take a Markdown text document, render it to a PDF, convert the PDF to PNG images, convert those to grayscale, and then apply OCR to get the text back. This is almost a loop, since we start with text and end with text.
Setup
brew install pandoc basictex imagemagick tesseract
eval "$(/usr/libexec/path_helper)"
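Before going further it can help to confirm that the tools actually ended up on the PATH. The function below is a minimal sketch; `missing_tools` is a hypothetical name of ours, not part of any of the installed packages.

```shell
# Hypothetical helper: print the name of every command given as an
# argument that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools pandoc pdflatex convert tesseract` should print nothing when the setup worked. `pdflatex` is in the list because it is pandoc's default PDF engine.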
Rendering a PDF
cd ocr
pandoc ../_posts/2023-03-05-ways-of-working.md \
-o ways-of-working.pdf \
--include-in-header=<(echo "\pagenumbering{gobble}")
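The `\pagenumbering{gobble}` header removes page numbers from the PDF, so they do not show up later as stray digits in the OCR output. The `<( ... )` syntax needs bash or zsh. As a sketch for other shells, the same header can be written to a temporary file first; `render_pdf` and its arguments are hypothetical names used only for illustration.

```shell
# Hypothetical helper: render a Markdown file to a PDF without page numbers.
# Usage: render_pdf input.md output.pdf
render_pdf() {
  src=$1
  out=$2
  # Write the LaTeX snippet to a temporary file instead of using <( ... ).
  header=$(mktemp "${TMPDIR:-/tmp}/header.XXXXXX")
  printf '%s\n' '\pagenumbering{gobble}' > "$header"
  pandoc "$src" -o "$out" --include-in-header="$header"
  status=$?
  # Clean up the temporary file and report pandoc's exit status.
  rm -f "$header"
  return "$status"
}
```

For the document above this would be called as `render_pdf ../_posts/2023-03-05-ways-of-working.md ways-of-working.pdf`.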
Convert it to a high quality grayscale image
convert -density 300 -type Grayscale ways-of-working.pdf ways-of-working-%d.png
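As far as we know, ImageMagick 7 treats `convert` as a legacy name and prefers `magick`, and recent releases may print a deprecation warning. The sketch below uses whichever of the two is available and passes the same options as the command above. `pdf_to_gray_pngs` is a hypothetical name, and the output naming (`<name>-0.png`, `<name>-1.png`, ...) is derived from the PDF's file name.

```shell
# Hypothetical helper: convert each page of a PDF to a 300 dpi grayscale PNG.
# Usage: pdf_to_gray_pngs ways-of-working.pdf
pdf_to_gray_pngs() {
  pdf=$1
  base=${pdf%.pdf}            # strip the .pdf suffix to build the output name
  if command -v magick >/dev/null 2>&1; then
    im=magick                 # ImageMagick 7 entry point
  else
    im=convert                # older installs only ship convert
  fi
  # %d is replaced by the page index, starting at 0.
  "$im" -density 300 -type Grayscale "$pdf" "$base-%d.png"
}
```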
Read the image
cat \
<(tesseract -l eng ways-of-working-0.png -) \
<(tesseract -l eng ways-of-working-1.png -) \
> ways-of-working.txt
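The command above hardcodes two pages. For documents of any length, a loop over the page index is one option; counting upwards also keeps page 10 after page 9, which a plain `*.png` glob would not. This is a minimal sketch, and `ocr_pages` is a hypothetical name.

```shell
# Hypothetical helper: OCR every page image named <base>-0.png, <base>-1.png, ...
# in numeric order and write the combined text to stdout.
# Usage: ocr_pages ways-of-working > ways-of-working.txt
ocr_pages() {
  base=$1
  i=0
  # Count upwards until the next page image does not exist.
  while [ -f "$base-$i.png" ]; do
    tesseract -l eng "$base-$i.png" -
    i=$((i + 1))
  done
}
```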