MobileRead Forums - View Single Post

willus · 02-15-2023, 11:18 AM

As promised, here is a Windows batch file that uses cpdf to keep the original bitmaps while adding a text layer from k2pdfopt's OCR. It should be easy to convert to Linux.

Code:

rem
rem Step 1.  Do the OCR
rem          Typically set dpi to at least 300 for best results
rem          -ocrd p sets detection at the page level.  This means
rem                  that the Tesseract algorithm will be used to
rem                  find text on the page rather than the k2pdfopt
rem                  algorithm.
rem          You may also wish to add -g, -cmax, or -s option
rem          adjustments to improve the bitmap contrast or
rem          sharpness and resulting OCR quality.
rem
k2pdfopt -mode copy -dpi 300 -ocr t -ocrd p src.pdf -o temp1.pdf
rem
rem Step 2.  Replace the bitmap in the result with a very low density,
rem          low res bitmap (which will later be ignored / made invisible),
rem          but keep the text layer.
rem
k2pdfopt -mode copy -dpi 5 -bpc 1 -g 100 -cmax -100 -s- temp1.pdf -o temp2.pdf
del /q temp1.pdf
rem
rem Step 3.  Pair the text layer in temp2.pdf with the bitmaps in src.pdf.
rem          Put the result in src_searchable.pdf.
rem
cpdf -draft temp2.pdf -o temp3.pdf
del /q temp2.pdf
cpdf -combine-pages src.pdf temp3.pdf -o src_searchable.pdf
del /q temp3.pdf
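
For anyone on Linux, here is a rough sketch of how the same three steps might look as a POSIX shell function. It is untested against real files and assumes the Linux builds of k2pdfopt and cpdf are on the PATH and accept the same options as in the batch file above. The function name make_searchable is just a placeholder, not something from either tool. Unlike the batch file, it takes the source file name as an argument and stops at the first failing step, leaving the temp files in place so you can inspect them.

```shell
# Sketch of a Linux port of the batch file above (untested assumption:
# k2pdfopt and cpdf take the same options on Linux as on Windows).
# make_searchable is a hypothetical helper name.
make_searchable() {
    src=$1
    out=${src%.pdf}_searchable.pdf

    # Step 1. Do the OCR at 300 dpi with page-level detection.
    k2pdfopt -mode copy -dpi 300 -ocr t -ocrd p "$src" -o temp1.pdf || return 1

    # Step 2. Replace the bitmap with a very low-res one, keeping the text layer.
    k2pdfopt -mode copy -dpi 5 -bpc 1 -g 100 -cmax -100 -s- temp1.pdf -o temp2.pdf || return 1
    rm -f temp1.pdf

    # Step 3. Pair the text layer in temp2.pdf with the bitmaps in the source.
    cpdf -draft temp2.pdf -o temp3.pdf || return 1
    rm -f temp2.pdf
    cpdf -combine-pages "$src" temp3.pdf -o "$out" || return 1
    rm -f temp3.pdf
}
```

After sourcing the file, running make_searchable src.pdf should write src_searchable.pdf to the current directory, the same as the batch version.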