Old 02-15-2023, 10:18 AM   #10
willus
As promised, here is a Windows batch file that uses cpdf to keep the original bitmaps while adding a text layer from k2pdfopt's OCR. It should be easy to convert to Linux.
Code:
rem
rem Step 1.  Do the OCR
rem          Typically set dpi to at least 300 for best results
rem          -ocrd p sets detection at the page level.  This means
rem                  that the Tesseract algorithm will be used to
rem                  find text on the page rather than the k2pdfopt
rem                  algorithm.
rem          You may also wish to add -g, -cmax, or -s option
rem          adjustments to improve the bitmap contrast or
rem          sharpness and resulting OCR quality.
rem
k2pdfopt -mode copy -dpi 300 -ocr t -ocrd p src.pdf -o temp1.pdf
rem
rem Step 2.  Replace the bitmap in the result with a very low density,
rem          low res bitmap (which will later be ignored / made invisible),
rem          but keep the text layer.
rem
k2pdfopt -mode copy -dpi 5 -bpc 1 -g 100 -cmax -100 -s- temp1.pdf -o temp2.pdf
del /q temp1.pdf
rem
rem Step 3.  Pair the text layer in temp2.pdf with the bitmaps in src.pdf.
rem          Put the result in src_searchable.pdf.
rem
cpdf -draft temp2.pdf -o temp3.pdf
del /q temp2.pdf
cpdf -combine-pages src.pdf temp3.pdf -o src_searchable.pdf
del /q temp3.pdf
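For anyone on Linux, here is a rough sketch of the same three steps as a POSIX shell function. It is a straight translation of the batch file above, not something tested against real files. It assumes k2pdfopt and cpdf are on your PATH and accept the same options as the Windows builds. The function name `make_searchable` and the derived output name are just illustrative choices.

```shell
#!/bin/sh
# Sketch of a Linux port of the batch file above.
# Assumptions: k2pdfopt and cpdf are on PATH and take the same options
# as on Windows; make_searchable is a made-up name for illustration.

make_searchable() {
    src="$1"                            # input PDF, e.g. src.pdf
    out="${src%.pdf}_searchable.pdf"    # src.pdf -> src_searchable.pdf

    # Step 1. Do the OCR at 300 dpi, with Tesseract page-level detection.
    k2pdfopt -mode copy -dpi 300 -ocr t -ocrd p "$src" -o temp1.pdf || return 1

    # Step 2. Replace the bitmap with a very low-res one (ignored later),
    #         but keep the text layer.
    k2pdfopt -mode copy -dpi 5 -bpc 1 -g 100 -cmax -100 -s- temp1.pdf -o temp2.pdf || return 1
    rm -f temp1.pdf

    # Step 3. Pair the text layer in temp2.pdf with the bitmaps in the source.
    cpdf -draft temp2.pdf -o temp3.pdf || return 1
    rm -f temp2.pdf
    cpdf -combine-pages "$src" temp3.pdf -o "$out" || return 1
    rm -f temp3.pdf
}
```

After sourcing the file, run it as `make_searchable src.pdf`. Each step stops on failure, so a failed OCR pass does not get combined with the original; the temp file from the failed step is left in place for inspection.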