MobileRead Forums - View Single Post

DSpider · 01-23-2012, 10:43 AM

TIFF files can't carry a text layer underneath the image, so you will lose the OCR (text) if you export the PDF as a bunch of images. Even if you could keep it somehow, the page splitting, deskewing and cropping would throw the text positions out of alignment with the page image. Here's what you could do:

1. Export the PDF as a bunch of PNG images. I would advise against JPG because it could make compression artefacts stand out more, which could result in grain or fuzzy text after Scan Tailor.
2. Run them through Scan Tailor.
3. If you really care about the ability to search, highlight text or copy-paste (for a dictionary look-up, maybe?), you could re-apply "good enough" OCR with ABBYY FineReader; alternatively you could look for some GUI based on tesseract.
4. Export as PDF.
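
If you'd rather script the export and OCR parts, something along these lines might work. It's only a rough sketch and it assumes things I haven't mentioned above: that you have pdftoppm (from poppler-utils) and the tesseract command-line tool installed, and that Scan Tailor left its cleaned pages as .tif files in an output folder. The function names and folder names are made up for the example. Scan Tailor itself is still run by hand in between.

```python
import subprocess
from pathlib import Path
from typing import List


def export_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> List[str]:
    # pdftoppm writes one PNG per page, named from out_prefix plus a page number
    return ["pdftoppm", "-png", "-r", str(dpi), pdf, out_prefix]


def ocr_cmd(image: str, out_base: str, lang: str = "eng") -> List[str]:
    # the trailing "pdf" config asks tesseract for out_base.pdf with a text layer
    return ["tesseract", image, out_base, "-l", lang, "pdf"]


def export_pages(pdf: str, raw_dir: str) -> None:
    # render every page of the PDF to PNG files inside raw_dir
    Path(raw_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(export_cmd(pdf, str(Path(raw_dir) / "page")), check=True)


def ocr_pages(cleaned_dir: str, pdf_dir: str, pattern: str = "*.tif") -> None:
    # OCR each cleaned page (assumed to be Scan Tailor output) into its own PDF
    Path(pdf_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(cleaned_dir).glob(pattern)):
        out_base = str(Path(pdf_dir) / image.stem)
        subprocess.run(ocr_cmd(str(image), out_base), check=True)
```

You would call export_pages("book.pdf", "raw"), run the "raw" folder through Scan Tailor, then call ocr_pages("raw/out", "pdf"). That leaves one searchable PDF per page, so you'd still need to merge them with whatever PDF tool you like.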

Of course, you could go all the way and proofread the OCR either in FineReader or side-by-side, save as .docx or .rtf, track down the fonts, vectorize the cover and any other graphics, do the layout in Word or InDesign and proofread the final product again. This will result in a much smaller file of substantially better quality. It does take time, yes, but it's a pleasure to read such a book.
