Old 01-23-2012, 10:43 AM   #2
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
TIFF files can't carry a text layer underneath the image, so you will lose the OCR (text) if you export the PDF as a bunch of images. Even if you could keep it somehow, the page splitting, deskewing and cropping would throw off the positions of the OCR text. Here's what you could do:
  • export the PDF as a bunch of PNG images - I would advise against JPG because its lossy compression adds artefacts, which can show up as grain or fuzzy text after Scan Tailor
  • run them through Scan Tailor
  • if you really care about being able to search, highlight or copy-paste text (for a dictionary look-up, maybe?), you could re-apply "good enough" OCR with ABBYY FineReader; alternatively, you could look for a GUI front end for Tesseract
  • export as PDF
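
If you'd rather script the steps above than click through a GUI, here is a rough sketch in Python. It is not the only way to do it and it rests on a few assumptions that are mine, not part of the workflow above: that you have the poppler tools (pdftoppm, pdfunite) and the tesseract command-line program installed, that Scan Tailor wrote its cleaned pages as .tif files into one folder, and that the file names sort in page order. The function names are made up for illustration. The Scan Tailor step itself stays manual.

```python
import subprocess
from pathlib import Path


def export_png_cmd(pdf, out_prefix, dpi=300):
    """Command to render each PDF page as a PNG (poppler's pdftoppm)."""
    # 300 dpi is an assumed default; use whatever matches the original scan.
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out_prefix)]


def ocr_cmd(image, lang="eng"):
    """Command to OCR one cleaned page into a searchable one-page PDF."""
    image = Path(image)
    # tesseract <image> <outputbase> -l <lang> pdf  writes <outputbase>.pdf
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def merge_cmd(page_pdfs, out_pdf):
    """Command to join the per-page PDFs into one file (poppler's pdfunite)."""
    return ["pdfunite", *map(str, page_pdfs), str(out_pdf)]


def ocr_and_merge(cleaned_dir, out_pdf, lang="eng"):
    """OCR every .tif in cleaned_dir (assumed Scan Tailor output), then merge."""
    images = sorted(Path(cleaned_dir).glob("*.tif"))
    for image in images:
        subprocess.run(ocr_cmd(image, lang), check=True)
    page_pdfs = [image.with_suffix(".pdf") for image in images]
    subprocess.run(merge_cmd(page_pdfs, out_pdf), check=True)
```

You would run the command from export_png_cmd("book.pdf", "page") first, push the PNGs through Scan Tailor by hand, and then call ocr_and_merge() on the folder Scan Tailor wrote to. The OCR runs on the cleaned pages, so the text positions match the final images.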

Of course, you could go all the way: proofread the OCR either in FineReader or side by side with the scan, save as .docx or .rtf, track down the fonts, vectorize the cover and any other graphics, do the layout in Word or InDesign, and proofread the final product again. This will result in a much smaller file of substantially better quality. It does take time, yes, but it's a pleasure to read such a book.