Old 06-21-2012, 02:14 AM   #2
tomsem
Wizard
Posts: 2,478
Karma: 2534423
Join Date: Apr 2009
Location: USA
Device: Kindle PW, iPad Air 2, Fire HD6
Scanned-and-OCR'd PDFs can be a real mess, and I'd put this one in that category (not that Google's are any better). I don't have my Fire handy, but I did some clean-up in Acrobat and loaded the result successfully on my Kindle Touch. (My first attempt, with just some cropping and 'reduce file size' applied, failed.) Even so, there were issues with text selection, and the grey background (the scan is in color, with yellowish paper) did not help. To improve on it, I think I would export the pages as images (preserving whatever resolution is there), do some image processing to clean them up (remove the colored background), and then run a fresh OCR with something that works well (e.g. Acrobat).
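
For what it's worth, the "remove the colored background" step could look something like the rough sketch below. This is not what I did in Acrobat, just an illustration of the idea: convert each pixel to grey and push anything light enough (the yellowish paper) to pure white, leaving the dark ink alone. The function names, the sample pixel values and the threshold of 160 are all made up for the example, and a real scan would need the threshold tuned by eye.

```python
def to_gray(pixel):
    """Approximate luminance of an (R, G, B) pixel using ITU-R BT.601 weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def whiten_background(page, threshold=160):
    """Return a grayscale copy of `page` with light pixels forced to white.

    `page` is a list of rows, each row a list of (R, G, B) tuples.
    `threshold` is a hypothetical cut-off (assumption); tune it per scan.
    """
    cleaned = []
    for row in page:
        new_row = []
        for pixel in row:
            gray = to_gray(pixel)
            # Anything at least as bright as the threshold is treated as paper.
            new_row.append(255 if gray >= threshold else gray)
        cleaned.append(new_row)
    return cleaned


# Tiny 2x2 stand-in for a page: yellowish paper and dark ink (made-up values).
paper = (235, 222, 170)
ink = (40, 35, 30)
sample = [[paper, ink], [ink, paper]]
result = whiten_background(sample)
```

In practice you would read the exported page images with an imaging library rather than hand-built lists, and then feed the cleaned pages to whatever OCR you trust. A plain global threshold like this will struggle with uneven lighting or show-through from the reverse side of the page, so treat it as a starting point only.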

I blame archive.org's workflow. It is really too bad: you only want to do this sort of thing once, so they should do it right and get whatever tools are needed.

It would be nice to have bookmarks and a linked TOC as well. Probably no automation is up to that, but they could crowd-source the task.