MobileRead Forums - View Single Post

hobnail · 01-30-2020, 04:01 PM

Quote:

Originally Posted by FrustratedReader

Archive.org is terrible. Usually no proofing. I don't bother downloading epub/mobi if it's only their own OCR of the pdf. If it's from Microsoft or Google Books scan, then PDF is best.

I don't know if everyone else knows this, but PDFs can have this clever thing (to me anyway) where there are two layers. The visible layer is the scanned page image you see when you open the file in a PDF viewer, and the invisible layer is the OCR'd text. You can tell whether a PDF has the OCR'd text by clicking and dragging your mouse over the page: if it selects text, the layer is there, though the selection probably won't line up exactly with the visible text. If your PDF viewer supports it, you can open the PDF, do a Save As, and pick text as the output format to save the OCR'd text. It's very likely the exact same text you get when you download the .txt file from archive.org, but it can be helpful to have that text at hand while you're looking at the scanned image of the page.
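If you'd rather script it than go through Save As in a viewer, here's a rough sketch of the same idea in Python. It assumes you have Poppler's pdftotext command-line tool installed and on your PATH (that's my assumption, not something your viewer needs). The function names, the file names, and the 20-character threshold are all made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Build the argument list for Poppler's pdftotext CLI.

    -f and -l are pdftotext's first-page and last-page options.
    """
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def looks_like_text_layer(text, min_chars=20):
    """Rough heuristic: an image-only PDF yields little besides whitespace.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return len("".join(text.split())) >= min_chars


def extract_text_layer(pdf_path, txt_path=None, first_page=None, last_page=None):
    """Dump the PDF's text layer to a .txt file and return its contents."""
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler's utilities first")
    subprocess.run(
        pdftotext_command(pdf_path, txt_path, first_page, last_page),
        check=True,
    )
    return txt_path.read_text(encoding="utf-8", errors="replace")


# Example use (hypothetical file name):
#   text = extract_text_layer("scanned_book.pdf", first_page=1, last_page=5)
#   print("has OCR layer" if looks_like_text_layer(text) else "image only?")
```

Checking just the first few pages is usually enough to tell whether there's an OCR layer at all, and it's quicker on a big scan. Keep in mind that a scan's first pages may be blank or cover images, so a negative result on a small page range doesn't prove much.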