Originally Posted by markom
[...] Abbyy Finereader etc. and tell you the difference between editable text and searchable text image in those applications that I usually use for PDF optimization.
In Finereader 11:
Editable: Allows you to save as RTF, DOC, DOCX, ODT
Formatted: RTF, DOC, DOCX, ODT, XLS, XLSX, TXT, HTML, FB2, EPUB
When saving as a PDF, though, you have several options. In the Settings, you can choose a "Save Mode":
- Text and Pictures Only
  - This saves only the OCRed text and tries to keep to the spirit of the original layout (sometimes you see a glitched line or two that flies off the page).
- Text Over the Page Image
  - I have seen this mode break in a few PDF readers; most assume that the text sits in the invisible backend, not in front of the scan.
- Text Under the Page Image
  - I recommend this one, so you have the original scan as well.
  - You will be able to read the original scanned document (and do any further work on it in the future).
  - For example, if a new, even more accurate OCR program comes out, you will be able to feed it this PDF.
  - You can still search the document and copy/paste perfectly fine.
Here are comparisons of the book I am currently working on (Finereader 11):
Original (13.7 MB PDF). I assume this version was just fed through some Adobe OCR built into a scanner:
Text Under The Page (7.34 MB PDF):
Text/Picture Only (802 KB PDF):
Text/Picture Only, No Embedded Fonts (591 KB PDF):
I picked RTF for the comparison since it can be saved as both "Editable" and "Formatted".
Here is an image comparing the Formatted/Editable output from Finereader:
Formatted RTF (1.30 MB RTF):
Editable RTF (1.32 MB RTF):
In my testing of Adobe versus Finereader, Finereader produces much smaller files AND more accurate OCR.
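To put the PDF sizes listed above in proportion, here is a quick Python sketch that expresses each Finereader output as a percentage of the original 13.7 MB file. The only numbers in it are the ones from this post; the helper name `percent_of_original` is made up for illustration, and I am assuming 1 MB = 1024 KB (the forum sizes do not say which convention was used).

```python
# File sizes quoted in this post. KB values are converted to MB
# assuming 1 MB = 1024 KB (an assumption, not stated in the post).
ORIGINAL_MB = 13.7

SIZES_MB = {
    "Text Under The Page": 7.34,
    "Text/Picture Only": 802 / 1024,
    "Text/Picture Only, No Embedded Fonts": 591 / 1024,
}


def percent_of_original(size_mb, original_mb=ORIGINAL_MB):
    """Return size_mb as a percentage of original_mb (hypothetical helper)."""
    return 100.0 * size_mb / original_mb


if __name__ == "__main__":
    for name, size_mb in SIZES_MB.items():
        pct = percent_of_original(size_mb)
        print(f"{name}: {size_mb:.2f} MB, {pct:.1f}% of the original")
```

Rounded, that works out to roughly 54% for "Text Under The Page", about 6% for "Text/Picture Only", and about 4% with the fonts left out. This is only one book, so treat the ratios as an illustration rather than a benchmark.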
In the original poster's case, I would still stick with my usual recommendation: keep the original scan as the frontend, and have the OCRed text in the backend.
Originally Posted by willus
I noticed with one document where I had red markings on an otherwise black-and-white document, "Compact PDF" stored the PDF in two layers--a black-and-white layer and a red layer, each one with very few bits per color.
That sounds like they do a fantastic job of making PDFs much smaller. I assume all of these scanners have their own tiny proprietary tweaks to get their scanned PDFs smaller. Chopping out unused colors is one way to get the filesize way down. The book doesn't have all of the colors in the rainbow!
I personally just work with already-scanned (mostly black-and-white) non-fiction books. Since there are only two colors, black and white, you can imagine that they compress quite well.
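If you want to see why two colors compress so well, here is a rough Python sketch using only the standard library. Everything in it is an assumption for illustration: the page is synthetic (dark bands standing in for text, plus random noise standing in for scanner noise), the function names are made up, and zlib is just a stand-in compressor. It is not how Finereader or any scanner actually does it.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256  # hypothetical page size in pixels


def make_gray_page(seed=0):
    """Synthetic 8-bit grayscale 'scan': dark text bands on light paper, plus noise."""
    rng = random.Random(seed)
    pixels = bytearray()
    for y in range(HEIGHT):
        for x in range(WIDTH):
            is_ink = (y % 16 < 6) and (x % 8 < 5)  # crude stand-in for lines of text
            base = 30 if is_ink else 225
            pixels.append(base + rng.randint(-8, 8))  # simulated sensor noise
    return bytes(pixels)


def to_bilevel(gray, threshold=128):
    """Threshold to pure black/white and pack 8 pixels into each byte."""
    packed = bytearray()
    for i in range(0, len(gray), 8):
        byte = 0
        for value in gray[i:i + 8]:
            byte = (byte << 1) | (1 if value < threshold else 0)
        packed.append(byte)
    return bytes(packed)


if __name__ == "__main__":
    gray = make_gray_page()
    bilevel = to_bilevel(gray)
    print("grayscale raw:       ", len(gray), "bytes")
    print("grayscale compressed:", len(zlib.compress(gray)), "bytes")
    print("bilevel raw:         ", len(bilevel), "bytes")
    print("bilevel compressed:  ", len(zlib.compress(bilevel)), "bytes")
```

The black-and-white version starts out at one eighth of the raw size (1 bit per pixel instead of 8), and once the noise is thresholded away the compressor has very little left to encode. The same idea is behind throwing out unused colors in a mostly two-color document.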
Back to OCR: the auto-OCR on these scanners is OK (from what I have seen, many of them are based on some sort of Adobe program), but if you look at the text, you can always spot the typical OCR errors.
I feel that an outside program (I use Finereader) will give you a much more accurate OCR than the ones bundled with the scanner. In my mind, more accurate OCR = closer to the original book = a much more enjoyable reading experience.
My work is to convert the books into digital form (EPUB), so I need a nearly 100% correct conversion... and while I am at it, I can toss out that auto-OCRed stuff, and make a nearly 100% accurate PDF text backend as well.
On top of that, Finereader seems to have even better ways of making the PDFs smaller than those scanners. So I just see it as win-win-win-win-win.