Help optimizing scanned PDF
Does anyone have experience with scanning a book and then optimizing it as a pdf file?
Since I'm fortunate to have an undergraduate schlepper at work who does photocopies for us underpaid professors, I decided to try an ebook experiment today.
I had her xerox an entire book. Our Xerox machine's sheet feeder will then allow you to email an entire set of pages to yourself in various formats. The book was 51 (double column) pages. I find that selecting "compact pdf" results in a file that's not too large but fully readable.
So the resultant document is 2MB. I decided to run it through OCR software (Nitro 7) so I could have a document with searchable text.
There are few images in the book and none on the pages that contain text.
Here's where it starts to get confusing.
I used the default setting, "searchable text image". I ended up with a 60MB file. And I don't understand why. Why is it 30x larger than the original?
I then tried the alternative setting - "editable text". The resulting document looked the same except the few images and some artifacts were removed. But the file was still 7MB, considerably larger than the original.
The only other thing I played around with was the "optimize pdf" feature, using the 7MB file. I removed the embedded fonts. I ended up with a 460KB file that, near as I can tell, looks the same.
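To put the numbers side by side, here is a rough Python sketch of the size ratios. The sizes are the rounded figures I quoted above, and I'm assuming decimal units (1 MB = 1,000,000 bytes); the function name and labels are just ones I made up for illustration.

```python
def size_ratio(new_bytes: int, original_bytes: int) -> float:
    """Return how many times the size of the original the new file is."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return new_bytes / original_bytes


# Assumption: decimal units, and the rounded sizes reported in this post.
KB = 1_000
MB = 1_000_000

original = 2 * MB  # the "compact pdf" scan from the copier
variants = {
    "searchable text image": 60 * MB,
    "editable text": 7 * MB,
    "editable text, embedded fonts removed": 460 * KB,
}

for name, size in variants.items():
    print(f"{name}: {size_ratio(size, original):.2f}x the original scan")
```

So only the last variant actually comes out smaller than the scan I started with.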
I understand in principle what embedding fonts means - so that the doc will look exactly the same on all machines - but the book has few distinct fonts in it.
So I'm a bit perplexed about how best to optimize a PDF. I want to keep the file sizes small, but I don't want to lose legibility. I see myself doing this a great deal in the future with books I get from interlibrary loan. It's far easier for my research to have them electronically, to read and annotate on my iPad. And having the text searchable is a major asset.
The Nitro user guide is less than helpful.
These aren't chemistry or economics textbooks, so there aren't flowcharts, pie graphs and what have you. They're mostly text.