02-13-2013, 06:10 PM | #1 |
Member
Posts: 10
Karma: 10
Join Date: May 2012
Device: kindle
|
Help optimizing scanned PDF
Does anyone have experience with scanning a book and then optimizing it as a pdf file?
Since I'm fortunate to have an undergraduate schlepper at work who does photocopies for us underpaid professors, I decided to try an ebook experiment today. I had her xerox an entire book. Our Xerox machine's sheet feeder will then allow you to email an entire set of pages to yourself in various formats. The book was 51 (double column) pages. I find that selecting "compact pdf" results in a file that's not too large but fully readable. So the resultant document is 2MB.
I decided to run it through OCR software (Nitro 7) so I could have a document with searchable text. There are few images in the book and none on the pages that contain text. Here's where it starts to get confusing. I used the default setting, "searchable text image". I ended up with a 60MB file. And I don't understand why. Why is it 30x larger than the original? I then tried the alternative setting, "editable text". The resulting document looked the same except the few images and some artifacts were removed. But the file was still 7MB, considerably larger than the original.
The only other thing I played around with was the "optimize pdf" feature, using the 7MB file. I removed the embedded fonts. I ended up with a 460KB file that, near as I can tell, looks the same. I understand in principle what embedding fonts means (so that the doc will look exactly the same on all machines), but the book has few distinct fonts in it.
So I'm a bit perplexed at how best to optimize a pdf. I want to keep the file sizes small, but I don't want to lose legibility. I see myself doing this a great deal in the future with books I get from Inter-library loan. It's far easier for my research to have them electronically, to read and annotate on my iPad. And having the text searchable is a major asset. The Nitro user guide is less than helpful. These aren't chemistry or economics textbooks, so there aren't flowcharts, pie graphs and what have you. They're mostly text. |
02-14-2013, 03:59 AM | #2 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
My advice is NOT to scan them directly as PDF, but as images (preferably TIFF or PNG). Then run the images through Scan Tailor and then through ABBYY FineReader if you want some search functionality. This is the "quick and dirty" way.
The "slow and of-an-exceptionally-high-quality way" would be to OCR it with ABBYY FineReader Professional, proofread the entire thing, process the graphics, track down the fonts, redo the layout in Word or InDesign, export as PDF (and maybe tweak a few things in Acrobat), and proofread the final product again. Not many people are willing to spend the time and effort on this, but the result is of very high quality. It's always a pleasure to read such a book. But first make sure that it's worth it, and that it's not already available as an e-book.
Open-source alternatives: LibreOffice, GIMP, Scribus.
Last edited by DSpider; 02-14-2013 at 05:09 AM. |
02-14-2013, 03:59 AM | #3 | ||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Are these clean scans? Or are there lots of speckles, page edges, and scanning artifacts? Is this just for your own usage, or for others? (If it is only for your own usage, cleaning up the PDF won't really matter as long as you are fine with the quality.) Quote:
Quote:
There are other OCR programs out there. Here is a list of them on Wikipedia: https://en.wikipedia.org/wiki/Compar...ition_software I personally use ABBYY Finereader. Quote:
Or are you just trying to output the OCRed text/images only? |
||||
02-14-2013, 08:20 AM | #4 |
Member
Posts: 10
Karma: 10
Join Date: May 2012
Device: kindle
|
Thanks!
It's for my own usage. So I don't really care if it doesn't look pretty and if the scan is, let's say, 95-98% accurate. I could deal with the occasional typo, and I won't throw away the Xerox in case there is something I need to check it against when reading. Ultimately, my purpose here is convenience and time saving.
[You can skip this part as it's not about pdf optimization]: They are academic (humanities) books I use in my research. My usual process for using a book in research is:
-Read the book and make little annotations near the relevant parts
-Xerox only those pages that contain what I may need to quote and cite when writing
-Scan them in to the PC as jpegs (or as a pdf)
-Take notes in MS Word on the book, including brief summaries about the specific passages I may need to cite and where to find them in the book.
-If I saved them as jpegs then each jpeg will bear the name of its page number.
I'm sure this sounds tedious to you, but trust me, when it came time to write my dissertation (2006), having all my material scanned into the computer (and having two monitors) made life considerably easier. No stacks of papers spread out all over my floor; no serious time wasted transcribing hundreds of quotes, half of which I didn't end up using; and having all my material stored on a flash drive so I could write wherever I was. Obviously reading ebooks (as pdfs) on my iPad eliminates many of these steps. And it is so with the books I am able to find as ebooks. So it occurred to me to experiment by scanning one in its entirety.
Some things to note:
-It's the xerox machine that sends it as a "compact pdf". It's one of the settings. What it does exactly I have no idea, but an otherwise 10-15MB file becomes less than 2MB if I select compact. I can see no difference in the results. And I had no trouble running the compact pdf through OCR.
So my goal here is to (1) save time and make things more convenient (2) not end up with massive files (3) without sacrificing (or rather risking) reliability. Near as I can tell, it's the embedded fonts in Nitro that are adding the bloat. How else to explain 500KB instead of 7MB? 500KB sounds like a normal size for an ebook. But can someone explain the differences between "searchable text image" and "editable text", and what is at stake in choosing one over the other? And whether removing embedded fonts matters or not? Again, this is all for myself. I'm not trying to create a pirated ebook to circulate. But I do need to be confident that it will look OK on multiple PCs and on future versions of Windows and iOS and what have you. I know a jpeg will never be an issue. But with these PDFs, I have no idea. It's about saving me time without wasting too much hdd space. |
02-15-2013, 01:04 AM | #5 | ||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
You don't want to save scanned documents as jpg. JPG is a lossy format, and is pretty atrocious on text documents. Since the Xerox already outputs as PDF, I would recommend that. Other formats that can be used for the original scans are any of the lossless image formats such as PNG or TIFF.
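To make the lossy/lossless distinction concrete, here is a minimal sketch using only the Python standard library. The pixel values are made up, and the "lossy" step is a crude stand-in (dropping the low bits of each pixel), not the real JPEG algorithm, which quantizes DCT coefficients. The lossless half uses DEFLATE, the compression used inside PNG.

```python
import zlib

# One hypothetical row of 8-bit grayscale pixels crossing a sharp text edge.
row = bytes([250, 248, 251, 30, 25, 28, 249, 252])

# Lossless: compressing and decompressing restores every byte exactly.
restored = zlib.decompress(zlib.compress(row))

# Lossy stand-in: keep only the top 3 bits of each pixel. The discarded
# detail cannot be recovered afterwards, whatever format you convert to.
quantized = bytes(value & 0b11100000 for value in row)

print("lossless round trip exact:", restored == row)
print("lossy round trip exact:", quantized == row)
print("per-pixel error after lossy step:", [a - b for a, b in zip(row, quantized)])
```

The point is only that a lossless format gives back the scan bit for bit, while a lossy one throws information away at save time.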
Quote:
Quote:
Quote:
Quote:
http://nitropdf.helpmax.net/en/tasks...-existing-pdf/ Well since this is only for your own personal usage, and if you are ok without the embedded fonts.... then remove the fonts for much smaller files. Oh yes it will be! Soon your eyes will go bad and you will want to zoom in to the text and all you will see is hideous pixelated blobs. I don't see PDFs disappearing any time soon. |
||||
02-16-2013, 08:29 AM | #6 |
Member
Posts: 10
Karma: 10
Join Date: May 2012
Device: kindle
|
Thanks.
The jpegs are actually quite useful once I've gone through the process of naming them according to the page number. Very easy to track down a quote because my notes reference a page number. And not having it OCRed in such cases isn't the biggest deal. The number of quotes I end up using is far smaller than the number I highlight while writing. But I would never want to have to read a whole book in such a manner. As for the embedded fonts and their removal: I'm just imagining a hypothetical situation where, many years down the road, I'm on a different operating system and there have been some major technological changes. And I open up this PDF and the document won't render because some of the fonts or whatever no longer exist and I can't read it. The few times I've opened up a Word doc in Pages on my iPad I've seen problems. And every time I try to open up something I wrote as an undergrad back in the early 90s (using MS Write or WordPerfect) the files are messed up. So a jpeg may be lossy and hard on the eyes, but I know it will always look exactly the same regardless of the environment. |
02-16-2013, 12:11 PM | #7 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
If you're going the "image-only" route, at least process them with Scan Tailor and archive them as ZIP or RAR. Another format you can count on for long-term compatibility is HTML, which is easy to edit and easy to convert, but you will need to proofread the whole thing (the OCR content) against the scanned images at least once.
|
02-16-2013, 02:40 PM | #8 |
Addict
Posts: 294
Karma: 1196776
Join Date: Nov 2008
Location: Bulgaria
Device: Kindle 4 NT, Onyx Boox M92
|
Instead of Nitro, try PDF-XChange Viewer. From the menu choose Document -> OCR Pages, and when the dialog shows up make sure to set "PDF output type" to "Preserve original content & add text layer". After the job is done, just save the file and you will have a slightly bigger pdf file. Note that if you choose the other option for "PDF output type", the file size increases significantly.
|
02-21-2013, 08:44 AM | #9 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
"Compact PDF" on K-M copiers
I was amazed at how small the "Compact PDF" files come out on Konica-Minolta copiers, so I checked them out a little. A normal PDF output on Konica-Minoltas uses a JPEG-embedded (I think) PDF with typical quality settings for readability and several bits per color component, but "Compact PDF" compresses even further by dropping the bits-per-pixel considerably. I noticed that for one otherwise black-and-white document with red markings, "Compact PDF" stored the PDF in two layers (a black-and-white layer and a red layer), each one with very few bits per color.
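A rough sketch of what such a two-layer split could look like, assuming the copier classifies each pixel as paper, black ink, or red ink and keeps one 1-bit mask per ink color. The function name, thresholds, and sample pixels are all hypothetical; the actual Konica-Minolta algorithm is not documented in this thread.

```python
def split_layers(pixels, dark=96, red_margin=80):
    """Split rows of (r, g, b) pixels into a black mask and a red mask.

    Both thresholds are illustrative guesses, not values from any copier.
    """
    black_layer, red_layer = [], []
    for row in pixels:
        black_row, red_row = [], []
        for r, g, b in row:
            # Red ink: the red channel clearly dominates the other two.
            is_red = r - max(g, b) >= red_margin
            # Black ink: every channel is dark and the pixel is not red.
            is_black = (not is_red) and max(r, g, b) <= dark
            black_row.append(1 if is_black else 0)
            red_row.append(1 if is_red else 0)
        black_layer.append(black_row)
        red_layer.append(red_row)
    return black_layer, red_layer


WHITE, BLACK, RED = (255, 255, 255), (20, 20, 20), (210, 30, 40)
page = [
    [WHITE, BLACK, BLACK, WHITE],
    [WHITE, RED, WHITE, WHITE],
]
black, red = split_layers(page)
print(black)
print(red)
```

Each mask needs only one bit per pixel, instead of 24 bits for full color, which fits the "very few bits per color" observation.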
|
02-21-2013, 11:05 AM | #10 |
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
|
If you could provide me a link to your 2 MB book, I could do OCR in Adobe Acrobat, ABBYY FineReader, etc. and tell you the difference between "editable text" and "searchable text image" in those applications, which I usually use for PDF optimization.
Last edited by markom; 02-21-2013 at 11:13 AM. |
02-21-2013, 06:49 PM | #11 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Editable: Allows you to save as RTF, DOC, DOCX, ODT
Formatted: RTF, DOC, DOCX, ODT, XLS, XLSX, TXT, HTML, FB2, EPUB
When saving as a PDF though, you have multiple ways of doing it. If you go into the Settings, you are able to choose "Save Mode":
Here are comparisons of the book I am currently working on (Finereader 11):
Original (13.7 MB PDF). I assume this version was just fed through some Adobe OCR built into a scanner: http://library.mises.org/books/Lione...evelopment.pdf
Text Under The Page (7.34 MB PDF): http://www.mediafire.com/view/?9d2ft4bunocnkou
Text/Picture Only (802 KB PDF): http://www.mediafire.com/view/?mue0znuycy55l9i
Text/Picture Only, No Embedded Fonts (591 KB PDF): http://www.mediafire.com/view/?drb02hcjmxuwvbd
I decided to pick RTF since it can be saved both in "Editable" and "Formatted". Here is an image comparing the Formatted/Editable output from Finereader: http://www.imagebam.com/image/5d1a68238610245
Formatted RTF (1.30 MB RTF): http://www.mediafire.com/view/?ce0jd2p1rzx0ibf
Editable RTF (1.32 MB RTF): http://www.mediafire.com/view/?b42v1jk946uih76
In my testing between Adobe/Finereader, Finereader makes much smaller file sizes AND has more accurate OCR. In the original poster's case, I would still stick with my usual recommendation of keeping the original scan as the front end and having the OCRed text in the back end.
Quote:
I personally just work with already scanned (mostly black and white) non-fiction books. Since there are only two colors, black and white, you can imagine that they compress quite well.
Back to the OCR of documents: the auto-OCR on these scanners is OK (from what I have seen, many of these are based on some sort of Adobe program), but if you look at the text, you can always see the typical OCR errors. I feel that an outside program (I use Finereader) will give you a much more accurate OCR than those that come bundled with the scanner. In my mind, a more accurate OCR = closer to the original book = a much more enjoyable reading experience.
My work is to convert the books into digital form (EPUB), so I need a nearly 100% correct conversion... and while I am at it, I can toss out that auto-OCRed stuff and make a nearly 100% accurate PDF text backend as well. On top of that, Finereader seems to have even better ways of making the PDFs smaller than those scanners. So I just see it as win-win-win-win-win.
Last edited by Tex2002ans; 02-21-2013 at 06:53 PM. |
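For anyone who wants the Finereader size comparison from this post as percentages, here is a small sketch. The four sizes are the ones listed in the post; treating 1 MB as 1024 KB is an assumption, and the function name is made up for illustration.

```python
# File sizes quoted in the post, converted to KB (assuming 1 MB = 1024 KB).
SIZES_KB = {
    "Original": 13.7 * 1024,
    "Text Under The Page": 7.34 * 1024,
    "Text/Picture Only": 802,
    "Text/Picture Only, No Embedded Fonts": 591,
}


def percent_of_original(sizes, baseline="Original"):
    """Return each size as a percentage of the baseline, to one decimal."""
    base = sizes[baseline]
    return {name: round(100 * size / base, 1) for name, size in sizes.items()}


result = percent_of_original(SIZES_KB)
for name, pct in result.items():
    print(f"{name}: {pct}% of the original")
```

On these numbers, keeping the page image with text underneath roughly halves the file, while dropping the page image entirely brings it down to a few percent of the original.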
||
02-22-2013, 06:17 AM | #12 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
"Black and white" can be different for many people... Some refer to it this way, even though it's actually grayscale, while other refer to it this way, even though it's actually a 1 bit image (black)! 1 bit images compress better. It's the same output that you would get from processing them with Scan Tailor.
|
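A quick way to see the grayscale versus 1-bit difference from the post above, sketched with the Python standard library only. The "page" is synthetic (noisy paper with regular dark marks), the threshold is arbitrary, and zlib stands in for the bilevel codecs that PDFs typically use for 1-bit images (such as CCITT Group 4 or JBIG2), so the exact numbers mean little; only the direction of the result matters.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256


def make_gray_page(seed=0):
    """Build a synthetic 8-bit scan: noisy light paper with noisy dark marks."""
    rng = random.Random(seed)
    rows = []
    for y in range(HEIGHT):
        row = bytearray()
        for x in range(WIDTH):
            in_text = (y % 16 < 6) and (x % 8 < 5)
            base = 40 if in_text else 225
            row.append(base + rng.randint(-20, 20))  # scanner noise
        rows.append(bytes(row))
    return rows


def to_one_bit(rows, threshold=128):
    """Threshold to pure black/white and pack 8 pixels into each byte."""
    packed = bytearray()
    for row in rows:
        for i in range(0, len(row), 8):
            byte = 0
            for value in row[i:i + 8]:
                byte = (byte << 1) | (1 if value < threshold else 0)
            packed.append(byte)
    return bytes(packed)


gray_rows = make_gray_page()
gray_bytes = b"".join(gray_rows)
bilevel_bytes = to_one_bit(gray_rows)

gray_size = len(zlib.compress(gray_bytes, 9))
bilevel_size = len(zlib.compress(bilevel_bytes, 9))
print(f"8-bit grayscale: {len(gray_bytes)} raw bytes, {gray_size} compressed")
print(f"1-bit bilevel: {len(bilevel_bytes)} raw bytes, {bilevel_size} compressed")
```

Thresholding removes the scanner noise along with the gray levels, which is why the 1-bit version compresses so much better than the eightfold drop in raw size alone would suggest.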
02-22-2013, 07:57 AM | #13 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
DIY scanning
Of course, if you want to get really serious about book scanning, you can make yourself one of these.
|