Help optimizing scanned PDF

shmendrapolk · 02-13-2013, 06:10 PM

Does anyone have experience with scanning a book and then optimizing it as a pdf file?

Since I'm fortunate to have an undergraduate schlepper at work who does photocopies for us underpaid professors, I decided to try an ebook experiment today.

I had her xerox an entire book. Our Xerox machine's sheet feeder will then allow you to email an entire set of pages to yourself in various formats. The book was 51 (double column) pages. I find that selecting "compact pdf" results in a file that's not too large but fully readable.

So the resultant document is 2MB. I decided to run it through OCR software (Nitro 7) so I could have a document with searchable text.
There are few images in the book and none on the pages that contain text.

Here's where it starts to get confusing.

I used the default setting, "searchable text image". I ended up with a 60MB file. And I don't understand why. Why is it 30x larger than the original?

I then tried the alternative setting - "editable text". The resulting document looked the same except the few images and some artifacts were removed. But the file was still 7MB, considerably larger than the original.

The only other thing I played around with was the "optimize pdf" feature, using the 7MB file. I removed the embedded fonts. I ended up with a 460KB file that, near as I can tell, looks the same.

I understand in principle what embedding fonts means - so that the doc will look exactly the same on all machines - but the book has few distinct fonts in it.

So I'm a bit perplexed at how best to optimize a pdf. I want to keep the file sizes small, but I don't want to lose legibility. I see myself doing this a great deal in the future with books I get from Inter-library loan. It's far easier for my research to have them electronically, to read and annotate on my iPad. And having the text searchable is a major asset.

The Nitro user guide is less than helpful.

These aren't chemistry or economics textbooks, so there aren't flowcharts, pie graphs and what have you. They're mostly text.

DSpider · 02-14-2013, 03:59 AM

My advice is NOT to scan them directly as PDF, but as images (preferably TIFF or PNG). Then run the images through Scan Tailor and then through ABBYY FineReader if you want some search functionality. This is the "quick and dirty" way.

The "slow and of-an-exceptionally-high-quality way" would be to OCR it with ABBYY FineReader Professional, proofread the entire thing, process the graphics, track down the fonts, redo the layout in Word or InDesign, export as PDF (and maybe tweak a few things in Acrobat), and proofread the final product again. Not many people are willing to spend the time and effort for this, but the result is of very high quality. It's always a pleasure to read such a book. But first make sure that it's worth it, and that it's not already available as an e-book.

Open-source alternatives: LibreOffice, GIMP, Scribus.

Tex2002ans · 02-14-2013, 03:59 AM

Quote:

Originally Posted by shmendrapolk

Does anyone have experience with scanning a book and then optimizing it as a pdf file?

Mind giving a sample of what you are working with? An entire page, or a piece of a page (a page chopped in half horizontally), so we could see what we are working with exactly.

Are these clean scans? Or are there lots of speckles, page edges, and scanning artifacts?

Is this just for your own usage, or for others? (If only for your usage, cleaning up the PDF won't really matter if you are fine with the quality).

Quote:

Originally Posted by shmendrapolk

The book was 51 (double column) pages. I find that selecting "compact pdf" results in a file that's not too large but fully readable.

Where is this "Compact PDF" selection being chosen (on the Xerox, or in Nitro)?

Quote:

Originally Posted by shmendrapolk

I decided to run it through OCR software (Nitro 7) so I could have a document with searchable text.

.....

I used the default setting, "searchable text image". I ended up with a 60MB file. And I don't understand why. Why is it 30x larger than the original?

That is a problem with Nitro's output settings, which are creating extremely bloated documents.

There are other OCR programs out there. Here is a list of them on Wikipedia:

https://en.wikipedia.org/wiki/Compar...ition_software

I personally use ABBYY Finereader.

Quote:

Originally Posted by shmendrapolk

I then tried the alternative setting - "editable text". The resulting document looked the same except the few images and some artifacts were removed. But the file was still 7MB, considerably larger than the original.

Is your goal to have the original scan as the frontend, with a text backend?

Or are you just trying to output the OCRed text/images only?

shmendrapolk · 02-14-2013, 08:20 AM

Thanks!

It's for my own usage. So I don't really care if it doesn't look pretty and if the scan is, let's say, 95-98% accurate. I could deal with the occasional typo, and I won't throw away the Xerox in case there is something I need to check it against when reading.

Ultimately, my purpose here is convenience and time saving [You can skip this part as it's not about pdf optimization]:
They are academic (humanities) books I use in my research.
My usual process for using a book in research is:
-Read the book and make little annotations near the relevant parts
-Xerox only those pages that contain what I may need to quote and cite when writing
-Scan them in to the PC as jpegs (or as a pdf)
-Take notes in MS Word on the book including brief summaries about the specific passages I may need to cite and where to find them in the book.
-If I saved them as jpegs then each jpeg will bear the name of its page number.
I'm sure this sounds tedious to you, but trust me, when it came time to writing my dissertation (2006) having all my material scanned into the computer (and having two monitors) made life considerably easier. No stacks of papers spread out all over my floor; no serious time wasted transcribing hundreds of quotes, half of which I didn't end up using; and having all my material stored in a flash drive so I could write wherever I was.
Obviously reading ebooks (as pdfs) on my iPad eliminates many of these steps. And it is so with the books I am able to find as ebooks.

So it occurred to me to experiment by scanning one in its entirety.

Some things to note:
-It's the Xerox machine that sends it as a "compact pdf". It's one of the settings. What it does exactly I have no idea, but an otherwise 10-15MB file becomes less than 2MB if I select compact. I can see no difference in the results. And I had no trouble running the compact pdf through OCR.

So my goal here is to (1) save time and make things more convenient (2) not end up with massive files (3) without sacrificing (or rather risking) reliability.

Near as I can tell it's the embedded fonts in Nitro that are adding the bloat. How else to explain 500KB instead of 7MB?
500KB sounds like a normal size for an ebook.

But can someone explain the differences between "searchable text image," and "editable text" and what is at stake between choosing one over the other? And whether removing embedded fonts matters or not?

Again, this is all for myself. I'm not trying to create a pirated ebook to circulate. But I do need to be confident that it will look OK on multiple PCs and on future versions of Windows and iOS and what have you. I know a jpeg will never be an issue. But with these PDFs, I have no idea.
It's about saving me time without wasting too much hdd space.

Tex2002ans · 02-15-2013, 01:04 AM

Quote:

Originally Posted by shmendrapolk

Scan them in to the PC as jpegs (or as a pdf)

You don't want to save scanned documents as jpg. JPG is a lossy format, and is pretty atrocious on text documents. Since the Xerox already outputs as PDF, I would recommend that. Other formats that can be used for the original scans are any of the lossless image formats such as PNG or TIFF.

Quote:

Originally Posted by shmendrapolk

I'm sure this sounds tedious to you, but trust me, when it came time to writing my dissertation (2006) having all my material scanned into the computer (and having two monitors) made life considerably easier.

No way! I understand completely. Digital files that are properly OCRed are much easier to use than physical books. Searching through documents/entire books is a breeze! So many times with the physical book I got stuck on "well I remember him mentioning something about topic X... now which page was that in the book?"

Quote:

Originally Posted by shmendrapolk

no serious time wasted transcribing hundreds of quotes, half of which I didn't end up using; and having all my material stored in a flash drive so I could write wherever I was.

Being able to copy and paste alone probably saves massive amounts of time. So boring having to type out a paragraph or two of text out of a physical book!

Quote:

Originally Posted by shmendrapolk

Some things to note:
-It's the Xerox machine that sends it as a "compact pdf". It's one of the settings. What it does exactly I have no idea, but an otherwise 10-15MB file becomes less than 2MB if I select compact. I can see no difference in the results. And I had no trouble running the compact pdf through OCR.

What "compact PDF" most likely does is just run some lossless compression on the scans, resulting in no loss in quality, while exporting as a "normal PDF" would export the uncompressed image files.
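To make "lossless" concrete, here is a minimal Python sketch. It does not claim to reproduce what the Xerox does; it only runs zlib (Deflate, the same family of compression that PDF exposes as FlateDecode) over a made-up, mostly white "page" and checks that the bytes shrink and then come back bit-for-bit identical. The page contents and dimensions are assumptions chosen purely for illustration.

```python
import zlib

# Hypothetical stand-in for a scanned page: 1000 rows of 1000 bytes,
# mostly white (255) with a run of black (0) "text" in each row.
row = bytes([255]) * 900 + bytes([0]) * 100
page = row * 1000

# Compress at the highest zlib level, then decompress again.
compressed = zlib.compress(page, 9)
restored = zlib.decompress(compressed)

print(len(page), "bytes raw")
print(len(compressed), "bytes compressed")
print("identical after round trip:", restored == page)
```

Real scans contain noise and will not shrink nearly as much as this synthetic page, but the round-trip property is the point: nothing is thrown away.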

Quote:

Originally Posted by shmendrapolk

But can someone explain the differences between "searchable text image," and "editable text" and what is at stake between choosing one over the other?

I cannot make one bit of sense out of the documentation (I see what you mean by "not being very helpful"):

http://nitropdf.helpmax.net/en/tasks...-existing-pdf/

Quote:

Originally Posted by shmendrapolk

And whether removing embedded fonts matters or not

Well since this is only for your own personal usage, and if you are ok without the embedded fonts.... then remove the fonts for much smaller files.

Quote:

Originally Posted by shmendrapolk

I know a jpeg will never be an issue.

Oh yes it will be! Soon your eyes will go bad and you will want to zoom in to the text and all you will see is hideous pixelated blobs.

Quote:

Originally Posted by shmendrapolk

But with these PDFs, I have no idea.

I don't see PDFs disappearing any time soon.

shmendrapolk · 02-16-2013, 08:29 AM

Thanks.
The jpegs are actually quite useful once I've gone through the process of naming them according to the page number. Very easy to track down a quote because my notes reference a page number.
And not having it OCRed in such cases isn't the biggest deal. The number of quotes I end up using is far smaller than the number I highlight while writing.
But I would never want to have to read a whole book in such a manner.
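For anyone who wants to automate the page-number naming, a rough Python sketch follows. Everything in it is an assumption for illustration: the `rename_by_page` helper, the `p0012.jpg` naming pattern, and the idea that you type in the list of book page numbers in the same order the scans sort in.

```python
from pathlib import Path

def rename_by_page(scan_dir, page_numbers):
    """Rename the .jpg scans in scan_dir (taken in sorted order) to
    zero-padded page-number names such as p0012.jpg.

    page_numbers is a hypothetical, hand-typed list of book page
    numbers, one per scan, in scan order.
    """
    scans = sorted(Path(scan_dir).glob("*.jpg"))
    if len(scans) != len(page_numbers):
        raise ValueError("need exactly one page number per scan")
    new_names = []
    for scan, number in zip(scans, page_numbers):
        target = scan.with_name(f"p{number:04d}.jpg")
        if target.exists():
            # Refuse to overwrite an existing file.
            raise FileExistsError(target)
        scan.rename(target)
        new_names.append(target.name)
    return new_names

# Example call (paths and numbers are made up):
# rename_by_page("scans/some_book", [12, 47, 103])
```

Zero-padding keeps the files in page order when a file manager sorts them by name.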

As for the embedded fonts and their removal: I'm just imagining a hypothetical situation where, many years down the road, I'm on a different operating system and there have been some major technological changes. And I open up this PDF and the document won't render because some of the fonts or whatever no longer exist and I can't read it.
The few times I've opened up a word doc on Pages on my iPad I've seen problems.
And every time I try to open up something I wrote as an undergrad back in the early 90s (using MS Write or WordPerfect) the files are messed up.

So a jpeg may be lossy and hard on the eyes, but I know it will always look exactly the same regardless of the environment.

DSpider · 02-16-2013, 12:11 PM

If you're going the "image-only" route, at least process them with Scan Tailor and archive them as ZIP or RAR. Another format that you can count on for long-term compatibility is HTML, which is easy to edit and easy to convert, but you will need to proofread the whole thing (the OCR content) against the scanned images, at least once.
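The archiving step can be sketched with Python's standard `zipfile` module. This is only an illustration: the `archive_scans` helper and the folder layout are assumptions, and it presumes the processed pages were saved as PNG or TIFF files in a single folder.

```python
import zipfile
from pathlib import Path

def archive_scans(scan_dir, archive_path):
    """Pack every .png/.tif/.tiff file in scan_dir into one ZIP archive
    and return the archived file names in sorted order."""
    scan_dir = Path(scan_dir)
    pages = sorted(p for p in scan_dir.iterdir()
                   if p.suffix.lower() in {".png", ".tif", ".tiff"})
    # ZIP_DEFLATED is a general-purpose choice; images that are already
    # compressed will not shrink much further, but it does no harm.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for page in pages:
            zf.write(page, arcname=page.name)
    return [p.name for p in pages]

# Example call (paths are made up):
# archive_scans("scans/some_book", "some_book.zip")
```

One archive per book keeps the page images together and easy to move around.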

slex · 02-16-2013, 02:40 PM

Instead of Nitro, try PDF-XChange Viewer. From the menu choose Document -> OCR Pages, and when the dialog shows up make sure to select "Preserve original content & add text layer" for "PDF output type". After the job is done, just save the file and you will have a slightly bigger pdf file. Note that if you choose the other option for "PDF output type", the file size increases significantly.

willus · 02-21-2013, 08:44 AM

Quote:

Originally Posted by Tex2002ans

What "compact PDF" most likely does is just run some lossless compression on the scans, resulting in no loss in quality, while exporting as a "normal PDF" would export the uncompressed image files.

I was amazed at how small the "Compact PDF" files come out on Konica-Minolta copiers, so I checked them out a little. A normal PDF output on Konica-Minoltas uses a JPEG-embedded (I think) PDF with typical quality settings for readability and several bits per color component, but "Compact PDF" compresses even further by dropping the bits-per-pixel considerably. I noticed with one document where I had red markings on an otherwise black-and-white document, "Compact PDF" stored the PDF in two layers--a black-and-white layer and a red layer, each one with very few bits per color.
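A quick back-of-the-envelope sketch in Python shows why bits per pixel matter so much. The page size (US Letter) and the 300 dpi resolution are assumptions, not anything measured from the copier, and these are raw sizes before any compression is applied.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed values: 8.5 x 11 inch page scanned at 300 dpi.
for label, bpp in [("24-bit color", 24),
                   ("8-bit grayscale", 8),
                   ("1-bit black/white", 1)]:
    size = raw_page_bytes(8.5, 11, 300, bpp)
    print(f"{label}: {size:,} bytes per page before compression")
```

Under those assumptions a page is roughly 25 MB in 24-bit color, about 8 MB in grayscale, and about 1 MB at 1 bit per pixel, so dropping the bit depth cuts the data by a large factor before the compressor even starts.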

markom · 02-21-2013, 11:05 AM

Quote:

Originally Posted by shmendrapolk

Does anyone have experience with scanning a book and then optimizing it as a pdf file?

...

If you could provide me a link to your 2 MB book, I could do OCR in Adobe Acrobat, ABBYY FineReader, etc. and tell you the difference between editable text and searchable text image in those applications, which I usually use for PDF optimization.

Tex2002ans · 02-21-2013, 06:49 PM

Quote:

Originally Posted by markom

[...] Abbyy Finereader etc. and tell you the difference between editable text and searchable text image in those applications that I usually use for pdf optimization.

In Finereader 11:

Editable: Allows you to save as RTF, DOC, DOCX, ODT
Formatted: RTF, DOC, DOCX, ODT, XLS, XLSX, TXT, HTML, FB2, EPUB

When saving as a PDF though, you have multiple ways of doing it. If you go into the Settings, you are able to choose "Save Mode":

Text and Pictures Only
- This will only save the OCRed text, and will try to keep in the spirit of the original layout (sometimes you see a glitched line or two that fly off the page).
Text Over the Page Image
- I have seen this way break on a few PDF readers, and most assume that text is in the invisible backend, not a scan.
Text Under the Page Image
- I recommend this so you have the original scan as well.
- You will be able to read the original scanned document (and in the future be able to do any work on it that is needed).
  - For example, if a new, even more accurate OCR program came out, you will be able to feed it this PDF.
- You can still search the document/copy/paste perfectly fine.

Here are comparisons of the book I am currently working on (Finereader 11):

Original (13.7 MB PDF). I assume this version was just fed through some Adobe OCR built into a scanner:

http://library.mises.org/books/Lione...evelopment.pdf

Text Under The Page (7.34 MB PDF):

http://www.mediafire.com/view/?9d2ft4bunocnkou

Text/Picture Only (802 KB PDF):

http://www.mediafire.com/view/?mue0znuycy55l9i

Text/Picture Only, No Embedded Fonts (591 KB PDF):

http://www.mediafire.com/view/?drb02hcjmxuwvbd
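To put the four PDF sizes side by side, here is a small Python sketch. The sizes are the ones listed above; the only assumption is the conversion of 1 MB = 1024 KB.

```python
# Sizes as reported above, in KB (assuming 1 MB = 1024 KB).
sizes_kb = {
    "Original": 13.7 * 1024,
    "Text Under The Page": 7.34 * 1024,
    "Text/Picture Only": 802,
    "Text/Picture Only, No Embedded Fonts": 591,
}

original = sizes_kb["Original"]
for name, kb in sizes_kb.items():
    print(f"{name}: {kb:,.0f} KB ({kb / original:.1%} of original)")

# Space taken by the embedded fonts in the text-only variant.
fonts_kb = sizes_kb["Text/Picture Only"] - sizes_kb["Text/Picture Only, No Embedded Fonts"]
print(f"Embedded fonts: about {fonts_kb} KB")
```

The "Text Under The Page" version comes out a little over half the size of the original, the text-only versions land at roughly 4 to 6 percent, and the embedded fonts account for about 211 KB in this particular book.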

I decided to pick RTF since it can be saved both in "Editable" and "Formatted".

Here is an image comparing the Formatted/Editable output from Finereader:

http://www.imagebam.com/image/5d1a68238610245

Formatted RTF (1.30 MB RTF):

http://www.mediafire.com/view/?ce0jd2p1rzx0ibf

Editable RTF (1.32 MB RTF):

http://www.mediafire.com/view/?b42v1jk946uih76

In my testing between Adobe/Finereader, Finereader makes much smaller filesizes, AND has more accurate OCR.

In the original poster's case, I would still stick with my usual recommendation: keep the original scan as the frontend, and have the OCRed text in the backend.

Quote:

Originally Posted by willus

I noticed with one document where I had red markings on an otherwise black-and-white document, "Compact PDF" stored the PDF in two layers--a black-and-white layer and a red layer, each one with very few bits per color.

That sounds like they do a fantastic job at making PDFs much smaller. I assume all of these scanners have their own little tiny proprietary tweaks to try to get their scanned PDFs smaller. Chopping out unused colors is one way to get the filesize way down. The book doesn't have all of the colors in the rainbow!

I personally just work with already scanned (mostly black and white) non-fiction books. Since there are only two colors, black, and white, you can imagine that they compress quite well.

Back to the OCR of documents: the auto-OCR on these scanners is OK (from what I have seen, many of these are based on some sort of Adobe program), but if you look at the text, you can always see the typical OCR errors.

I feel that an outside program (I use Finereader), will give you a much more accurate OCR than those that come bundled with the scanner. In my mind, a more accurate OCR = closer to the original book = a much more enjoyable reading experience.

My work is to convert the books into digital form (EPUB), so I need a nearly 100% correct conversion... and while I am at it, I can toss out that auto-OCRed stuff, and make a nearly 100% accurate PDF text backend as well.

On top of that, Finereader seems to have even better ways of making the PDFs smaller than those scanners. So I just see it as win-win-win-win-win.

DSpider · 02-22-2013, 06:17 AM

Quote:

Originally Posted by Tex2002ans

I personally just work with already scanned (mostly black and white) non-fiction books. Since there are only two colors, black, and white, you can imagine that they compress quite well.

"Black and white" can mean different things to different people... Some use the term even though the image is actually grayscale, while others use it for an actual 1-bit image (each pixel either black or white)! 1-bit images compress better. It's the same output that you would get from processing them with Scan Tailor.

willus · 02-22-2013, 07:57 AM

Of course, if you want to get really serious about book scanning, you can make yourself one of these.

slex · 02-22-2013, 08:02 AM

Quote:

Originally Posted by Tex2002ans

In Finereader 11:
...

Of course, Finereader is excellent software, but it costs 129 euros.
