View Full Version : Drastic pdf drop in quality of scanned book after using acrobat optimize feature


harryE123
12-29-2008, 06:33 PM
Let me describe my setup a bit: I use an OpticBook 3600 book scanner. I scan my books to JPG at max quality and then import to Acrobat. I use the highest quality setting while merging them (instead of the default, or the even lower "optimize for space"), then run an OCR through the file and finally an "optimize file" with the highest setting and all the default settings. Sizes for a 325-page book go like this: about 900 MB, then about 90 MB, then about 25 MB. BUT the difference in quality is striking between the OCR stage and the optimization stage; the book fonts really degrade after optimization, with letters missing tiny fractions everywhere... has anyone run across this? I mean, it's understandable why there might be a slight (almost unnoticeable) drop in quality after OCR, from the 900 MB to the 90 MB file (although why OCR would reduce the size and drop the quality slightly is something I can't figure out; maybe it has to do with the pages no longer being images but text, hence more compact?), but this drop after optimization? Is there maybe a setting in the optimize dialog box that is the culprit and that I need to change?

I will post some comparison photos as well as the exact optimization settings I use (which are the default ones anyway).

pdurrant
12-30-2008, 05:07 AM
It seems to me that your OCR stage is not doing what you expect. I think it must be adding the OCRed text to the file, not replacing the scans. The degradation in the image quality is just a result of over-compression of your scans.

If your scanned books were actually being turned into text instead of images, the size would drop to 1MB or less, not 25MB.
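A back-of-envelope calculation makes these orders of magnitude plausible. The page dimensions, colour depth and compression ratio below are illustrative assumptions, not measurements from the actual scans:

```python
# Rough size estimates for a 325-page scanned book.
# Assumed: 6" x 9" pages at 600 dpi, 24-bit colour, ~20:1 JPEG
# compression at high quality, ~2000 characters of text per page.

PAGES = 325
DPI = 600
WIDTH_IN, HEIGHT_IN = 6, 9          # assumed page size in inches
BYTES_PER_PIXEL = 3                 # 24-bit colour

pixels_per_page = (WIDTH_IN * DPI) * (HEIGHT_IN * DPI)
raw_mb = pixels_per_page * BYTES_PER_PIXEL * PAGES / 1e6
jpeg_mb = raw_mb / 20               # assumed ~20:1 JPEG ratio
text_mb = 2000 * PAGES / 1e6        # pure text, no images

print(f"uncompressed images: ~{raw_mb:,.0f} MB")
print(f"high-quality JPEG:   ~{jpeg_mb:,.0f} MB")
print(f"plain text only:     ~{text_mb:.1f} MB")
```

Under these assumptions the JPEG figure lands near the reported 900 MB, while text alone would be well under 1 MB, which is why a 25 MB result means the images are still in the file.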

Paul


DDHarriman
12-30-2008, 08:42 AM
Hi harryE123

This has to do with the type of OCR you are doing, so let me ask:
1 - which version of Acrobat are you using?
2 - when you do OCR in Acrobat, which option do you choose?

Using Acrobat 8 and choosing OCR Text Recognition and then Recognize Text Using OCR, one gets a dialog where the settings can be edited.
There are 3 settings there; the important one is PDF Output Style. Of the 3 choices it offers, the first 2 produce one type of PDF and the last one (Formatted Text and Graphics) produces a different type of PDF.

So:

1 - the first 2 options (beginning with Searchable…) produce a PDF with 2 layers: the first layer is the page image, and the second sits under the first, hidden from view, containing the text positioned exactly where the "upper" image shows it.
The result for the user is that he sees an image when viewing the PDF, but he can select the text and copy it, and he can (for example) search for a word/phrase in the document, etc…

2 - the last option gives just one layer with text and images. These images are the real images from the original scan (like tables or photos) that Acrobat could identify as images, plus all the letters Acrobat had doubts about. One can get rid of the latter by proofreading: choose OCR Text Recognition again and then Find First OCR Suspect (or Find All Suspects). You then get a window showing the first suspect and a proposed reading; you can accept or correct it. Once corrected, the little image is replaced by the correct letters and Acrobat jumps to the next suspect… and this goes on up to the end of the PDF.

Finally, there is the problem with images.
The biggest impact optimizing has on the file comes from compressing images. So if one has PDFs of type (1) (even after OCRing) and lowers the resolution of the upper image layer, the file drops dramatically in size, but at the cost of image quality. The same happens if one lowers the quality setting, just as when one increases the compression on a digitized photo.
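The resolution effect compounds quickly, because pixel count (and hence pre-compression image data) scales with the square of the dpi. A tiny illustration, with the resolutions purely as examples:

```python
# Downsampling a scanned page: image data scales with dpi squared,
# so halving the resolution quarters the data before compression.

def relative_size(dpi_from: int, dpi_to: int) -> float:
    """Fraction of image data left after resampling dpi_from -> dpi_to."""
    return (dpi_to / dpi_from) ** 2

print(relative_size(600, 300))   # 0.25 -> a quarter of the data
print(relative_size(600, 150))   # 0.0625 -> one sixteenth
```

This is why an optimizer that downsamples the image layer can shrink a file so dramatically while visibly eating into the letterforms.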

harryE123
12-30-2008, 12:41 PM
Hey guys, much thanks for the replies.

@pdurrant: Yes, I suppose that's what it's doing, as you say. That said, the major degradation happens between the OCR output and the final optimized file.

@ddharriman: 1. I am using Acrobat 8 Professional. 2. I have uploaded a screen capture of my settings; namely they are Searchable Image, English, and 600 dpi.

Reading the rest of your exemplary reply makes a lot of sense; I've understood quite a few things. So essentially, if at the OCR stage I choose option 1 or 2 rather than 3, that will have an even worse impact at the optimization (aka compression) stage, since with one layer of the document being a series of big images, optimizing them means degrading font quality. So I should rather aim for option 3, sort out anything that was not recognized via the find-suspects step, and then optimize (i.e. compress). (And then there's the Reduce File Size option, which I have no idea what it does....) I have also attached the settings I use during optimization. :thanks: :xmas:

DDHarriman
12-30-2008, 04:44 PM
Basically correct.

But sometimes it's a question of trade-offs: in many businesses, 99% accuracy in the OCR is acceptable instead of 100%, and/or one does not want to see badly recognized words/letters. In those circumstances one of the first 2 options is chosen.

The 3rd also directly gives the smallest file, since what you get is mostly pure text (with some formatting) rather than a bitmap image.

I also advise you to use Advanced > PDF Optimizer instead of the option you are using, as there you get much better control over the output.
In that dialog you also have, in the upper corner, the Audit space usage button, which lets you see what is taking up the space in your PDF.

harryE123
12-31-2008, 11:40 PM
Happy new year, guys. Thanks, dd, for the replies. I've got to tell you, man, this is such hard work; I'll start reading an Acrobat "bible" book because I am not happy with any of the results so far:

1. Some books lend themselves to OCR better, of course, as anyone might imagine, because they carry few graphs and pictures. But even these, though they gain tremendously in size (as pdurrant mentioned above), give me "funny" fonts too: most of the text is recognized OK, roughly 97% or more, but the characters get weird spacing, such as: "t his is a s a mple". Of course the erroneous spacing isn't always that pronounced, but it's close.

2. When I apply optimization afterwards I get very weird results too. I try edge shadow removal (aggressive) but it almost doesn't work at all; as far as I can tell, none of the long solid slim black lines in the margins are recognised as such and removed. And if I apply the default settings to an OCRed book of about 200 pages (with OCR choice 3, not the bitmaps), I get a LARGER file after the optimization. How can that be? I'll take your advice, dd, and try the Advanced optimization, but at the moment I'm struggling to see how it complements, surpasses or corresponds to the simple optimization.

3. With some books with a lot of pictures and graphs, OCR type 3 is simply unacceptable.

4. One of my original queries still holds and is still very perplexing: how, by simply applying OCR type I or II, do I get degraded quality? A sublayer should just be added, hence more size but the same quality bitmap. Why do I get LESS quality there, and less size? It's not supposed to do anything on top of the bitmaps, just add the text sublayer... very strange.

5. The only thing (sigh...) that puts a smile on my face is applying the size reduction by making the file compatible with Acrobat 8 and above, which I figure just drops some levels of backward compatibility, with adequate gains in size. This works well so far, no problems.

6. It's very confusing because there are so many things you have to factor in, one of them being the initial choice of quality when merging the original JPGs: should I aim for the highest and then work my way down through the various optimizations and OCRs, or should I opt for the middle way, the second choice...

I didn't expect this whole process to be so hard and counterintuitive, esp. with such an expensive product as Acrobat... oh well... computers... I am a computer scientist, I should have expected that...

HAPPY NEW YEAR TO ALL, HAPPY READING, BEST OF HEALTH TO YOU AND YOUR LOVED ONES.