08-25-2010, 09:18 AM   #11
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Timber
Sorry for being unclear.
No problem

Spoiler:
Quote:
Most e-books are in the 200k range. I have a batch of scanned PDFs that are way larger than usual, anywhere from 20 MB to 50 MB each.

I believe the extra size is due to the whole dang thing having been scanned as images rather than doing OCR and tagging at scan time.

It gets even worse, because when I import them into Calibre they blow up to between 3 and 5 times the size of the already large PDF (sometimes way more than that).

So what I've been trying to do is to shrink these already scanned PDFs into something usable on my book reader.

I want to keep both the text and the various illustrations. Not just text only or the starting page images.

Everything above (it's just your post to this point) is what I thought you were doing.

Quote:
To do this, I've tried several different things:

1) I tried using the Reduce File Size option in Acrobat and limiting compatibility to only the latest version. This helps some, but not a lot, and when I import into Calibre the file size goes up hugely (into the 100 MB range for some files).
Right. You still have the huge scanned images of each page.

Quote:
2) I tried using the OCR function in Acrobat on some of these files so it OCRs text inside of the book and knows what's text and what's images. It doesn't seem to do what I want. I think as you noted it tags text within the images and keeps the image. Not what I needed.
Right. It's just added the OCR'd text to everything else.

Quote:
3) I tried a commercial OCR tool on the source files (Omnipage). It was horrible. It couldn't tell the difference between place names in a map, which should be kept as an image and not be OCRed, and inline text, which should be OCRed. Also there were literally thousands of places where it couldn't recognize words in a 100 page book. If you've ever seen v1.0 of a scanned document before the clean up you'll know what I mean. For OCR and recognizing images vs text Acrobat seems to do a far better job.
Right - I read your post about this.

Quote:
4) What seems to work for getting the file sizes down somewhat (about 1/3 of the starting PDF size, but still way bigger than other e-books) is to export the original document and generate tags for it on export, then convert to a format Calibre can use (I used html, but I'm sure others would be fine). This took a 39 MB source file and gave me an 11 MB LRF, vs the well over 100 MB and in some cases 300 MB + that I got from simply loading the PDF into Calibre.
OK, but don't you still end up with images of each page? I thought the point was to get to reflowing text, not keep the original images of text. Or did I misunderstand your goal?

To ask this another way, after you're done (using method 4) do you still have each page as an image of that page? If it's still an image of the page, what image format is the image in? jpg? tiff? gif? I know it's embedded in an ebook format, but what's going on with the image? If it's not just an image of the page (i.e., an image of the text on a page) where did the OCR image->text conversion occur?
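
If it helps answer that, here's a rough Python sketch for checking what the page images actually are. It assumes your HTML export left the images as separate files in one folder; the function names and the `export_dir` argument are mine, made up for illustration, not anything from Acrobat or Calibre. It only looks at the first few bytes of each file, so treat the result as a hint rather than proof.

```python
from pathlib import Path

# Well-known leading "magic" bytes for common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def sniff_image_format(data):
    """Guess an image format from the first bytes of a file (hypothetical helper)."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def report_images(export_dir):
    """Return (file name, guessed format, size in bytes) for each file in export_dir.

    export_dir is an assumed folder holding the exported HTML and its images.
    """
    rows = []
    for path in sorted(Path(export_dir).iterdir()):
        if path.is_file():
            with path.open("rb") as fh:
                head = fh.read(16)
            rows.append((path.name, sniff_image_format(head), path.stat().st_size))
    return rows
```

Run `report_images` on the export folder and look at the sizes: if nearly all of the bytes sit in one large image per page, the text is still a picture of text and no image->text conversion has really happened. The HTML file itself will show up as "unknown", which is expected.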