MobileRead Forums - View Single Post - small PDFs becoming huge LRFs when converted

Starson17 · 08-25-2010, 09:18 AM

Quote:

Originally Posted by Timber

Sorry for being unclear.

No problem

Everything above (it's just your post to this point) is what I thought you were doing.

Quote:

To do this, I've tried several different things:

1) I tried using the Reduce File Size option in Acrobat and limiting compatibility to only the latest version. This helps some, but not a lot, and when I import into Calibre the file size goes up hugely (into the 100 MB range for some files).

Right. You still have the huge scanned images of each page.
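
If you want to confirm that, here is a rough sketch (my own illustration, not anything Acrobat or Calibre provides) that scans the raw bytes of a PDF for image objects and for the compression filters normally used only on image data. The names `scan_pdf_bytes` and `scan_pdf` are made up for this example, and the counts are only a heuristic: it does not parse the PDF structure, so treat the numbers as a hint rather than an exact inventory.

```python
import re

# PDF stream filters that are normally used only for image data.
IMAGE_FILTERS = {
    b"/DCTDecode": "JPEG",
    b"/JPXDecode": "JPEG 2000",
    b"/CCITTFaxDecode": "CCITT fax (bilevel)",
    b"/JBIG2Decode": "JBIG2 (bilevel)",
}

# Image XObjects are declared with "/Subtype /Image" (whitespace optional).
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def scan_pdf_bytes(data):
    """Heuristic count of image objects and image filters in raw PDF bytes."""
    report = {"image_xobjects": len(IMAGE_XOBJECT.findall(data))}
    for token, label in IMAGE_FILTERS.items():
        report[label] = data.count(token)
    return report


def scan_pdf(path):
    """Read a PDF file from disk and run the heuristic scan on it."""
    with open(path, "rb") as fh:
        return scan_pdf_bytes(fh.read())
```

If `scan_pdf("yourbook.pdf")` (hypothetical file name) reports roughly one image object per page, the file is most likely still a stack of page scans, whatever text Acrobat has layered on top.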

Quote:

2) I tried using the OCR function in Acrobat on some of these files so it OCRs text inside of the book and knows what's text and what's images. It doesn't seem to do what I want. I think as you noted it tags text within the images and keeps the image. Not what I needed.

Right. It's just added the OCR'd text to everything else.

Quote:

3) I tried a commercial OCR tool on the source files (Omnipage). It was horrible. It couldn't tell the difference between place names in a map, which should be kept as an image and not be OCRed, and inline text, which should be OCRed. Also, there were literally thousands of places where it couldn't recognize words in a 100-page book. If you've ever seen v1.0 of a scanned document before the cleanup, you'll know what I mean. For OCR and recognizing images vs. text, Acrobat seems to do a far better job.

Right - I read your post about this.

Quote:

4) What seems to work for getting the file sizes down somewhat (to about 1/3 of the starting PDF size, but still way bigger than other e-books) is to export the original document and generate tags for it on export, then convert to a format Calibre can use (I used HTML, but I'm sure others would be fine). This took a 39 MB source file and gave me an 11 MB LRF, vs. the well over 100 MB, and in some cases 300+ MB, that I got from simply loading the PDF into Calibre.

OK, but don't you still end up with images of each page? I thought the point was to get to reflowing text, not to keep the original images of text. Or did I misunderstand your goal?

To ask this another way, after you're done (using method 4) do you still have each page as an image of that page? If it's still an image of the page, what image format is the image in? jpg? tiff? gif? I know it's embedded in an ebook format, but what's going on with the image? If it's not just an image of the page (i.e., an image of the text on a page) where did the OCR image->text conversion occur?
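
One way to answer the image-format question for the HTML export from method 4 is to tally the image files the export wrote to disk. The sketch below is only an illustration: `tally_images` and the list of extensions are my own choices, and it assumes the export puts its images in ordinary files under one folder.

```python
import os
from collections import defaultdict

# Extensions treated as images here; adjust to whatever the export produces.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp"}


def tally_images(export_dir):
    """Return {extension: (file_count, total_bytes)} for images under export_dir."""
    totals = defaultdict(lambda: [0, 0])
    for root, _dirs, files in os.walk(export_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if ext in IMAGE_EXTS:
                entry = totals[ext]
                entry[0] += 1  # one more file with this extension
                entry[1] += os.path.getsize(os.path.join(root, name))
    return {ext: (count, size) for ext, (count, size) in totals.items()}
```

Calling `tally_images` on the export folder shows which format the images are in and how much of the total size they account for. If the image count is close to the page count and the images make up nearly all of the bytes, each page is still being carried as a picture.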