MobileRead Forums - View Single Post - small PDFs becoming huge LRFs when converted

Timber · 08-25-2010, 04:07 AM

Sorry for being unclear.

Most e-books are in the 200 KB range. I have a batch of scanned PDFs that are way larger than usual, anywhere from 20 MB to 50 MB each.

I believe the extra size is due to the whole dang thing having been scanned as images rather than doing OCR and tagging at scan time.

It gets even worse, because when I import them into Calibre they blow up to between 3 and 5 times the size of the already large PDF (sometimes way more than that).

So what I've been trying to do is to shrink these already scanned PDFs into something usable on my book reader.

I want to keep both the text and the various illustrations, not just the text alone or just the original page images.

To do this, I've tried several different things:

1) I tried using the Reduce File Size option in Acrobat and limiting compatibility to only the latest version. This helps some, but not a whole lot, and when I import into Calibre the file size still goes up hugely (into the 100 MB range for some files).

2) I tried using the OCR function in Acrobat on some of these files so that it OCRs the text inside the book and knows what's text and what's images. It doesn't seem to do what I want. I think, as you noted, it tags the text within the images and keeps the image. Not what I needed.

3) I tried a commercial OCR tool on the source files (Omnipage). It was horrible. It couldn't tell the difference between place names in a map, which should be kept as an image and not be OCRed, and inline text, which should be OCRed. Also, there were literally thousands of places where it couldn't recognize words in a 100-page book. If you've ever seen v1.0 of a scanned document before the cleanup, you'll know what I mean. For OCR and recognizing images vs. text, Acrobat seems to do a far better job.

4) What seems to work for getting the file sizes down somewhat (to about 1/3 of the starting PDF size, but still way bigger than other e-books) is to export the original document and generate tags for it on export, then convert to a format Calibre can use (I used HTML, but I'm sure others would be fine). This took a 39 MB source file and gave me an 11 MB LRF, vs. the well over 100 MB, and in some cases 300 MB+, that I got from simply loading the PDF into Calibre.
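As a rough check on those numbers, here is a minimal Python sketch of the arithmetic. The function name `size_ratio` and the labels are made up for illustration, and it assumes the 100 MB and 300 MB results are being compared against the same 39 MB source file, which may not be true for every book in the batch.

```python
def size_ratio(output_mb: float, source_mb: float) -> float:
    """Return the output size as a multiple of the source size."""
    if source_mb <= 0:
        raise ValueError("source size must be positive")
    return output_mb / source_mb


source_pdf_mb = 39.0  # the scanned source PDF mentioned above

# Output sizes quoted above. The 100 and 300 figures are lower bounds
# ("well over 100 MB", "300 MB+"), so the real ratios are at least this large.
results_mb = {
    "tagged export -> HTML -> LRF": 11.0,
    "direct PDF import (low end)": 100.0,
    "direct PDF import (high end)": 300.0,
}

for label, output_mb in results_mb.items():
    ratio = size_ratio(output_mb, source_pdf_mb)
    print(f"{label}: {ratio:.2f}x the source PDF")
```

On these figures the tagged-export route comes out a bit under a third of the source size, while the direct import is at least two and a half times larger than the source.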

I still think I should be able to shrink the files far smaller than 11 MB, but at least I don't feel like my files are just exploding in size when loaded into Calibre.
