small PDFs becoming huge LRFs when converted

Timber · 08-18-2010, 04:33 AM

Hi Kovid,

Running into something strange that I'm hoping you can help with.

Most of the time, calibre's Convert E-books function makes files a LOT smaller when I go from almost any other format to .LRF for my Sony Reader.

Recently I've been adding some graphics-intensive PDFs (the Osprey military history books) to my library, and they've grown hugely in file size when converted for loading onto the reader. For example, one PDF, The Gulf War 1991 by Alistair Finlan, went from a 3.3 MB PDF to a 22.7 MB LRF when I hit the convert button.

Any insights that you might be able to offer as to why this is happening and what I can do to fix it?

Thanks a bunch for the great tool and for the help.

Timber · 08-18-2010, 04:52 AM

I think I may have found the problem.

It looks like a lot of the source file is being treated as images rather than text, and PDF is really good at compressing images, better than most other formats, so when I convert it, it gets bigger.

I have the full version of Acrobat and I'm playing with the OCR functions to have it recognize the text as text, rather than the full page as an image.

Betcha that's the problem. If so, I should be able to make these files many times smaller. Here's hoping...

Timber · 08-19-2010, 02:46 PM

Nope. OCRing the files makes them marginally smaller as PDFs, but they still get huge as LRFs.

Any suggestions?

chaley · 08-19-2010, 03:49 PM

Quote:

Originally Posted by Timber

Nope. OCRing the files makes them marginally smaller as PDFs, but they still get huge as LRFs.

I think that Acrobat's OCR leaves the images in place, associating the text with the characters it comes from in some overlay fashion. This is why you can sometimes search text in PDFs that are obviously images. There was a thread some time back about Greek characters in documents that demonstrated this. When looking at the PDF, one saw Greek, but ebooks made using the OCRed text had garbage in the same spot.

Try saving the OCRed PDF as text. That will get rid of the images. You could also try HTML.

Timber · 08-24-2010, 09:34 AM

Quote:

Originally Posted by chaley

I think that Acrobat's OCR leaves the images in place, associating the text with the characters it comes from in some overlay fashion. This is why you can sometimes search text in PDFs that are obviously images. There was a thread some time back about Greek characters in documents that demonstrated this. When looking at the PDF, one saw Greek, but ebooks made using the OCRed text had garbage in the same spot.

Try saving the OCRed PDF as text. That will get rid of the images. You could also try HTML.

Yep, except that there are a lot of images in the book and I really want to keep the ones that are actually images in the original. I just want text to be treated as text so I don't end up with 50 MB+ .LRF files.

/sigh this sucks, cause even though the PDFs are big (5 MB to 10 or 12 MB), the LRFs are Huuuuuuuuuge.

Starson17 · 08-24-2010, 10:32 AM

Quote:

Originally Posted by Timber

there are a lot of images in the book and I really want to keep the ones that are actually images in the original. I just want text to be treated as text so I don't end up with 50 MB+ .LRF files.

/sigh this sucks, cause even though the PDFs are big (5 MB to 10 or 12 MB), the LRFs are Huuuuuuuuuge.

I finally had a chance to try the OCR in Acrobat. As chaley says, it leaves multiple tiny images of the text, so the result as a pdf is highly readable - all you see are the original images of the text.

Highlighting and pasting into a txt document shows the OCR'd text only. In my tests, the results were pretty bad. It was only marginally readable as pure OCR'd text. Headings in an italicized different font were completely unreadable. Some words were split up, etc.

I suspect there is a site somewhere that will tell you how to remove all the text images and replace them with the associated OCR'd true text. There must be some way to do it. I hoped I'd find such a feature in Acrobat, but so far, no luck. Even if I found it, it would take a lot of work to get the result cleaned up.

Timber · 08-24-2010, 05:52 PM

Quote:

Originally Posted by Starson17

I finally had a chance to try the OCR in Acrobat. As chaley says, it leaves multiple tiny images of the text, so the result as a pdf is highly readable - all you see are the original images of the text.

Highlighting and pasting into a txt document shows the OCR'd text only. In my tests, the results were pretty bad. It was only marginally readable as pure OCR'd text. Headings in an italicized different font were completely unreadable. Some words were split up, etc.

I suspect there is a site somewhere that will tell you how to remove all the text images and replace them with the associated OCR'd true text. There must be some way to do it. I hoped I'd find such a feature in Acrobat, but so far, no luck. Even if I found it, it would take a lot of work to get the result cleaned up.

Thanks Starson,

Yep, I've had a similar experience.

I tried 2 different things - one was really really bad, the other one worked marginally.

In my first test I used Nuance Omnipage Professional. I have the older v16, so maybe the new v17 is better but I doubt it.

This tool did an ok enough job OCRing. The problem was that it was really really dumb. It treated place names on a map as text to be OCRed and cleaned up. So you ended up with a copy where the text was fine and readable but the images were all mangled. Also in my tests Omnipage made the files between 2 and 6 times bigger while mangling them. Ouch.

In my second test I did something I should have done to start off. I used Acrobat Standard edition to export the file as RTF, then used Scansoft PDF Create to convert it back to a PDF. When I then used the Reduce Size option in Acrobat Standard it shrank from an original 59 MB to a final 12.7 MB, which is a nice improvement even though still big.

Final PDF looks real nice too, with one problem. The export to RTF step crops the right side of the page for some reason, probably related to page border settings, so I lost the right edge and a couple of letters on the right.

Looks promising, but not quite there yet.

Timber · 08-24-2010, 09:10 PM

OK got a work around for this. It's not a huge reduction in size like you'd want, but it takes a 39 MB input and gives you an 11 MB output, which is OK.

Here's what you do. You'll need Adobe Acrobat Standard edition for this. Open the large PDF that was scanned as images and choose Save As. I used output to Doc because of the huge space savings over RTF. I also used the option to add tags.

Then you'll see in the output that some of the full screen images are off the printable page. Handle this by opening the Doc file and choosing smaller margins than the 1" all around default. I used .75" but most anything in that range of margins will work fine. I saved the result out as .Docx for the additional space savings.

Next issue: Calibre is a great tool, but it won't read Word documents (.doc or .docx). I handled this by saving the result out as HTML. I loaded that into Calibre, added my metadata and hit convert, and voila, the 39 MB source was now an 11 MB .LRF file. (Substitute the output format for your own book reader.)

Not perfect by a long shot. On my reader most books are 0.2 - 0.5 MB, and this one was 11.1, but a heck of a lot better than the huge files I was getting before.
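For anyone who would rather script the last step than click through the GUI, here is a minimal sketch in Python. It assumes calibre's command-line tools are installed and that ebook-convert is on the PATH. The file names book.html and book.lrf are made up for illustration, and the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, output):
    """Return the argument list for calibre's ebook-convert CLI.

    ebook-convert picks the input and output formats from the file extensions.
    """
    return ["ebook-convert", str(source), str(output)]


def convert(source, output):
    """Run the conversion and return the size of the result in MB."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    subprocess.run(build_convert_command(source, output), check=True)
    return Path(output).stat().st_size / (1024 * 1024)


# Hypothetical file names; this only prints the command, it does not run it.
print(build_convert_command("book.html", "book.lrf"))
```

Calling convert("book.html", "book.lrf") would run the same conversion as the Convert button and report the output size, which makes it easy to compare a few export routes side by side.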

Starson17 · 08-24-2010, 09:19 PM

Quote:

Originally Posted by Timber

OK got a work around for this.

It's not clear to me what you're working around. In an earlier post, you indicated you wanted text, not images of text.

Quote:

Open the large PDF that was scanned as images and choose Save As. I used output to Doc because of the huge space savings over RTF. I also used the option to add tags.

The reason I'm confused is that you don't seem to be doing any OCR, so if you start with scanned images of each page, don't do any OCR and save as .doc, don't you just get images of pages inside a .doc file?

Timber · 08-25-2010, 04:07 AM

Sorry for being unclear.

Most e-books are in the 200k range. I have a batch I got that are in the form of PDFs that were scanned, and they're wayyyy larger than usual anywhere from 20 MB to 50 MB for the PDFs.

I believe the extra size is due to the whole dang thing having been scanned as images rather than doing OCR and tagging at scan time.

It gets even worse, because when I import them into Calibre they blow up to between 3 and 5 times the size of the already large PDF (sometimes way more than that).

So what I've been trying to do is to shrink these already scanned PDFs into something usable on my book reader.

I want to keep both the text and the various illustrations. Not just text only or the starting page images.

To do this, I've tried several different things:

1) I tried using the Reduce File Size option in Acrobat and limiting compatibility to only the latest version. This helps some, but not a huge lot and when I import into Calibre the file size goes up hugely (into the 100 MB range for some files)

2) I tried using the OCR function in Acrobat on some of these files so it OCRs text inside of the book and knows what's text and what's images. It doesn't seem to do what I want. I think as you noted it tags text within the images and keeps the image. Not what I needed.

3) I tried a commercial OCR tool on the source files (Omnipage). It was horrible. It couldn't tell the difference between place names in a map, which should be kept as an image and not be OCRed, and in-line text, which should be OCRed. Also there were literally thousands of places it couldn't recognize words in a 100 page book. If you've ever seen v1.0 of a scanned document before the clean up you'll know what I mean. For OCR and recognizing images vs text Acrobat seems to do a far better job.

4) What seems to work for getting the file sizes down somewhat (about 1/3 of the starting PDF size, but still way bigger than other e-books) is to export the original document and generate tags for it on export, then convert to a format Calibre can use (I used html, but I'm sure others would be fine). This took a 39 MB source file and gave me an 11 MB LRF, vs the well over 100 MB and in some cases 300 MB + that I got from simply loading the PDF into Calibre.

I still think I should be able to shrink the files far smaller than 11 MB, but at least I don't feel like my files are just exploding in size when loaded into Calibre.
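To keep the numbers in this thread straight, here is a quick Python sketch of the arithmetic. The sizes are the ones quoted in the posts above; the function name and labels are just for illustration.

```python
def size_ratio(before_mb, after_mb):
    """How many times bigger (or smaller) the output is than the input."""
    return after_mb / before_mb


# (label, size before in MB, size after in MB), all taken from the posts above
cases = [
    ("Gulf War PDF -> LRF via plain convert", 3.3, 22.7),
    ("RTF round trip plus Reduce Size", 59, 12.7),
    ("Doc -> HTML -> LRF workaround", 39, 11),
]

for label, before, after in cases:
    print(f"{label}: {size_ratio(before, after):.2f}x the original size")
```

The plain convert comes out at roughly 6.9 times the source size, while the workaround lands at about 28% of it, which matches the "about 1/3" figure above.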

Starson17 · 08-25-2010, 09:18 AM

Quote:

Originally Posted by Timber

Sorry for being unclear.

No problem

Spoiler:

Everything above (it's just your post to this point) is what I thought you were doing.

Quote:

To do this, I've tried several different things:

1) I tried using the Reduce File Size option in Acrobat and limiting compatibility to only the latest version. This helps some, but not a huge lot and when I import into Calibre the file size goes up hugely (into the 100 MB range for some files)

Right. You still have the huge scanned images of each page.

Quote:

2) I tried using the OCR function in Acrobat on some of these files so it OCRs text inside of the book and knows what's text and what's images. It doesn't seem to do what I want. I think as you noted it tags text within the images and keeps the image. Not what I needed.

Right. It's just added the OCR'd text to everything else.

Quote:

3) I tried a commercial OCR tool on the source files (Omnipage). It was horrible. It couldn't tell the difference between place names in a map, which should be kept as an image and not be OCRed, and in-line text, which should be OCRed. Also there were literally thousands of places it couldn't recognize words in a 100 page book. If you've ever seen v1.0 of a scanned document before the clean up you'll know what I mean. For OCR and recognizing images vs text Acrobat seems to do a far better job.

Right - I read your post about this.

Quote:

4) What seems to work for getting the file sizes down somewhat (about 1/3 of the starting PDF size, but still way bigger than other e-books) is to export the original document and generate tags for it on export, then convert to a format Calibre can use (I used html, but I'm sure others would be fine). This took a 39 MB source file and gave me an 11 MB LRF, vs the well over 100 MB and in some cases 300 MB + that I got from simply loading the PDF into Calibre.

OK, but don't you still end up with images of each page? I thought the point was to get to reflowing text, not keep the original images of text. .... Or did I misunderstand your goal?

To ask this another way, after you're done (using method 4) do you still have each page as an image of that page? If it's still an image of the page, what image format is the image in? jpg? tiff? gif? I know it's embedded in an ebook format, but what's going on with the image? If it's not just an image of the page (i.e., an image of the text on a page) where did the OCR image->text conversion occur?

Timber · 08-25-2010, 10:43 AM

You are correct sir.

Thanks for helping me with the troubleshooting process. What I thought was OCR happening may not have been. Dang if I know why the files reduced in size. I'd love to be able to get them down to the normal e-book size of a few hundred KB.

You were quite correct that it looks like the pages are just images of 100-200 KB each. I confirmed this by going into Acrobat and exporting the images, and as you can see below, each page is just an image.

Here's an example page http://i1025.photobucket.com/albums/...Image_0001.jpg

In this example, there's an image at the bottom, but the rest is plain text, so it looks to a lay person like me like the size of the document could be reduced dramatically by OCR in the appropriate places.

When the file sizes shrank, I assumed that was what had happened, but now it doesn't look like it.

Which I guess brings me back to a whole different question: is there a tool you'd recommend to do that? It doesn't look like the ones I've been trying are doing what I thought they did.
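A rough way to test the "every page is just an image" theory without Acrobat, sketched in Python. It only scans the raw bytes of the PDF for page and image dictionaries, so it is a heuristic and not a real PDF parser: files that pack their page objects into compressed object streams can report zero pages. The file name scanned_book.pdf and the function names are hypothetical.

```python
import re
from pathlib import Path

# "/Type /Page" marks a page object; the \b keeps "/Type /Pages" from matching.
PAGE_RE = re.compile(rb"/Type\s*/Page\b")
# "/Subtype /Image" marks an embedded image object.
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def count_pages_and_images(data):
    """Return (page_count, image_count) found in the raw PDF bytes."""
    return len(PAGE_RE.findall(data)), len(IMAGE_RE.findall(data))


def looks_like_page_scans(data):
    """Guess that the PDF is scanned if it has at least one image per page."""
    pages, images = count_pages_and_images(data)
    return pages > 0 and images >= pages


pdf_path = Path("scanned_book.pdf")  # hypothetical file name
if pdf_path.exists():
    pages, images = count_pages_and_images(pdf_path.read_bytes())
    print(f"{pages} pages, {images} embedded images")
```

At 100-200 KB per page image, a book of a hundred-odd pages would come to somewhere in the 10-20 MB range from the images alone, so a count of one image per page goes a long way toward explaining the file sizes.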

Starson17 · 08-25-2010, 02:06 PM

Quote:

Originally Posted by Timber

What I thought was OCR happening may not have been. Dang if I know why the files reduced in size. I'd love to be able to get them down to the normal e-book size of a few hundred KB.

Now we are on the same wavelength. You still have images of pages. Your processing may have reduced resolution or changed the image compression in some way. Either could have reduced file size.

The only way to get to what you want is to do OCR, clean up any OCR errors, then delete the images of text and leave the true images you want to save. I know how to do it all manually, but not how to do it automatically in a way that produces anything I like. I'm reading this because I hoped you had a solution for me.
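To put a number on the resolution point, here is a small Python sketch. The 7 x 10 inch page and the 300 and 150 dpi settings are assumptions picked for illustration, not measurements from the files in this thread, and the figures are for uncompressed pixels; JPEG-compressed sizes will not scale exactly the same way.

```python
def raw_page_bytes(width_in, height_in, dpi, bytes_per_pixel=1):
    """Uncompressed size in bytes of one scanned page.

    bytes_per_pixel=1 corresponds to 8-bit greyscale; use 3 for 24-bit colour.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Hypothetical page size and scan resolutions.
for dpi in (300, 150):
    size_mb = raw_page_bytes(7, 10, dpi) / 1_000_000
    print(f"{dpi} dpi: {size_mb:.2f} MB uncompressed per page")
```

Halving the resolution quarters the pixel count, so a tool that quietly resamples the page images on export could account for a large drop in file size without any OCR taking place.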

Timber · 08-25-2010, 04:42 PM

Been pounding my head against OmniPage Professional some more and I found a way to make it much less annoying.

What I've done so far is to export all the page images from Acrobat as JPGs, then import them into OmniPage.

What was driving me crazy before was that it does a really bad job setting zones - that is identifying what part of each page is images and should not be OCRed, and what part should be OCRed.

It's slow and tedious, but you can go into each page and manually draw the box for what part is text, graphics, forms or to be ignored.

Then, unfortunately it asks you several zillion times if you'd like to change a word that it scanned correctly to a completely wrong one, plus a stack of places it scans to just plain gibberish.

What's bugging me is that I did all that work, a couple of hours on one book, and it doesn't look like it saved me any space. The raw JPGs are 16 MB, but the RTF output from OmniPage is up to about 40 MB.
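One possible reason for the RTF being bigger than the JPGs, offered as a guess rather than a diagnosis: RTF normally stores embedded pictures as hexadecimal text, two characters per byte, so any image kept in the file takes at least twice the space it did on disk. Whether OmniPage's RTF export does exactly this is an assumption. The Python sketch below just shows the doubling on a few made-up bytes.

```python
import binascii


def hex_encoded_len(payload):
    """Length in characters of a byte string once written out as hex text."""
    return len(binascii.hexlify(payload))


# Four made-up bytes standing in for the start of an image file.
payload = b"\xff\xd8\xff\xe0"
print(len(payload), "bytes on disk ->", hex_encoded_len(payload), "characters as hex text")
```

By that arithmetic, 16 MB of page images kept as pictures would come to around 32 MB of RTF before any text or formatting is added, which is in the same neighbourhood as the 40 MB result.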
