08-18-2010, 04:33 AM | #1 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
small PDFs becoming huge LRFs when converted
Hi Kovid,
I'm running into something strange that I'm hoping you can help with. Most of the time, calibre's Convert E-books makes files a LOT smaller when I go from almost any other format to .LRF for my Sony Reader. Recently I've been adding some graphics-intensive PDFs (the Osprey military history books) to my library, and they've grown hugely in file size when converted. For example, one PDF, The Gulf War 1991 by Alistair Finlan, went from a 3.3 MB PDF to a 22.7 MB LRF when I hit the Convert button. Any insights you can offer as to why this is happening and what I can do to fix it? Thanks a bunch for the great tool and for the help. |
08-18-2010, 04:52 AM | #2 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
I think I may have found the problem.
It looks like a lot of the source file is being treated as images rather than text, and PDF is really good at compressing images, better than most other formats, so when I convert it, it gets bigger. I have the full version of Acrobat and I'm playing with the OCR functions to have it recognize the text as text, rather than the full page as an image. Betcha that's the problem. If so, I should be able to make these files many times smaller. Here's hoping... |
08-19-2010, 02:46 PM | #3 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
Nope. OCRing the files makes them marginally smaller as PDFs, but they still get huge as LRFs.
Any suggestions? |
08-19-2010, 03:49 PM | #4 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Try saving the OCRed PDF as text. That will get rid of the images. You could also try HTML. |
|
08-24-2010, 09:34 AM | #5 | |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
Quote:
/sigh This sucks, because even though the PDFs are big (5 MB to 10 or 12 MB), the LRFs are huuuuuuuuuge. |
|
08-24-2010, 10:32 AM | #6 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Highlighting and pasting into a txt document shows the OCR'd text only. In my tests, the results were pretty bad. It was only marginally readable as pure OCR'd text. Headings in a different, italicized font were completely unreadable. Some words were split up, etc. I suspect there is a site somewhere that will tell you how to remove all the text images and replace them with the associated OCR'd true text. There must be some way to do it. I hoped I'd find such a feature in Acrobat, but so far, no luck. Even if I found it, it would take a lot of work to get cleaned up. |
|
08-24-2010, 05:52 PM | #7 | |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
Quote:
Yep, I've had a similar experience. I tried 2 different things: one was really, really bad, and the other worked marginally.
In my first test I used Nuance OmniPage Professional. I have the older v16, so maybe the new v17 is better, but I doubt it. This tool did an OK enough job OCRing. The problem was that it was really, really dumb: it treated place names on a map as text to be OCRed and cleaned up, so you ended up with a copy where the text was fine and readable but the images were all mangled. Also, in my tests OmniPage made the files between 2 and 6 times bigger while mangling them. Ouch.
In my second test I did something I should have done to start off. I used Acrobat Standard edition to export the file as RTF, then used ScanSoft PDF Create to convert it back to a PDF. When I then used the Reduce Size option in Acrobat Standard, it shrank from an original 59 MB to a final 12.7 MB, which is a nice improvement even though it's still big. The final PDF looks real nice too, with one problem: the export to RTF step crops the right side of the page for some reason, probably related to page border settings, so I lost the right edge and a couple of letters on the right. Looks promising, but not quite there yet. |
|
08-24-2010, 09:10 PM | #8 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
OK got a work around for this. It's not a huge reduction in size like you'd want, but it takes a 39 MB input and gives you an 11 MB output, which is OK.
Here's what you do. You'll need Adobe Acrobat Standard edition for this.
1) Open the large PDF that was scanned as images and choose Save As. I saved to Doc because of the huge space savings over RTF, and I turned on the option to add tags.
2) In the output, some of the full-screen images will run off the printable page. Fix this by opening the Doc file and choosing smaller margins than the default 1" all around. I used .75", but most anything in that range will work fine. I saved the result as .Docx for the additional space savings.
3) Calibre is a great tool, but it won't read Word documents (.doc or .docx), so I saved the result out as .HTML.
4) Load the HTML into Calibre, add your metadata and hit Convert. Voila: the 39 MB source was now an 11 MB .LRF file (substitute the output format for your own book reader).
Not perfect by a long shot. On my reader most books are 0.2 - 0.5 MB, and this one was 11.1, but that's a heck of a lot better than the huge files I was getting before. |
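For anyone who would rather script the last step than click the Convert button, here is a minimal sketch. It assumes calibre's `ebook-convert` command-line tool is installed and on the PATH; the helper names and the file names are made up for illustration and are not from the posts above.

```python
import shutil
import subprocess


def build_convert_command(source, target):
    """Return the argument list for calibre's ebook-convert tool.

    ebook-convert chooses the output format from the target's extension,
    so "book.lrf" asks for LRF output.
    """
    return ["ebook-convert", source, target]


def convert(source, target):
    """Run the conversion, but only if ebook-convert can be found."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is calibre installed?")
    subprocess.run(build_convert_command(source, target), check=True)


# Hypothetical file names, for illustration only.
print(build_convert_command("gulf_war.html", "gulf_war.lrf"))
```

Calling `convert("gulf_war.html", "gulf_war.lrf")` would run the same conversion as the GUI button, with default settings.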
08-24-2010, 09:19 PM | #9 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
It's not clear to me what you're working around. In an earlier post, you indicated you wanted text, not images of text.
Quote:
|
|
08-25-2010, 04:07 AM | #10 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
Sorry for being unclear.
Most e-books are in the 200 KB range. I have a batch of scanned PDFs that are way larger than usual, anywhere from 20 MB to 50 MB each. I believe the extra size is due to the whole dang thing having been scanned as images rather than being OCRed and tagged at scan time. It gets even worse, because when I import them into Calibre they blow up to between 3 and 5 times the size of the already large PDF (sometimes way more than that). So what I've been trying to do is shrink these already scanned PDFs into something usable on my book reader. I want to keep both the text and the various illustrations, not just the text alone or the original page images. To do this, I've tried several different things:
1) I tried using the Reduce File Size option in Acrobat and limiting compatibility to only the latest version. This helps some, but not a huge lot, and when I import into Calibre the file size goes up hugely (into the 100 MB range for some files).
2) I tried using the OCR function in Acrobat on some of these files so it OCRs the text inside the book and knows what's text and what's images. It doesn't seem to do what I want. I think, as you noted, it tags text within the images and keeps the image. Not what I needed.
3) I tried a commercial OCR tool on the source files (OmniPage). It was horrible. It couldn't tell the difference between place names in a map, which should be kept as an image and not OCRed, and inline text, which should be OCRed. Also, there were literally thousands of places where it couldn't recognize words in a 100 page book. If you've ever seen v1.0 of a scanned document before the cleanup, you'll know what I mean. For OCR and recognizing images vs. text, Acrobat seems to do a far better job.
4) What seems to work for getting the file sizes down somewhat (to about 1/3 of the starting PDF size, but still way bigger than other e-books) is to export the original document and generate tags for it on export, then convert to a format Calibre can use (I used HTML, but I'm sure others would be fine). This took a 39 MB source file and gave me an 11 MB LRF, vs. the well over 100 MB, and in some cases 300 MB+, that I got from simply loading the PDF into Calibre. I still think I should be able to shrink the files far smaller than 11 MB, but at least I don't feel like my files are just exploding in size when loaded into Calibre. |
08-25-2010, 09:18 AM | #11 | ||||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
No problem
Spoiler:
Everything above (it's just your post to this point) is what I thought you were doing. Quote:
Quote:
Quote:
Quote:
To ask this another way, after you're done (using method 4) do you still have each page as an image of that page? If it's still an image of the page, what image format is the image in? jpg? tiff? gif? I know it's embedded in an ebook format, but what's going on with the image? If it's not just an image of the page (i.e., an image of the text on a page) where did the OCR image->text conversion occur? |
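One way to answer the "what image format?" question for an exported page image is to look at the first few bytes of the file, which carry a format signature. A minimal sketch; the function names are hypothetical, and it only knows the four formats listed.

```python
def sniff_image_format(data):
    """Guess an image format from the leading signature bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # TIFF starts with a byte-order mark: "II" (little-endian) or "MM" (big-endian).
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"
    return "unknown"


def sniff_file(path):
    """Read just the first 16 bytes of a file and guess its image format."""
    with open(path, "rb") as handle:
        return sniff_image_format(handle.read(16))


# In-memory example: a JPEG signature followed by padding bytes.
print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 12))
```

Pointing `sniff_file` at a page image exported from Acrobat would show whether the pages are stored as JPEG, TIFF, or something else, regardless of the file extension.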
||||
08-25-2010, 10:43 AM | #12 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
You are correct sir.
Thanks for helping me with the troubleshooting process. What I thought was OCR happening may not have been. Dang if I know why the files reduced in size. I'd love to be able to get them down to the normal e-book size of a few hundred KB.
You were quite correct that the pages are just images of 100-200 KB each. I confirmed this by going into Acrobat and exporting the images, and as you can see below, each page is just an image. Here's an example page: http://i1025.photobucket.com/albums/...Image_0001.jpg
In this example, there's an image at the bottom, but the rest is plain text, so to a layperson like me it looks like the size of the document could be reduced dramatically by OCR in the appropriate places. When the file sizes shrank, I assumed that was what had happened, but now it doesn't look like it. Which I guess brings me back to a whole different question: is there a tool you'd recommend to do that? It doesn't look like the ones I've been trying are doing what I thought they did. |
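A back-of-the-envelope sketch of why page images dominate the size. The 100-200 KB per page and the 100 page book come from this thread; the 2 KB per page for plain text is an assumption (roughly 2,000 characters of text per page), and the function name is made up.

```python
def estimate_mb(pages, kb_per_page):
    """Estimate total size in MB for a book of `pages` pages."""
    return pages * kb_per_page / 1024


PAGES = 100              # a 100 page book, as mentioned earlier in the thread
TEXT_KB_PER_PAGE = 2     # assumption: about 2,000 characters of plain text

low = estimate_mb(PAGES, 100)    # every page stored as a 100 KB image
high = estimate_mb(PAGES, 200)   # every page stored as a 200 KB image
text = estimate_mb(PAGES, TEXT_KB_PER_PAGE)

print(f"as page images: {low:.1f} to {high:.1f} MB")
print(f"as plain text:  {text:.2f} MB")
```

Under those assumptions the book is roughly 10 to 20 MB as page images versus about 0.2 MB as text, which is in line with the few hundred KB of a normal e-book before any real illustrations are added back.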
08-25-2010, 02:06 PM | #13 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
The only way to get to what you want is to do OCR, clean up any OCR errors, then delete the images of text and leave the true images you want to save. I know how to do it all manually, but not automatically to produce anything I like. I'm reading this because I hoped you had a solution for me. |
|
08-25-2010, 04:42 PM | #14 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2008
Device: iPad, Macbook Pro, Kindle
|
Been pounding my head against OmniPage Professional some more and I found a way to make it much less annoying.
What I've done so far is to export all the page images from Acrobat as JPGs, then import them into OmniPage. What was driving me crazy before was that it does a really bad job setting zones, that is, identifying what part of each page is images and should not be OCRed, and what part should be OCRed. It's slow and tedious, but you can go into each page and manually draw the box for what part is text, graphics, forms or to be ignored. Then, unfortunately, it asks you several zillion times if you'd like to change a word that it scanned correctly to a completely wrong one, plus there's a stack of places where it scans to just plain gibberish.
What's bugging me is that I put a couple of hours of work into one book and it doesn't look like it saved me any space. The raw JPGs are 16 MB, but the RTF output from OmniPage is about 40 MB.
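One possible contributor to the RTF coming out larger than the JPGs (an assumption, not something confirmed in this thread): RTF commonly stores embedded pictures as hexadecimal text, two characters per byte of image data, so any images kept in the output roughly double in size. A quick sketch of that overhead using a made-up payload and a hypothetical helper name:

```python
import binascii
import os


def hex_encoded_size(raw):
    """Return how many characters `raw` takes once written as hex text."""
    return len(binascii.hexlify(raw))


# Made-up stand-in for image data: 1,000 random bytes.
payload = os.urandom(1000)
print(len(payload), "bytes raw,", hex_encoded_size(payload), "characters as hex")
```

If that is what is happening here, 16 MB of JPG data would take about 32 MB as hex text, which would account for most of the 40 MB.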
|