View Full Version : PDF extraction what is the best tool?


Prospect
03-12-2008, 09:11 AM
When converting PDFs to MobiPocket for my Cybook I have so far used MobiPocket Creator and Adobe Acrobat v6.

I think that the best result is achieved if I export the PDF to HTML from Acrobat and then convert the HTML file to .prc using MobiPocket Creator, instead of converting directly from PDF to .prc in MobiPocket Creator.

As far as I know I could also use BookDesigner for this task.

The conversion is never perfect and there are always issues with formatting.

How do you extract your PDFs? What is the best current tool/process? Will I achieve better results if I update Acrobat to the latest version?

RWood
03-12-2008, 10:04 AM
For me the best tool is ABBYY PDF Transformer. As I remember it is about $99. It creates MS Word documents that can be edited and loaded into BD.

wgrimm
03-29-2008, 11:55 AM
When converting PDFs to MobiPocket for my Cybook I have so far used MobiPocket Creator and Adobe Acrobat v6.

I think that the best result is achieved if I export the PDF to HTML from Acrobat and then convert the HTML file to .prc using MobiPocket Creator, instead of converting directly from PDF to .prc in MobiPocket Creator.



I have tried a great many software packages for PDF conversion. I own the latest Adobe Acrobat, and it is one of my least favorites for this task. DocUnPDF works very well (Mac and Win versions available, $60 or so) and will output in many formats including html and lrf. The company's reps are pretty nice, and will work with customers- for example, issuing license codes for 2 installations of the software when you buy it, so you can have one install at home and one at work.

My other favorite is Gemini, by Iceni, a British software company. Its output from pdf to html is the best I have seen, but you do pay a price- $159 when I bought it. It's absolutely top-of-the-line.

Gemini and UnPDF are the only two programs out there I would recommend for this task.

wallcraft
03-29-2008, 01:15 PM
DocUnPDF works very well

I can't find DocUnPDF. Did you mean deskUNPDF?

Prospect
03-29-2008, 02:41 PM
I downloaded the demo version of Gemini and I agree that it is a great tool that works better than both Adobe Acrobat and MobiPocket Creator.

Thanks!

wgrimm
03-29-2008, 06:40 PM
I can't find DocUnPDF. Did you mean deskUNPDF?

Sorry, you are right. They have an upgrade offer now, and a couple of bundle specials.

tomsheeley
04-14-2008, 08:04 PM
I have had great luck with the different converters from ABC Amber ( http://www.processtext.com/ ).

They have a PDF converter that converts to almost any format you can think of, for only $12.95.

I use the company's "MS Lit" converter almost every day, as so many ebooks are released in Lit format, which my Palm TX cannot read.

Hope it helps!

JSWolf
04-24-2008, 07:45 PM
ABC Amber Lit converter doesn't work well as it's based on a buggy version of ConvertLIT.

DDHarriman
04-25-2008, 03:17 PM
Acrobat Pro 8.0 (export as text or Word) and OmniPage Pro 16 (OCR the PDF file and save as text or Word).

WillAdams
04-25-2008, 05:12 PM
The best tool I've found for this is Marcel Weiher's TextLightning.app, available from www.metaobject.com (ob. discl. I was a beta-tester). Although it's a Mac OS X app, it's available for Linux and could probably be compiled for Windows using the recently improved Windows support that GNUstep (www.gnustep.org) affords.

William

stilliremain
08-20-2009, 09:10 AM
LRF conversion seems to have been removed from docudesk unpdf professional version 3.0? Can anyone confirm? I've downloaded trials of 2 and 3 and this seems to be the case...

Christina789
08-24-2009, 06:54 AM
When converting PDFs to MobiPocket for my Cybook I have so far used MobiPocket Creator and Adobe Acrobat v6.

I think that the best result is achieved if I export the PDF to HTML from Acrobat and then convert the HTML file to .prc using MobiPocket Creator, instead of converting directly from PDF to .prc in MobiPocket Creator.

As far as I know I could also use BookDesigner for this task.

The conversion is never perfect and there are always issues with formatting.

How do you extract your PDFs? What is the best current tool/process? Will I achieve better results if I update Acrobat to the latest version?

When I only need to extract some text content from a PDF file, I use the freeware AnyBizSoft PDF to Text. It extracts text from PDF files.
Since you want to retain the format, I think converting PDF to Word or HTML, and then to .prc, could be a choice. Anyway, I think there will be problems with formatting once a file has been converted two or more times with different tools.

Elfwreck
08-24-2009, 09:46 PM
I extract PDFs to Word docs (or RTF; the file is the same from my viewpoint), and then edit the Word doc. If I were more fluent in HTML, I'd extract to that--and expect to spend the same amount of time editing the HTML file as I spend on the average PDF-to-Word conversion.

I generally have to fix the page sizes & margins, remove text boxes, change pictures to inline with text, and do odd things to get rid of the page numbers & headers. Then I fix the paragraph settings starting by making them all single-spaced, and removing the right & left margin indents if any; if it's reasonable, I change them all to the same before & after amounts and justification. Then I set the font--make it all one font, use find & replace to fix the sizes, make sure it's all 100% size, not condensed or expanded.

I'd expect HTML files to work better if you normalize the fonts, remove the extra "div" sections and "align" tags, and get rid of tables that force the page structure.

Basic novels should transfer nicely. Of course, basic novels probably transfer fine from the original PDF straight to Mobi. It's when there are other formatting aspects that the conversion breaks down, and none of the auto-converters shines as the best one, because PDF wasn't designed to be a convert-from format.

orion2001
09-26-2009, 04:12 PM
Hi Elfwreck,

I posted in another thread regarding this, but you seem to have a lot of experience with PDF->Word conversions. You outlined a lot of postprocessing that you do. Does your converter insert paragraph breaks at the end of a page even if a sentence is continued on the next? If so, do you go in and manually delete every spurious paragraph break for each page? I can't figure out if there is any software smart enough to not include these breaks at the end of a page, or if there is an easy way to correct for it.
Thanks!

Elfwreck
09-26-2009, 06:13 PM
I posted in another thread regarding this, but you seem to have a lot of experience with PDF->Word conversions.

An insane amount. I've been working with PDF conversions for 10 years. (I still miss some features of Acrobat 4 that got dropped in later updates.) (Not that I want to go back. I just wish they'd change those few features.)

You outlined a lot of postprocessing that you do. Does your converter insert paragraph breaks at the end of a page even if a sentence is continued on the next? If so, do you go in and manually delete every spurious paragraph break for each page? I can't figure out if there is any software smart enough to not include these breaks at the end of a page, or if there is an easy way to correct for it.
Thanks!

Yes, it keeps the original page breaks, which means adding paragraph breaks in those spots. If it's short, I sometimes scroll through & manually remove the page breaks/paragraph breaks at the ends of each page.

Otherwise, I look for ways to identify paragraph breaks in the wrong places. This starts with removing unwanted page breaks; sometimes I remove them all (replace with a space); sometimes I try to keep them before chapter breaks, if chapter headers have identifiable typographical issues that I can search for.

Then: Search for [any letter]^p (or [any letter][space]^p), replace with [find what text]qqq, then replace ^pqqq with [space].

This doesn't work if some paragraphs are supposed to end with letters instead of punctuation (like tables), so it may involve some checking & manual touch-up. And it won't catch sentences that ended on one page, and the first line of the next page is supposed to be part of the same paragraph.

Sometimes I can search for tabs or indentation of first line--often, anything that's not indented is either a chapter header or should be part of the previous page. So, semi-manual: search, then manually fix.

It gets faster with practice. It's always a bit choppy, and never as good as a page-by-page QC, although I find it plenty acceptable for personal reading. Since most of the PDFs I convert this way are either not legal to distribute, or only of interest to a very limited crowd (I convert legal rulings from PDF to neatly-formatted Word docs for friends), I've not had to develop anything that works more smoothly.

orion2001
09-26-2009, 08:07 PM
Thanks a lot ElfWreck! Actually, I spent some more time trying to learn about Regular Expressions (used by most text editors for Search and Replace) and I ended up doing this:

Converted PDF -> HTML

Now all the unwanted mid sentence pagebreaks are basically those that look like *</p> where * is some character other than a period (since a period indicates end of sentence and probably end of para). I used Komodo Edit (http://www.activestate.com/komodo_edit/) which is a free and powerful text/html editor to then open the HTML file. Then I used the Edit->Replace Feature (Ctrl-H) and entered the following:

(Make sure the following boxes are checked: Regex, Multiline and Replace)

Enter the following in the section - Find what:
([^\.'"!?:\)])</span></p>
<p><span class=font3>


Enter the following in the section - Replace with:
\1

(Note: \1 above actually has a space after the 1)

In my particular HTML file, paragraphs end as </span></p> and then
<p><span class=font3> would start the next para.

What the Regex expression above does is only find those paragraph breaks that do not have a (. , !, ), ?, : ) character just preceding the paragraph break (since those would indicate complete sentences and probably the end of a legit paragraph).

Now you can keep using the Find and Replace feature to rapidly cycle through all instances to find these faulty paragraph breaks. If a break is indeed faulty, you just hit Replace and the whole code section is replaced by a space, thus bridging the broken sentence together. This ended up working really well and I managed to fix a 300 page book (300 PDF pages, that is) in 10 minutes! You can actually go even faster if you just use the Replace All feature, although you might end up taking out a couple of legit paragraph breaks (for example, some paragraph-ending sentences might end with a comma or some other character not being checked for).

Note- You can easily change the literal (non-regex) part of the code above to match how paragraphs end and start in your particular HTML file.
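
For anyone who would rather script the Replace All step than click through an editor, here is a minimal Python sketch of the same find/replace. It is not what was run in Komodo; the function name `join_broken_paragraphs` and its parameters are made up for illustration, the closing and opening markup defaults are the ones from the HTML file described above, and `\s*` stands in for the literal line break between the two lines of the search pattern (an assumption about how the file is laid out).

```python
import re

def join_broken_paragraphs(html,
                           close_tag="</span></p>",
                           open_tag="<p><span class=font3>"):
    """Merge a paragraph break into a space when the text before it
    does not end with . ' " ! ? : or ) (hypothetical helper)."""
    pattern = re.compile(
        r"""([^.'"!?:)])"""      # group 1: last character is not closing punctuation
        + re.escape(close_tag)   # markup that ends the paragraph
        + r"\s*"                 # line break(s) between the two paragraphs
        + re.escape(open_tag)    # markup that starts the next paragraph
    )
    # Keep the captured character and bridge the two halves with one space.
    return pattern.sub(r"\1 ", html)

sample = ("the quick brown</span></p>\n"
          "<p><span class=font3>fox jumped.</span></p>\n"
          "<p><span class=font3>Next para.</span></p>")
fixed = join_broken_paragraphs(sample)
print(fixed)
```

Like Replace All, this merges every match without review, so a legit paragraph that ends in a comma or a letter would be joined to the next one. Change `close_tag` and `open_tag` to whatever your own converter emits.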

I hope this makes sense. It worked really well for me. I also find that HTML makes doing all the little tweaks really quick and painless.

Also, I'd like to mention that for some reason this Regex Expression refused to work on Notepad++ for me (which is why I moved to Komodo). If anyone can get it to work on Notepad++ do let me know.

Cheers

orion2001
09-26-2009, 10:50 PM
Just a further update regarding Notepad++

Turns out that it isn't capable of using regular expressions with multiline searches (as in this case). You can only use multi-line searches in "Extended mode" but you can't use regular expressions in that mode. I think this, coupled with the lack of integrated secure FTP in Notepad++, is going to make me move entirely to using Komodo as my text editor of choice.

orion2001
09-26-2009, 10:58 PM
Then: Search for [any letter]^p (or [any letter][space]^p), replace with [find what text]qqq, then replace ^pqqq with [space].



If you don't mind, could you explain this to me? I'm not sure what the ^p and the ^pqqq refer to. I'm a bit of a formatting noob :).

Elfwreck
09-27-2009, 01:33 AM
If you don't mind, could you explain this to me? I'm not sure what the ^p and the ^pqqq refer to. I'm a bit of a formatting noob :).

Not knowing those doesn't mean you're a formatting noob; it means you don't use Microsoft Word for formatting. Word's find-and-replace functions use ^ to indicate a non-keyboard character. So ^p is "paragraph break;" ^t is "tab;" ^$ is "any letter;" ^? is "any character;" ^b is "section break;" ^m is "manual page break." (There are more, but there's no need for anyone to learn them; they're part of Word's dropdown menus in the find-and-replace dialog box.)

I use "qqq" as a substitute sequence for multi-stage find-and-replace functions, because Word's abilities are limited. It can find "[any letter][paragraph break]" but doesn't allow "replace the paragraph part of that with a space."

It can format or replace the entire search string, or add something to the beginning or end of it. So I add qqq to the end of it, and then search for "[paragraph break]qqq" and replace *that* with a space.

I use it because qqq is exceedingly unlikely to be repeated anywhere in the body of the book, and I won't accidentally replace real text that way.
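
For readers who don't use Word, the two-stage trick can be sketched in a few lines of Python. This only imitates the idea on plain text: it assumes each paragraph break is a single "\n", treats "any letter" as any Unicode letter, and the names `MARKER` and `join_after_letters` are made up for the example. It is not how Word implements its search.

```python
import re

MARKER = "qqq"  # placeholder that is very unlikely to occur in real text

def join_after_letters(text):
    """Two-stage replace: tag breaks that follow a letter, then
    turn each tagged break into a space (hypothetical helper)."""
    # Stage 1: find [any letter][paragraph break] and append the marker,
    # keeping the matched text, as with "replace with [find what text]qqq".
    tagged = re.sub(r"([^\W\d_]\n)", r"\1" + MARKER, text)
    # Stage 2: replace [paragraph break]qqq with a single space.
    return tagged.replace("\n" + MARKER, " ")

page_text = "It was a dark\nand stormy night.\nThe rain fell."
print(join_after_letters(page_text))
```

Breaks that follow punctuation are left alone, so the same caveats apply as in Word: table rows or headings that end in a letter get joined to the next line and need a manual touch-up.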

I am almost entirely clueless about HTML. I gather the principles are about the same as what I usually do in Word, but I'd have to learn a whole new set of keywords and search options. (Which I should do.) I have Kompozer, and occasionally have tried to work with it. It's confusing, and Word is not, because I have lots of practice with Word and none with HTML editors. (I suspect that Semagic doesn't count as an HTML editor. Most of what I know about HTML, I learned by posting at LiveJournal.)

What the Regex expression above does is only find those paragraph breaks that do not have a (. , !, ), ?, : ) character just preceding the paragraph break (since those would indicate complete sentences and probably the end of a legit paragraph).

I'd add mdashes to that list. And quotation marks.

Same basic principle I use, except Word doesn't have a way to "find all X that don't match Trait Y," nor a way to "find all X with trait A, or B, or C." Much less "find all X that don't match trait A, B, or C." However, it does have "find any letter" separate from "any character" or "any digit." (Does not have "any punctuation.")

The biggest problem working with Word is that the HTML output is atrocious; it has to be ported into something else & converted to be useful to anything other than Frontpage websites. Word 97 had okay HTML output. But you lose a lot of features using the old versions of Word.

orion2001
09-27-2009, 01:41 AM
Thanks! That is very useful. I use Word, but I hate it when it comes time for rigorous formatting. I am currently in the middle of writing my doctorate thesis in Word and I am not having any fun :D. It works OK most of the time but every now and then it does something silly and it is a huge pain hacking at it till I can fix it. I wish I could use LaTeX but my advisor is a MS junkie. Anyways, I think both our approaches ended up being the same albeit via different tools. I am now trying to learn BD as it seems like a very useful tool for creating the final ebook. Thanks again!

Cheers

Elfwreck
09-27-2009, 02:16 AM
Thanks! That is very useful. I use Word, but I hate it when it comes time for rigorous formatting. I am currently in the middle of writing my doctorate thesis in Word and I am not having any fun :D. It works OK most of the time but every now and then it does something silly and it is a huge pain hacking at it till I can fix it. I wish I could use LaTeX but my advisor is a MS junkie.

You could try making it in LaTeX, outputting to PDF, and converting that to Word. The tables would probably have to be reformatted, and the actual formatting would be atrocious from a desktop publishing perspective (footnotes would be loose text at the bottom of the page, not linked to their numbers), but it'd probably *look* right.

If you need it more correctly formatted than that, you could use Open Office, which is similar to Word in structure & workflow but less full of MS's peculiar approach to some formatting concepts. (And free. And if teacher complains, tell him not everyone can afford Microsoft Office.)

orion2001
09-27-2009, 02:34 AM
Heh, thanks but I don't think that will work. I'm going to have to do a back and forth with my files and we use features like track comments/changes, etc. to work on manuscripts. It would end up being too much of a hassle. In addition I use EndNote for my bibliography (and so does he), which would also cause problems. Lastly, he pays for Word and EndNote licenses so I can't quite argue on the monetary front :)