MobileRead Forums - View Single Post - PDF extraction

Elfwreck · 09-26-2009, 05:13 PM

Quote:

Originally Posted by orion2001

I posted in another thread regarding this, but you seem to have a lot of experience with PDF->Word conversions.

An insane amount. I've been working with PDF conversions for 10 years. (I still miss some features of Acrobat 4 that got dropped in later updates.) (Not that I want to go back. I just wish they'd change those few features.)

Quote:

You outlined a lot of postprocessing that you do. Does your converter insert paragraph breaks at the end of a page even if a sentence is continued on the next? If so, do you go in and manually delete every spurious paragraph break for each page? I can't figure out if there is software smart enough to not include these breaks at the end of a page, or if there is an easy way to correct for it.
Thanks!

Yes, it keeps the original page breaks, which means it adds a paragraph break at each of those spots. If the document is short, I sometimes scroll through & manually remove the page breaks/paragraph breaks at the end of each page.

Otherwise, I look for ways to identify paragraph breaks in the wrong places. This starts with removing unwanted page breaks; sometimes I remove them all (replace with a space); sometimes I try to keep the ones before chapter breaks, if chapter headers have identifiable typographical features that I can search for.
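For anyone doing this on exported plain text rather than in Word, here is a rough sketch of the page-break step. It assumes page breaks survive the export as form-feed characters (\f) and that chapter headers start with the word "Chapter" or "CHAPTER"; both are assumptions for illustration, and the function name and pattern are made up, so adjust them to the document at hand.

```python
import re

# Assumed chapter-header pattern (hypothetical); change it to match the real headers.
CHAPTER_HEADER = r"(?:CHAPTER|Chapter)\s+\w+"

def strip_page_breaks(text, keep_before_chapters=True):
    """Replace form-feed page breaks with a space.

    If keep_before_chapters is True, a page break that is immediately
    followed by a chapter header is left in place.
    """
    if keep_before_chapters:
        # Negative lookahead: only match form feeds NOT followed by a header.
        return re.sub(r"\f(?!" + CHAPTER_HEADER + r")", " ", text)
    return text.replace("\f", " ")
```

Calling strip_page_breaks(text, keep_before_chapters=False) corresponds to the "remove them all" case.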

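Then: Search for [any letter]^p (or [any letter][space]^p), replace with [find what text]qqq, then replace ^pqqq with [space]. The first pass tags every paragraph break that follows a letter with the marker qqq (any string that doesn't occur in the text will do); the second pass removes the tagged break and the marker together.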
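The same letter-then-paragraph-break replacement can be sketched with a regular expression, assuming the text has been exported so that each paragraph mark (^p) is a single newline. With capture groups the qqq marker isn't needed, so it collapses into one pass. The function name is made up for illustration.

```python
import re

def join_broken_paragraphs(text):
    """Join a paragraph to the next one when it ends in a letter.

    Matches: one letter, an optional single space, then a newline.
    Keeps the letter and replaces the rest with one space.
    Assumes one newline per paragraph break.
    """
    # [^\W\d_] matches any Unicode letter (word character minus digits and underscore).
    return re.sub(r"([^\W\d_]) ?\n", r"\1 ", text)
```

Breaks that follow punctuation are left alone, so ordinary paragraph endings survive.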

This doesn't work if some paragraphs are supposed to end with letters instead of punctuation (table cells, for example), so it may involve some checking & manual touch-up. And it won't catch the case where a sentence ends at the bottom of one page but the paragraph is supposed to continue on the first line of the next.

Sometimes I can search for tabs or indentation of first line--often, anything that's not indented is either a chapter header or should be part of the previous page. So, semi-manual: search, then manually fix.
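A small sketch of the "search" half of that semi-manual pass, again assuming plain text with one newline per paragraph and assuming first-line indents show up as a leading tab or four spaces (both assumptions; the function name is hypothetical). It only lists candidates; deciding whether each one is a chapter header or a continuation is still a manual call.

```python
def find_unindented(text, indent_markers=("\t", "    ")):
    """Return (paragraph_number, paragraph) pairs for non-empty paragraphs
    that do not start with one of the assumed indent markers."""
    candidates = []
    for number, paragraph in enumerate(text.split("\n"), start=1):
        # Skip blank paragraphs; flag anything that isn't indented.
        if paragraph.strip() and not paragraph.startswith(indent_markers):
            candidates.append((number, paragraph))
    return candidates
```

Each flagged paragraph is either a header to keep or a fragment to merge into the paragraph before it.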

It gets faster with practice. It's always a bit choppy, and never as good as a page-by-page QC, although I find it plenty acceptable for personal reading. Since most of the PDFs I convert this way are either not legal to distribute, or only of interest to a very limited crowd (I convert legal rulings from PDF to neatly-formatted Word docs for friends), I've not had to develop anything that works more smoothly.