09-26-2009, 05:13 PM   #15
Elfwreck
Quote:
Originally Posted by orion2001
I posted in another thread regarding this, but you seem to have a lot of experience with PDF->Word conversions.
An insane amount. I've been working with PDF conversions for 10 years. (I still miss some features of Acrobat 4 that got dropped in later updates.) (Not that I want to go back. I just wish they'd change those few features.)

Quote:
You outlined a lot of postprocessing that you do. Does your converter insert paragraph breaks at the end of a page even if a sentence is continued on the next? If so, do you go in and manually delete every spurious paragraph break for each page? I can't figure out if there is software smart enough to not include these breaks at the end of a page, or if there is an easy way to correct for it.
Thanks!
Yes, it keeps the original page breaks, which means adding paragraph breaks in those spots. If it's short, I sometimes scroll through & manually remove the page breaks/paragraph breaks at the ends of each page.

Otherwise, I look for ways to identify paragraph breaks in the wrong places. This starts with removing unwanted page breaks; sometimes I remove them all (replace with a space); sometimes I try to keep them before chapter breaks, if chapter headers have identifiable typographical features that I can search for.
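
For anyone doing this outside Word, the page-break step might look roughly like the Python sketch below. It assumes the text has been exported as plain text with form-feed characters marking page breaks, and the chapter-heading pattern (CHAPTER_RE) is purely hypothetical; it would need to match whatever the headers in a given document actually look like.

```python
import re

# Hypothetical pattern for a chapter header; adjust to the document at hand.
CHAPTER_RE = re.compile(r"\s*(CHAPTER|Chapter)\s+\w+")

def strip_page_breaks(text, keep_before_chapters=True):
    """Replace form-feed page breaks with a space, optionally keeping
    the ones that come directly before a chapter header."""
    parts = text.split("\f")
    out = parts[0]
    for part in parts[1:]:
        if keep_before_chapters and CHAPTER_RE.match(part):
            out += "\f" + part   # keep the break before a chapter header
        else:
            out += " " + part    # otherwise the break becomes a space
    return out
```

Calling strip_page_breaks(text, keep_before_chapters=False) corresponds to the "remove them all" case.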

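Then: Search for [any letter]^p (or [any letter][space]^p), replace with [find what text]qqq, then replace ^pqqq with [space]. (In Word's Find and Replace, "any letter" is ^$ and "find what text" is ^&; qqq is just a marker string that doesn't occur in the text.)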

This doesn't work if some paragraphs are supposed to end with letters instead of punctuation (like tables), so it may involve some checking & manual touch-up. And it won't catch cases where a sentence ends at the bottom of one page but the first line of the next page is supposed to be part of the same paragraph.
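
The two-pass marker replace can be sketched in Python for plain-text exports. This is only an approximation of the Word steps: it assumes "\n" stands in for the paragraph mark, treats only ASCII letters as "any letter", and uses a hypothetical function name. Like the Word version, it will wrongly join table rows or other paragraphs that legitimately end in a letter.

```python
import re

MARKER = "qqq"  # placeholder string assumed not to occur in the real text

def join_broken_paragraphs(text):
    """Join paragraph breaks that follow a letter (or letter + space)."""
    # Pass 1: tag every break that follows a letter, keeping the matched text.
    tagged = re.sub(r"([A-Za-z] ?\n)", r"\g<1>" + MARKER, text)
    # Pass 2: turn each tagged break into a single space.
    return tagged.replace("\n" + MARKER, " ")
```

A break that follows punctuation is left alone, so "fox jumped.\nNext" stays as two paragraphs while "brown\nfox" becomes "brown fox".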

Sometimes I can search for tabs or indentation of first line--often, anything that's not indented is either a chapter header or should be part of the previous page. So, semi-manual: search, then manually fix.
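
The "search, then manually fix" pass could be approximated with a small Python helper that only lists suspects and changes nothing. It assumes the indentation survives as a literal tab or at least two leading spaces in the exported text (in Word itself it may be paragraph formatting rather than characters), and the function name is hypothetical.

```python
def find_unindented(paragraphs):
    """Return (index, text) for non-empty paragraphs with no first-line indent.

    Each hit is a candidate chapter header or a continuation of the
    previous page; a person still has to decide which.
    """
    return [
        (i, p)
        for i, p in enumerate(paragraphs)
        if p.strip() and not p.startswith(("\t", "  "))
    ]
```

Reviewing the returned list by hand keeps chapter headers intact while the continuation lines get merged back.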

It gets faster with practice. It's always a bit choppy, and never as good as a page-by-page QC, although I find it plenty acceptable for personal reading. Since most of the PDFs I convert this way are either not legal to distribute, or only of interest to a very limited crowd (I convert legal rulings from PDF to neatly-formatted Word docs for friends), I've not had to develop anything that works more smoothly.

Last edited by Elfwreck; 09-26-2009 at 05:15 PM.