Another trick I would suggest, based on experience with a number of books I've cleaned up this way: don't try to do it all in one pass. Take an iterative approach. In the example I showed above, my next step would have been a search for \w*\b</p>\n<p> (zero or more word characters, followed by a word boundary, a closing paragraph tag, a newline, and an opening paragraph tag). This catches the places where my earlier substitution created a paragraph break between two words in the middle of a sentence, rather than at the end of a sentence.
It's still not perfect, but by now I'm starting to get a readable book instead of one that's too annoying to bother with.
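As a rough illustration of that second pass, here is a minimal Python sketch. It uses the same pattern, with a capture group added so the matched word can be kept. The post only describes the search, so the replacement (joining the two halves with a single space) is my assumption, and the function name and sample text are made up for the example.

```python
import re

# Same pattern as in the post, with a capture group around the word so it
# can be put back in the replacement.
MID_SENTENCE_BREAK = re.compile(r"(\w*)\b</p>\n<p>")


def join_mid_sentence_breaks(html: str) -> str:
    """Rejoin paragraphs that were split between two words.

    Assumption: the fix is to replace the stray tags with a single space.
    """
    return MID_SENTENCE_BREAK.sub(r"\1 ", html)


# Hypothetical sample: the first break is mid-sentence, the second is not.
sample = "<p>It was a dark and</p>\n<p>stormy night.</p>\n<p>The rain fell.</p>"
cleaned = join_mid_sentence_breaks(sample)
print(cleaned)
```

Breaks that follow sentence-ending punctuation are left alone, because the word boundary requires a word character immediately before the closing tag. Headings and other paragraphs that legitimately end without punctuation would still be joined, which is one reason to review the matches rather than replace blindly.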