Old 05-29-2026, 08:59 AM   #3
ElMiko
Fanatic
I take a similar approach, since the end of the preceding paragraph is a more reliable signal than the start of the next one when it comes to rejoining incorrectly broken paragraphs.

But the search above looks like it would have thousands of false positives. Basically it's going to match the end of every single paragraph in the book. You might as well just search for </p>. This seems... inefficient, no?

FWIW, I use:

Code:
([a-z]|[a-z]-|,|(?<!nbsp|&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=</p>\s+<p[^>]*?>[a-z]))</p>\s+<p[^>]*?>
and replace it with

Code:
\1(followed by a space)
The gobbledygook above is basically looking for paragraphs that end in:
  • any lowercase letter
  • any lowercase letter that is followed by a hyphen
  • a comma
  • a semicolon
  • a closing curly quote preceded by a comma
  • honorifics (Dr., Mr., Ms., etc.)
  • single capital letters that are also words (A and I)
  • closing curly quotes, closing </i> tags, and em dashes, but only when the next paragraph begins with a lowercase letter.
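
For anyone who wants to try this outside the editor, here is a rough Python sketch of the same search and replace. It is an adaptation, not the exact expression above: Python's stdlib `re` only accepts fixed-width lookbehinds, so `(?<!nbsp|&#160)` is split into two separate lookbehinds. The function name `rejoin_paragraphs` is made up for illustration.

```python
import re

# Adapted from the search above. Python's stdlib `re` needs fixed-width
# lookbehinds, so (?<!nbsp|&#160) becomes (?<!nbsp)(?<!&#160).
REJOIN = re.compile(
    r'([a-z]|[a-z]-|,|(?<!nbsp)(?<!&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]'
    r'|(”|—|</i>)(?=</p>\s+<p[^>]*?>[a-z]))'
    r'</p>\s+<p[^>]*?>'
)

def rejoin_paragraphs(html: str) -> str:
    """Replace each matched paragraph break with group 1 plus a space."""
    return REJOIN.sub(r'\1 ', html)

if __name__ == '__main__':
    sample = ('<p>He walked to the</p>\n'
              '<p>store and bought milk.</p>\n'
              '<p>Then he left.</p>')
    print(rejoin_paragraphs(sample))
```

Only the first break in the sample is rejoined, because "milk." ends in a full stop, which none of the alternatives match.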

This will not catch everything, obviously, and it will rejoin things like verse that should not be rejoined. Which is why, to JSWolf's overstated point in the other thread, any automation like this needs to be combined with a quick page-by-page visual scan of the original doc (for example, a PDF), both to find stragglers (mostly where the last line of a page ends in terminal punctuation but isn't actually the end of the paragraph) and to un-join verse and other idiosyncratically formatted blocks.

And some errors are just going to be unavoidable without an absurd (and unhelpful) level of obsessiveness... but this is true of physical media as well.

PS - this is also paired with dozens of other searches, some of which can help quickly identify other cases of incorrectly broken paragraphs, such as searching for quotations that haven't been appropriately closed. e.g.:

Code:
“Watch out! Stay away from there! It's not safe.

Stay close to me,” he said.
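
One possible way to hunt for cases like that, sketched in Python. This is not one of the searches I mentioned above, just a minimal illustration of the idea: flag any paragraph that has more opening curly quotes than closing ones. It assumes simple, un-nested `<p>` elements, and the name `unclosed_quote_paragraphs` is hypothetical. Expect false positives wherever a speech legitimately runs over several paragraphs.

```python
import re

# Matches a simple, un-nested <p ...>...</p> element and captures its contents.
PARA = re.compile(r'<p\b[^>]*>(.*?)</p>', re.DOTALL)

def unclosed_quote_paragraphs(html: str) -> list[str]:
    """Return the contents of paragraphs with more “ than ” characters."""
    flagged = []
    for match in PARA.finditer(html):
        text = match.group(1)
        if text.count('“') > text.count('”'):
            flagged.append(text)
    return flagged

if __name__ == '__main__':
    sample = ("<p>“Watch out! Stay away from there! It's not safe.</p>\n"
              "<p>Stay close to me,” he said.</p>\n"
              "<p>“Fine,” she said.</p>")
    for text in unclosed_quote_paragraphs(sample):
        print(text)
```

Each flagged paragraph is a candidate for a manual check against the source, not an automatic rejoin.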

Last edited by ElMiko; 05-30-2026 at 06:18 AM.