These are the three regexes I use.
Regex #1: This catches all hyphenation at the end of paragraphs. I replace these on a case-by-case basis, just to make sure the combined word does not actually require the hyphen.
Replace:
Code:
(NOTHING, just leave the Replace field completely blank)
Sample:
Code:
<p>This is a paragraph which has a hyphen-</p>
<p>ated paragraph.</p>
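A rough sketch of what Regex #1 does, written in Python so it can be tried outside Sigil. The search pattern here is an assumption (a hyphen directly before `</p>`, whitespace, then the next `<p>`), and the name `join_hyphenated` is made up for illustration. The replacement is the empty string, as described above.

```python
import re

# ASSUMED pattern: hyphen right before </p>, whitespace, then the next <p>.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

def join_hyphenated(html: str) -> str:
    # Replacing the match with nothing glues the two halves of the word together.
    return HYPHEN_BREAK.sub("", html)

sample = "<p>This is a paragraph which has a hyphen-</p>\n<p>ated paragraph.</p>"
print(join_hyphenated(sample))
# <p>This is a paragraph which has a hyphenated paragraph.</p>
```

Running this over a whole file replaces every match at once, so words that legitimately keep their hyphen would lose it. That is why stepping through the matches one by one in Sigil is the safer route.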
Regex #2: This catches every paragraph that ends in a character that is NOT a '>', '”' (right double quote), '?', '!', or '.':
Code:
([^>”\?\!\.])</p>\s+<p>
Replace (make sure there is a SPACE afterwards):
Sample:
Code:
<p>This is a sample,</p>
<p>of a paragraph that will</p>
<p>be caught by the regex above.</p>
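To see the effect of Regex #2 on the sample, here is a small Python sketch. The search pattern is the one given above. The replacement `\1 ` (the captured character followed by a space) is an assumption based on the "SPACE afterwards" note, and `merge_broken` is a hypothetical helper name.

```python
import re

# Search pattern from the post: a paragraph ending in anything other than
# '>', a right double quote, '?', '!' or '.', followed by a new <p>.
BROKEN_PARA = re.compile(r"([^>”\?\!\.])</p>\s+<p>")

def merge_broken(html: str) -> str:
    # ASSUMED replacement: keep the captured character and add one space.
    return BROKEN_PARA.sub(r"\1 ", html)

sample = (
    "<p>This is a sample,</p>\n"
    "<p>of a paragraph that will</p>\n"
    "<p>be caught by the regex above.</p>"
)
print(merge_broken(sample))
# <p>This is a sample, of a paragraph that will be caught by the regex above.</p>
```

Paragraphs that end in a period, question mark, exclamation mark, or closing quote are left alone, because those characters are excluded by the character class.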
Regex #3: This catches most paragraphs that were not combined by the two regexes above but that begin with a lowercase letter. Usually these should be combined with the previous paragraph; otherwise there was a blockquote beforehand, or something else odd in the text.
Sample:
Code:
<p>In 2014, Tex gave an informative sample:</p>
<blockquote><p>Here is a quote.</p></blockquote>
<p>here is more information.</p>
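The search pattern for Regex #3 is not shown here, so the following is only a sketch of the idea in Python. It assumes a pattern along the lines of `<p>` followed by a lowercase ASCII letter, and just lists the suspicious paragraphs rather than changing them. The name `find_lowercase_paragraphs` is hypothetical.

```python
import re

# ASSUMED pattern: a paragraph whose text starts with a lowercase ASCII letter.
# The group captures the text up to the next tag so it can be reviewed.
LOWERCASE_START = re.compile(r"<p>([a-z][^<]*)")

def find_lowercase_paragraphs(html: str) -> list[str]:
    # Only report matches; whether to merge is decided case by case.
    return [m.group(1) for m in LOWERCASE_START.finditer(html)]

sample = (
    "<p>In 2014, Tex gave an informative sample:</p>\n"
    "<blockquote><p>Here is a quote.</p></blockquote>\n"
    "<p>here is more information.</p>"
)
print(find_lowercase_paragraphs(sample))
# ['here is more information.']
```

Reporting instead of replacing fits this case, since a lowercase start after a blockquote is often correct and should not be merged automatically.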
These three tackle nearly all broken paragraphs in my experience.
Other oddities, such as paragraphs ending in semicolons or colons, will be flagged by Regex #2, and those can be fixed on a case-by-case basis. (Sometimes they should be combined, sometimes they should not.)
I keep these three in my Sigil Saved Searches (Tools - Saved Searches). More info on how to use Saved Searches can be found here:
http://web.sigil.googlecode.com/git/..._searches.html