Thread: Searching NOT
Old 01-22-2014, 04:54 PM   #5
Tex2002ans
These are the three regexes I use.

Regex #1: This catches every paragraph that ends in a hyphen. I replace these on a case-by-case basis, just to make sure the combined word does not actually require the hyphen.

Code:
-</p>\s+<p>
Replace:

Code:
(NOTHING, just a complete blank)
Sample:

Code:
<p>This is a paragraph which has a hyphen-</p>
<p>ated paragraph.</p>
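
For anyone who wants to try Regex #1 outside of Sigil, here is a minimal sketch using Python's `re` module as a stand-in for Sigil's regex engine. The name `HYPHEN_BREAK` is just made up for the example, and it replaces every match at once, whereas in practice I step through them one by one.

```python
import re

# Regex #1 from the post: a hyphen right before the end of a paragraph.
# HYPHEN_BREAK is a hypothetical name chosen for this sketch.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

sample = (
    "<p>This is a paragraph which has a hyphen-</p>\n"
    "<p>ated paragraph.</p>"
)

# Replacing the match with nothing joins the two halves of the word.
joined = HYPHEN_BREAK.sub("", sample)
print(joined)
```

A blanket replace like this would also strip hyphens that belong in the word (a compound split at its hyphen), which is why these are worth checking individually.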
Regex #2: This catches every paragraph that ends in a character that is NOT a '>', '”' (right double quote), '?', '!', '.':

Code:
([^>”\?\!\.])</p>\s+<p>
Replace (make sure there is a SPACE after the \1):

Code:
\1
Sample:

Code:
<p>This is a sample,</p>
<p>of a paragraph that will</p>
<p>be caught by the regex above.</p>
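
Here is a rough sketch of Regex #2 at work, again assuming Python's `re` module in place of Sigil's engine. `SOFT_BREAK` is a hypothetical name, and the replacement string is the captured character followed by a single space.

```python
import re

# Regex #2 from the post: a paragraph whose last character is NOT
# '>', a right double quote, '?', '!' or '.'.
# SOFT_BREAK is a hypothetical name chosen for this sketch.
SOFT_BREAK = re.compile(r"([^>”\?\!\.])</p>\s+<p>")

sample = (
    "<p>This is a sample,</p>\n"
    "<p>of a paragraph that will</p>\n"
    "<p>be caught by the regex above.</p>"
)

# Keep the captured character and add one space, so the words
# on either side of the break do not run together.
merged = SOFT_BREAK.sub(r"\1 ", sample)
print(merged)
```

The last paragraph ends in a period, so it is left alone and the three lines collapse into one paragraph.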
Regex #3: This usually catches any paragraphs that were not combined by the two regexes above but begin with a lowercase letter (usually these should be combined, but sometimes there was a blockquote beforehand, or something else odd in the text).

Code:
<p>[a-z]
Sample:

Code:
<p>In 2014, Tex gave an informative sample:</p>
<blockquote><p>Here is a quote.</p></blockquote>
<p>here is more information.</p>
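
Regex #3 is a search only, with no replacement. A minimal sketch of the same check in Python (assuming one paragraph per line; `LOWERCASE_START` and `suspects` are names made up for the example) just lists the paragraphs to look at by hand:

```python
import re

# Regex #3 from the post: a paragraph that opens with a lowercase letter.
# LOWERCASE_START is a hypothetical name chosen for this sketch.
LOWERCASE_START = re.compile(r"<p>[a-z]")

sample = (
    "<p>In 2014, Tex gave an informative sample:</p>\n"
    "<blockquote><p>Here is a quote.</p></blockquote>\n"
    "<p>here is more information.</p>"
)

# Collect the lines that need a manual look; nothing is changed here.
suspects = [line for line in sample.splitlines() if LOWERCASE_START.search(line)]
for line in suspects:
    print(line)
```

Only the last line is flagged. Whether it should be merged, capitalized, or left alone is a judgment call for each hit.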
These three tackle nearly all broken paragraphs in my experience.

Other oddities, such as paragraphs ending in semicolons or colons, will be pointed out by Regex #2, and those can be fixed on a case-by-case basis. (Sometimes they should be combined, sometimes they should not.)

I keep these three in my Sigil Saved Searches (Tools - Saved Searches). More info on how to use Saved Searches can be found here:

http://web.sigil.googlecode.com/git/..._searches.html