MobileRead Forums - View Single Post

Tex2002ans · 01-22-2014, 05:54 PM

These are the three regexes I use.

Regex #1: This catches all hyphenation at the end of a paragraph. I replace these on a case-by-case basis, just to make sure the combined word does not actually require the hyphen.

Code:

-</p>\s+<p>

Replace:

Code:

(NOTHING, just a complete blank)

Sample:

Code:

<p>This is a paragraph which has a hyphen-</p>
<p>ated paragraph.</p>
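
For anyone who wants to try Regex #1 outside of Sigil, here is a minimal sketch in Python. Sigil uses its own regex engine, so this only shows the idea; the function name `join_hyphenated` is made up for this example. It also replaces every match at once, so it skips the case-by-case check described above.

```python
import re

# Regex #1 from the post: a hyphen immediately before a paragraph break.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

def join_hyphenated(html: str) -> str:
    """Replace every match with nothing, which merges the two paragraphs.

    Hypothetical helper: it does NOT check whether the joined word should
    keep its hyphen, so review the matches first in real use.
    """
    return HYPHEN_BREAK.sub("", html)

sample = (
    "<p>This is a paragraph which has a hyphen-</p>\n"
    "<p>ated paragraph.</p>"
)
result = join_hyphenated(sample)
print(result)
```

A word like "well-</p> <p>known" would lose a hyphen it needs, which is why stepping through matches one at a time is the safer route.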

Regex #2: This catches every paragraph that ends in a character that is NOT a '>', '”' (right double quote), '?', '!', '.':

Code:

([^>”\?\!\.])</p>\s+<p>

Replace (make sure there is a SPACE afterwards):

Code:

\1

Sample:

Code:

<p>This is a sample,</p>
<p>of a paragraph that will</p>
<p>be caught by the regex above.</p>
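
The same find and replace can be sketched in Python, assuming the replacement is `\1` followed by a single space as noted above. The name `join_unfinished` is only for illustration, and Python's `re` module stands in for Sigil's engine here.

```python
import re

# Regex #2 from the post: capture a final character that is not
# > ” ? ! or . when it sits right before a paragraph break.
UNFINISHED = re.compile(r"([^>”\?\!\.])</p>\s+<p>")

def join_unfinished(html: str) -> str:
    """Keep the captured character and put one space in place of the break."""
    return UNFINISHED.sub(r"\1 ", html)

sample = (
    "<p>This is a sample,</p>\n"
    "<p>of a paragraph that will</p>\n"
    "<p>be caught by the regex above.</p>"
)
result = join_unfinished(sample)
print(result)
```

The last paragraph ends in a period, so it is left alone and the three lines collapse into one paragraph.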

Regex #3: This usually catches the paragraphs that were not combined by the two regexes above but that begin with a lowercase letter. (Usually these should be combined with the previous paragraph; otherwise there was a blockquote beforehand, or something odd in the text.)

Code:

<p>[a-z]

Sample:

Code:

<p>In 2014, Tex gave an informative sample:</p>
<blockquote><p>Here is a quote.</p></blockquote>
<p>here is more information.</p>
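
Since Regex #3 is a search to look through rather than a blind replace, a small report of the matching lines is one way to sketch it in Python. The helper name `find_lowercase_starts` is hypothetical, and it assumes one paragraph per line, as in the sample.

```python
import re

# Regex #3 from the post: a paragraph that opens with a lowercase ASCII letter.
LOWERCASE_START = re.compile(r"<p>[a-z]")

def find_lowercase_starts(html: str) -> list[tuple[int, str]]:
    """Return (line number, line text) for each line containing a match."""
    return [
        (number, line)
        for number, line in enumerate(html.splitlines(), start=1)
        if LOWERCASE_START.search(line)
    ]

sample = (
    "<p>In 2014, Tex gave an informative sample:</p>\n"
    "<blockquote><p>Here is a quote.</p></blockquote>\n"
    "<p>here is more information.</p>"
)
hits = find_lowercase_starts(sample)
for number, line in hits:
    print(number, line)
```

Only the third line is reported. Whether it should be merged or just capitalized is still a judgment call for each match.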

These three tackle nearly all broken paragraphs in my experience.

Other oddities such as semicolons or colons will be pointed out by Regex #2, and those can be fixed on a case-by-case basis. (Sometimes they should be combined, sometimes they should not.)
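
To get a feel for how many colon or semicolon cases are waiting for review, the Regex #2 matches can be tallied by their final character. This is just a sketch in Python; the sample text and the name `count_endings` are made up for the example.

```python
import re
from collections import Counter

# Regex #2 from the post, reused only to inspect matches (no replacing).
UNFINISHED = re.compile(r"([^>”\?\!\.])</p>\s+<p>")

def count_endings(html: str) -> Counter:
    """Count the matches by the character that ends the paragraph."""
    return Counter(match.group(1) for match in UNFINISHED.finditer(html))

# Invented sample text, not taken from a real book.
sample = (
    "<p>He listed three items:</p>\n"
    "<p>first, the hyphen;</p>\n"
    "<p>second, the comma,</p>\n"
    "<p>and third, the period.</p>"
)
endings = count_endings(sample)
print(endings)
```

A high count for ':' or ';' is a hint to step through those matches by hand instead of using Replace All.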

I keep these three in my Sigil Saved Searches (Tools - Saved Searches). More info on how to use Saved Searches can be found here:

http://web.sigil.googlecode.com/git/..._searches.html
