Old 12-23-2014, 12:09 PM   #18
eschwartz
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Mystery solved.

Tex2002ans, your regexes don't account for <p> tags that carry a class (or any other attribute). How could you!

Fixed:
Quote:
Regex #2:

Search: -</p>\s+<p( [^<>]+)?>
Replace: (leave empty)

Explanation: This removes a hyphen at the very end of a "paragraph" and joins that paragraph with the next one.

Side Note: I apply the above regex case by case, because many "soft hyphens" in the PDF aren't actually a part of the word.

Example:

Code:
<p>Blah blah blah govern-</p>
<p>ment.</p>
Code:
<p>Blah blah blah government.</p>
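
For anyone who wants to try this outside the editor, here is a minimal Python sketch of Regex #2. The function name `join_hyphenated` and the `calibre1` class are made up for illustration; only the search pattern comes from the post. It is meant as a quick way to see the match, not as a replacement for checking each hit by hand.

```python
import re

# Search pattern from the post: a hyphen right before </p>, some whitespace,
# then an opening <p> tag that may carry attributes such as a class.
HYPHEN_BREAK = re.compile(r'-</p>\s+<p( [^<>]+)?>')

def join_hyphenated(html: str) -> str:
    """Drop the trailing hyphen and merge the paragraph with the next one."""
    # Empty replacement, as in the post's "Replace:" field.
    return HYPHEN_BREAK.sub('', html)

plain = '<p>Blah blah blah govern-</p>\n<p>ment.</p>'
with_class = '<p>Blah blah blah govern-</p>\n<p class="calibre1">ment.</p>'

print(join_hyphenated(plain))       # <p>Blah blah blah government.</p>
print(join_hyphenated(with_class))  # <p>Blah blah blah government.</p>
```

The second sample is the case the fix is about: the `( [^<>]+)?` group lets the pattern match an opening tag with a class on it.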
Regex #2 (Variant):

Search: -</p>\s+<p( [^<>]+)?>
Replace: -

Note: I don't use this one. If there are TONS of hyphens at the end of lines, though, it might be best to do it this way and take care of the hyphen situation on your own at a later step. I personally prefer to use the Spell Check Tool and search for a single hyphen by itself: '-'. This gives you a list of every single word with a hyphen in it, so I can check for and fix mistakes there much more quickly.

Example:

Code:
<p>Blah blah blah govern-</p>
<p>ment.</p>
Code:
<p>Blah blah blah govern-ment.</p>
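
A rough Python sketch of the variant, for illustration only. The names `join_keep_hyphen` and `hyphenated_words` are hypothetical, and the second function is just a crude stand-in for the "list every word with a hyphen" step; it is not how the Spell Check Tool works.

```python
import re

# Same search pattern as Regex #2; only the replacement differs.
HYPHEN_BREAK = re.compile(r'-</p>\s+<p( [^<>]+)?>')

def join_keep_hyphen(html: str) -> str:
    """Merge the two paragraphs but keep the hyphen for later review."""
    return HYPHEN_BREAK.sub('-', html)

def hyphenated_words(text: str) -> list[str]:
    """Collect words containing at least one inner hyphen (assumed review aid)."""
    return re.findall(r'\w+(?:-\w+)+', text)

sample = '<p>Blah blah blah govern-</p>\n<p>ment.</p>'
joined = join_keep_hyphen(sample)

print(joined)                    # <p>Blah blah blah govern-ment.</p>
print(hyphenated_words(joined))  # ['govern-ment']
```

Keeping the hyphen means nothing is lost in the merge, so words that really are hyphenated survive and the bogus ones can be fixed in a second pass.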
Regex #3:

Search: ([^>”\?\!\.])</p>\s+<p( [^<>]+)?>
Replace: \1

Explanation: This regex searches for a paragraph that DOES NOT end in a "greater than sign", "right double quote", "question mark", "exclamation point", or "period", and then combines it with the next paragraph.

Note: There is a space after the "\1" in the Replace field, so the two halves are joined with a space between them.

Example:

Code:
<p>Susie said</p>
<p>that she was going to jump over a tree.</p>
<p>She also said,</p>
<p>that this was just a sample.</p>
Code:
<p>Susie said that she was going to jump over a tree.</p>
<p>She also said, that this was just a sample.</p>
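
And a minimal Python sketch of Regex #3 run on the sample above. The function name `join_unfinished` is made up for illustration; the pattern and the "\1 " replacement (with its trailing space) are the ones from the post.

```python
import re

# Search pattern from the post: capture the last character of a paragraph
# when it is NOT one of > ” ? ! . and match through the next opening <p>.
UNFINISHED = re.compile(r'([^>”\?\!\.])</p>\s+<p( [^<>]+)?>')

def join_unfinished(html: str) -> str:
    """Merge a paragraph that lacks closing punctuation with the next one."""
    # Put back the captured character, followed by a single space.
    return UNFINISHED.sub(r'\1 ', html)

sample = (
    '<p>Susie said</p>\n'
    '<p>that she was going to jump over a tree.</p>\n'
    '<p>She also said,</p>\n'
    '<p>that this was just a sample.</p>'
)

print(join_unfinished(sample))
# <p>Susie said that she was going to jump over a tree.</p>
# <p>She also said, that this was just a sample.</p>
```

The paragraph ending in "tree." is left alone because the period is in the excluded set, while "said" and "said," both get merged with the paragraph that follows.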