MobileRead Forums - View Single Post - How can I fix it when every line is a paragraph?

eschwartz · 12-23-2014, 12:09 PM

Mystery solved.

Tex2002ans, your regexes don't account for classes, how could you!

Fixed:

Quote:

Regex #2:

Search: -</p>\s+<p( [^<>]+)?>
Replace: (leave empty)

Explanation: What this will do is remove hyphens at the very end of the "paragraph", and combine it with the next line.

Side Note: I run the above regex one match at a time, case by case, because many "soft hyphens" in the PDF aren't actually a part of the word.

Example:

Code:

<p>Blah blah blah govern-</p>
<p>ment.</p>

Code:

<p>Blah blah blah government.</p>
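For anyone who wants to try the fixed Regex #2 outside the editor, here is a minimal Python sketch. The function name and the "calibre1" class are made up for illustration; only the pattern comes from the post. Note that it replaces every match in one go, so it skips the case-by-case review recommended above.

```python
import re

# Regex #2 from the post: a hyphen closing one <p>, whitespace,
# then the next <p>, optionally carrying attributes such as a class.
JOIN_HYPHENATED = re.compile(r'-</p>\s+<p( [^<>]+)?>')

def join_hyphenated(html: str) -> str:
    """Remove the trailing hyphen and merge the two paragraphs (hypothetical helper)."""
    return JOIN_HYPHENATED.sub('', html)

# Second paragraph has a class, which the earlier regexes did not handle.
sample = '<p>Blah blah blah govern-</p>\n<p class="calibre1">ment.</p>'
joined = join_hyphenated(sample)
print(joined)
```

The optional group `( [^<>]+)?` is what lets the pattern match both a bare `<p>` and a `<p>` with a class.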

Regex #2 (Variant):

Search: -</p>\s+<p( [^<>]+)?>
Replace: -

Note: I don't use this one, although if there are TONS of hyphens at the end of each line, it might be best to do it this way, and take care of the hyphen situation on your own at a later step. I personally prefer to use the Spell Check Tool, and search for a single hyphen by itself: '-'. This will give you a list of every single word with a hyphen in it. Then I can check for + fix mistakes there much more quickly.

Example:

Code:

<p>Blah blah blah govern-</p>
<p>ment.</p>

Code:

<p>Blah blah blah govern-ment.</p>
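A rough Python sketch of the variant, assuming you then want a list of the hyphenated words to review by hand. The word-listing step is only a stand-in for the Spell Check Tool mentioned above, not how that tool works, and the helper names are hypothetical.

```python
import re

# Same search pattern as Regex #2, but the hyphen is kept in the replacement.
JOIN_KEEP_HYPHEN = re.compile(r'-</p>\s+<p( [^<>]+)?>')

def join_keep_hyphen(html: str) -> str:
    """Merge the paragraphs but leave the hyphen in place (hypothetical helper)."""
    return JOIN_KEEP_HYPHEN.sub('-', html)

def hyphenated_words(text: str) -> list[str]:
    """List words containing a hyphen so they can be checked by hand."""
    return re.findall(r'\w+-\w+', text)

sample = '<p>Blah blah blah govern-</p>\n<p>ment.</p>'
merged = join_keep_hyphen(sample)
to_review = hyphenated_words(merged)
print(merged)
print(to_review)
```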

Regex #3:

Search: ([^>”\?\!\.])</p>\s+<p( [^<>]+)?>
Replace: \1

Explanation: What this regex will do is search for a paragraph that DOES NOT end in a "greater-than sign", "right double quote", "question mark", "exclamation point", or "period". It will then combine it with the next paragraph.

Note: There is a single space after the "\1" in the Replace field, so the two joined lines stay separated by a space.

Example:

Code:

<p>Susie said</p>
<p>that she was going to jump over a tree.</p>
<p>She also said,</p>
<p>that this was just a sample.</p>

Code:

<p>Susie said that she was going to jump over a tree.</p>
<p>She also said, that this was just a sample.</p>
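Regex #3 can be sketched the same way in Python. The pattern is the one from the post; the function name is made up, and the replacement is written as `\1` followed by a space, as the note says.

```python
import re

# Regex #3 from the post: a paragraph whose last character is NOT
# >, a right double quote, ?, ! or . followed by the next <p> (attributes optional).
JOIN_BROKEN = re.compile(r'([^>”\?\!\.])</p>\s+<p( [^<>]+)?>')

def join_broken_paragraphs(html: str) -> str:
    """Keep the last character (group 1) and add one space (hypothetical helper)."""
    return JOIN_BROKEN.sub(r'\1 ', html)

sample = (
    '<p>Susie said</p>\n'
    '<p>that she was going to jump over a tree.</p>\n'
    '<p>She also said,</p>\n'
    '<p>that this was just a sample.</p>'
)
fixed = join_broken_paragraphs(sample)
print(fixed)
```

The paragraph ending in "tree." is left alone because it ends in a period, which is why the sample collapses to two paragraphs instead of one.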