Thread: Regex examples
Old 11-17-2021, 08:36 PM   #689
Tex2002ans
Wizard
Quote:
Originally Posted by patrik
Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

For ~9 years, I've been using 3 "join" regexes. They catch ~99% of broken paragraphs, but whether to apply each match has to be decided on a case-by-case basis.

Here's a PM I wrote a few months ago with examples:

* * *

The 3 main "joins" I currently use:

Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)

1st one looks for a hyphen at the end of a paragraph:

Code:
<p>This is an ex-</p>
<p>ample.</p>
2nd one looks for any paragraph that does NOT end in closing punctuation:

Code:
<p>This is an</p>
<p>example.</p>

<p>This is a list of one,</p>
<p>two, and three.</p>
and 3rd one looks for any leftover paragraphs STARTING with a lowercase letter:

Code:
<blockquote>
	<p>This is a long quote.</p>
</blockquote>

<p>apples, Bananas, Pears...</p>

<p>and Croutons.</p>
Those 3 should catch 99% of the broken paragraphs.
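
If you want to try the 3 regexes outside of Sigil, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Sigil's regex engine for these simple patterns, and the function names are made up for illustration. Note that Regex #2's character class does not exclude '-', so the hyphen join should run first:

```python
import re

# The three patterns from the post, as Python raw strings.
HYPHEN_JOIN = re.compile(r'-</p>\s+<p>')
OPEN_END_JOIN = re.compile(r'([^>”\?\!\.])</p>\s+<p>')
LOWERCASE_START = re.compile(r'<p>[a-z]')

def join_soft_hyphen(html):
    """Regex #1: remove the hyphen and the paragraph break (blank replacement)."""
    return HYPHEN_JOIN.sub('', html)

def join_open_ended(html):
    """Regex #2: keep the captured last character and add one space."""
    return OPEN_END_JOIN.sub(r'\1 ', html)

def find_lowercase_starts(html):
    """Regex #3: FIND only. Returns each match so it can be checked by hand."""
    return [m.group(0) for m in LOWERCASE_START.finditer(html)]

print(join_soft_hyphen('<p>This is an ex-</p>\n<p>ample.</p>'))
print(join_open_ended('<p>This is a list of one,</p>\n<p>two, and three.</p>'))
```

This replaces every match at once, which is exactly what the notes below warn against for real books. Treat it as a way to test the patterns on samples, not as a one-click fix.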

- - -

Usage Note: Regex #2 also matches paragraphs that end in ':', so pay close attention to those, because whether to join them depends on the book/context:

Code:
<p>This is a list:</p>
<p>One, Two, Three</p>

<p>This is a quote:</p>
<p>“Get over here!”</p>
These could be:

Code:
<p>This is a list: One, Two, Three</p>

<p>This is a quote: “Get over here!”</p>
(More in-depth regex might also be needed for ”</p> too, but I don't have any Saved Searches on that. Very rarely do I see those actually get split by Finereader. And usually the "lowercase regex" catches all those.)
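
One way to review the ':' cases is to list them instead of replacing them. A rough Python sketch, assuming simple paragraphs with no inline tags inside them (the pattern and the function name are illustrative, not one of my Saved Searches):

```python
import re

# A paragraph ending in ':' followed directly by another paragraph.
# [^<]* keeps the match inside a single tag-free paragraph.
COLON_BREAK = re.compile(r'<p>([^<]*:)</p>\s+<p>([^<]*)</p>')

def colon_candidates(html):
    """Return (first paragraph, next paragraph) pairs for manual review."""
    return [(m.group(1), m.group(2)) for m in COLON_BREAK.finditer(html)]

sample = (
    '<p>This is a list:</p>\n<p>One, Two, Three</p>\n\n'
    '<p>This is a quote:</p>\n<p>“Get over here!”</p>'
)
for first, second in colon_candidates(sample):
    print(first, '|', second)
```

Each pair still needs a human decision. The sketch only gathers the candidates in one place.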

- - -

Regex #1 Note: Be careful. DO NOT "REPLACE ALL".

Not all hyphens are "soft hyphens". Some are real ("hard") hyphens that need to be kept:

Code:
The proto-</p>

<p>European model of [...]
would need to become:

Code:
The proto-European model of [...]
It's up to you when/how you want to deal with these. You can:

1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)

2. Join all hyphen-split paragraphs but keep a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:

(Regex #1 alt)

Search: -</p>\s+<p>
Replace: -

This would get you:

Code:
<p>This is an ex-</p>
<p>ample.</p>

<p>This is an ex-ample.</p>
Back in 2013, I wrote about how to use "Spellcheck Lists" to catch bad/inconsistent hyphenation:

2013: "How do you deal with soft hyphens in OCR texts?"

Personally, I squash everything one-by-one during cleanup.

Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!

And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. Since the leftover hyphens at that point are either correct or actual hyphenation errors that snuck into the book, this is much easier.
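
For option 2, the same idea can be roughed out in Python. This is NOT Sigil's Spellcheck List feature, just a sketch of the approach: join with the hyphen kept (Regex #1 alt), then flag any hyphenated word whose unhyphenated form also appears in the text. The function names and the word pattern (plain ASCII letters only) are assumptions:

```python
import re
from collections import Counter

HYPHEN_BREAK = re.compile(r'-</p>\s+<p>')
# Words made of ASCII letters, optionally joined by single hyphens.
WORD = re.compile(r'[A-Za-z]+(?:-[A-Za-z]+)*')

def join_keep_hyphen(html):
    """Regex #1 alt: join the paragraphs but keep the hyphen."""
    return HYPHEN_BREAK.sub('-', html)

def suspicious_hyphens(text):
    """Map each hyphenated word to (its count, count of the joined form)
    when the joined form also occurs in the text."""
    words = Counter(w.lower() for w in WORD.findall(text))
    flagged = {}
    for word, count in words.items():
        if '-' in word:
            joined = word.replace('-', '')
            if joined in words:
                flagged[word] = (count, words[joined])
    return flagged

text = 'This is an ex-ample. Another example here. The proto-European model.'
print(suspicious_hyphens(text))
```

Here "ex-ample" gets flagged because "example" also appears, while "proto-European" is left alone. Run it on the book's text rather than raw HTML, otherwise tag names get counted as words.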

Quote:
Originally Posted by patrik
But sometimes Finereader adds table-stuff:

<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex catches and destroys the table.
Back in 2020, I partially wrote about my "12-step Finereader Cleanup" (Sigil Saved Searches).

Here are the last 5 steps of my Saved Searches dealing with Finereader tables:

Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>

Clean Bold td
Search: <td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>
Replace: <td>\1</td>

Clean Italics td
Search: <td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>
Replace: <td>\1</td>

Clean td
Search: <td>\s+<p>([^<]+)</p>\s+</td>
Replace: <td>\1</td>

Clean Table Headers
Search: <td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>
Replace: <th colspan="\1">\2</th>
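
If you'd rather run these 5 steps from a script than from Sigil, here's a minimal Python sketch. The patterns and replacements are copied from the Saved Searches above. The list name, the function name, and the idea of running them in this order with Python's `re` are assumptions for illustration:

```python
import re

# (name, search, replace) in the same order as the Saved Searches.
TABLE_STEPS = [
    ('Remove Finereader 12 Table Alignment',
     r'<td style="vertical-align:[^"]+">', r'<td>'),
    ('Clean Bold td',
     r'<td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>', r'<td>\1</td>'),
    ('Clean Italics td',
     r'<td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>', r'<td>\1</td>'),
    ('Clean td',
     r'<td>\s+<p>([^<]+)</p>\s+</td>', r'<td>\1</td>'),
    ('Clean Table Headers',
     r'<td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>', r'<th colspan="\1">\2</th>'),
]

def clean_tables(html):
    """Apply each search/replace pair in order and return the result."""
    for _name, pattern, replacement in TABLE_STEPS:
        html = re.sub(pattern, replacement, html)
    return html

print(clean_tables('<td style="vertical-align:top">\n<p>Cell</p>\n</td>'))
print(clean_tables('<td colspan="2">\n<p>Title</p>\n</td>'))
```

Order matters: the alignment step has to run before "Clean td", otherwise the styled cells never match the plain `<td>` pattern.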

* * *

For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.

Now I can open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.

Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:

Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.

Quote:
Originally Posted by patrik
Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)
Test out my 3 regexes. You'll be pleasantly surprised at how well they work.

Last edited by Tex2002ans; 11-18-2021 at 02:47 PM.