MobileRead Forums - View Single Post

Tex2002ans · 11-17-2021, 08:36 PM

Quote:

Originally Posted by patrik

Often after using Finereader for OCR, some paragraphs are split into two.

Like:

This is a journey

into sound.

which should be: This is a journey into sound.

For ~9 years, I've been using 3 "join" regexes. They catch ~99% of broken paragraphs, but each match has to be decided on a case-by-case basis.

Here's a PM I wrote a few months ago with examples:

* * *

The 3 main "joins" I currently use:

Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)

1st one looks for a hyphen at the end of a paragraph:

Code:

<p>This is an ex-</p>
<p>ample.</p>

2nd one looks for any paragraph that does NOT end in closing punctuation:

Code:

<p>This is an</p>
<p>example.</p>

<p>This is a list of one,</p>
<p>two, and three.</p>

and 3rd one looks for any leftover paragraphs STARTING with a lowercase letter:

Code:

<blockquote>
	<p>This is a long quote.</p>
</blockquote>

<p>apples, Bananas, Pears...</p>

<p>and Croutons.</p>

Those 3 should catch 99% of the broken paragraphs.
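
If you'd rather preview the hits outside of Sigil first, here's a rough Python sketch of the same 3 searches. It only FINDS candidates and never replaces anything. The patterns assume the </p> and <p> tags from the examples above are part of each search, and the names (find_candidates, the labels, the sample text) are made up for illustration:

```python
import re

# Assumed patterns: each one includes the </p> ... <p> tags from the examples.
HYPHEN_JOIN = re.compile(r"-</p>\s+<p>")
OPEN_JOIN = re.compile(r"([^>”?!.])</p>\s+<p>")
LOWERCASE_START = re.compile(r"<p>[a-z]")


def find_candidates(html):
    """Return (label, offset, snippet) for every match, sorted by position."""
    hits = []
    for label, pattern in (
        ("hyphen", HYPHEN_JOIN),
        ("open-ended", OPEN_JOIN),
        ("lowercase", LOWERCASE_START),
    ):
        for m in pattern.finditer(html):
            # Keep a little context on both sides so each hit can be judged by eye.
            start = max(m.start() - 15, 0)
            snippet = html[start:m.end() + 15].replace("\n", " ")
            hits.append((label, m.start(), snippet))
    return sorted(hits, key=lambda h: h[1])


sample = (
    "<p>This is an ex-</p>\n<p>ample.</p>\n"
    "<p>This is a list of one,</p>\n<p>two, and three.</p>\n"
    "<p>A complete sentence.</p>\n<p>Another one.</p>\n"
)
for label, offset, snippet in find_candidates(sample):
    print(label, offset, snippet)
```

Note that a paragraph ending in a hyphen is also matched by the 2nd pattern (a hyphen is not closing punctuation), which is one reason to run the hyphen pass first. And since each join pattern only allows whitespace between </p> and <p>, anything else sitting between two fragments (like a <table>) is left alone.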

- - -

Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:

Code:

<p>This is a list:</p>
<p>One, Two, Three</p>

<p>This is a quote:</p>
<p>“Get over here!”</p>

These could be:

Code:

<p>This is a list: One, Two, Three</p>

<p>This is a quote: “Get over here!”</p>

(More in-depth regex might also be needed for ” too, but I don't have any Saved Searches on that. Very rarely do I see those actually get split by Finereader. And usually the "lowercase regex" catches all those.)
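
Since the ':' cases all need a human decision, one option is to just list them for review. A minimal sketch, assuming plain paragraphs with no inline tags inside them (COLON_PAIR and colon_pairs are hypothetical names, not one of my Saved Searches):

```python
import re

# Hypothetical helper pattern: a paragraph ending in ':' followed directly
# by another plain paragraph. Paragraphs containing inline tags are skipped.
COLON_PAIR = re.compile(r"<p>([^<]*:)</p>\s+<p>([^<]*)</p>")


def colon_pairs(html):
    """Return (lead-in, following paragraph) pairs to review by hand."""
    return [(m.group(1), m.group(2)) for m in COLON_PAIR.finditer(html)]


sample = (
    "<p>This is a list:</p>\n<p>One, Two, Three</p>\n\n"
    "<p>This is a quote:</p>\n<p>“Get over here!”</p>\n"
)
for before, after in colon_pairs(sample):
    print(before, "|", after)
```

Whether each pair gets joined still depends on the book/context. The sketch only collects them in one place.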

- - -

Regex #1 Note: You have to be careful. DO NOT "REPLACE ALL".

Not all hyphens are "soft hyphens". Some need to be replaced with an actual hyphen:

Code:

The proto-</p>

<p>European model of [...]

would need to become:

Code:

The proto-European model of [...]

It's up to you when/how you want to deal with these. You can:

1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)

2. Replace all broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:

(Regex #1 alt)

Search: -</p>\s+<p>
Replace: -

This would turn the broken pair into the single paragraph below it:

Code:

<p>This is an ex-</p>
<p>ample.</p>

<p>This is an ex-ample.</p>
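
The two choices (drop the hyphen vs. keep it) can be compared side by side in Python. This is only a sketch: it assumes the break looks like the examples above, with the </p> and <p> tags as part of the pattern, and the function names are made up:

```python
import re

# Assumed pattern: hyphen, closing </p>, whitespace, opening <p>.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")


def join_drop_hyphen(html):
    """Regex #1: treat the hyphen as 'soft' and remove it with the break."""
    return HYPHEN_BREAK.sub("", html)


def join_keep_hyphen(html):
    """Regex #1 alt: keep a 'hard' hyphen and remove only the break."""
    return HYPHEN_BREAK.sub("-", html)


broken = "<p>This is an ex-</p>\n<p>ample.</p>"
print(join_drop_hyphen(broken))
print(join_keep_hyphen(broken))
```

Keep in mind both functions replace EVERY match, which is exactly the "Replace All" behavior warned about above. They are only safe if a later pass reviews the resulting hyphens.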

Back in 2013, I wrote how to use "Spellcheck Lists" to catch bad/inconsistent hyphenation:

2013: "How do you deal with soft hyphens in OCR texts?"

Personally, I squash everything one-by-one during cleanup.

Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!

And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the only hyphens left at that point are either correct OR actual hyphenation errors that snuck into the book, this is much easier.
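
This is NOT Sigil's Spellcheck List, just a rough stand-in for the idea of mass-checking hyphens: list every hyphenated word and see whether the same word also shows up without the hyphen. The function name and the crude tag stripping are assumptions for illustration only:

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")                   # crude tag stripper, fine for a sketch
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")  # ASCII words, optionally hyphenated


def hyphen_report(html):
    """Map each hyphenated word to (its count, count of the unhyphenated form)."""
    text = TAG.sub(" ", html).lower()
    counts = Counter(WORD.findall(text))
    return {
        word: (n, counts[word.replace("-", "")])
        for word, n in counts.items()
        if "-" in word
    }


sample = (
    "<p>This is an ex-ample.</p>"
    "<p>Another example of the proto-European model.</p>"
)
for word, (hyphenated, joined) in hyphen_report(sample).items():
    print(word, hyphenated, joined)
```

A hyphenated word whose joined form also appears in the book (like "ex-ample" next to "example") is a likely leftover soft hyphen. One whose joined form never appears (like "proto-european") is more likely a real hyphen, but it still deserves a look.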

Quote:

Originally Posted by patrik

But sometimes Finereader adds table-stuff:

This is a journey
<table border="1">
<tbody>
<tr>
<td></td>

<td>
into sound.

which the regex catches and destroys the table.

Back in 2020, I partially wrote about my "12-step Finereader Cleanup" (Sigil Saved Searches).

Here are the last 5 steps of my Saved Searches dealing with Finereader tables:

Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>

Clean Bold td
Search: <td>\s+([^<]+)\s+</td>
Replace: <td>\1</td>

Clean Italics td
Search: <td>\s+([^<]+)\s+</td>
Replace: <td>\1</td>

Clean td
Search: <td>\s+([^<]+)\s+</td>
Replace: <td>\1</td>

Clean Table Headers
Search: <td colspan="([0-9]+)">\s+([^<]+)\s+</td>
Replace: <th colspan="\1">\2</th>
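
Outside of Sigil, the same kind of search/replace chain can be run as a small Python loop. This sketch uses only the 3 distinct search/replace pairs exactly as they appear above, applied in order. clean_tables and TABLE_STEPS are hypothetical names:

```python
import re

# (search, replace) pairs copied from the searches as they appear above.
TABLE_STEPS = [
    (r'<td style="vertical-align:[^"]+">', r"<td>"),
    (r"<td>\s+([^<]+)\s+</td>", r"<td>\1</td>"),
    (r'<td colspan="([0-9]+)">\s+([^<]+)\s+</td>', r'<th colspan="\1">\2</th>'),
]


def clean_tables(html):
    """Apply each search/replace pair once, in the listed order."""
    for search, replace in TABLE_STEPS:
        html = re.sub(search, replace, html)
    return html


cell = '<td style="vertical-align:top;">\n  Cell text\n</td>'
header = '<td colspan="2">\nTitle\n</td>'
print(clean_tables(cell))
print(clean_tables(header))
```

Order matters here: the alignment attribute has to be stripped first, otherwise the plain <td> search never sees those cells.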

* * *

For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.

Then I can open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.

Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:

2021: "Archive.org ePub" (Post #11)

Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.

Quote:

Originally Posted by patrik

Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)

Test out my 3 regexes. You'll be pleasantly surprised at how well they work.