Quote:
Originally Posted by patrik
Often after using Finereader for OCR, some paragraphs are split into two.
Like:
<p>This is a journey</p>
<p>into sound.</p>
which should be: <p>This is a journey into sound.</p>
For ~9 years, I've been using 3 "join" regexes. They catch ~99% of broken paragraphs, but each match has to be decided on a case-by-case basis.
Here's a PM I wrote a few months ago with examples:
* * *
The 3 main "joins" I currently use:
Search: -</p>\s+<p>
Replace: <--- (Completely blank)
and:
Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')
and:
Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)
1st one looks for a hyphen at the end of a paragraph:
Code:
<p>This is an ex-</p>
<p>ample.</p>
2nd one looks for any paragraph that does NOT end in closing punctuation:
Code:
<p>This is an</p>
<p>example.</p>
<p>This is a list of one,</p>
<p>two, and three.</p>
and 3rd one looks for any leftover paragraphs STARTING with a lowercase letter:
Code:
<blockquote>
<p>This is a long quote.</p>
</blockquote>
<p>apples, Bananas, Pears...</p>
<p>and Croutons.</p>
Those 3 should catch 99% of the broken paragraphs.
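If you want to try the 3 regexes outside of Sigil, here is a minimal Python sketch. The variable names and the sample text are made up for illustration, and it does a blind "replace all" only because the sample is tiny. On a real book you would step through each match.

```python
import re

# Regex #1: hyphen at the end of a paragraph (treated as soft: drop it and join).
HYPHEN_JOIN = re.compile(r'-</p>\s+<p>')
# Regex #2: paragraph that does NOT end in closing punctuation (join with a space).
NO_PUNCT_JOIN = re.compile(r'([^>”\?\!\.])</p>\s+<p>')
# Regex #3: paragraph starting with a lowercase letter (FIND only, never replace).
LOWERCASE_START = re.compile(r'<p>[a-z]')

# Hypothetical sample, built from the examples in this post.
sample = (
    '<p>This is an ex-</p>\n<p>ample.</p>\n'
    '<p>This is an</p>\n<p>example.</p>\n'
    '<p>“Done.”</p>\n<p>and Croutons.</p>\n'
)

joined = HYPHEN_JOIN.sub('', sample)
joined = NO_PUNCT_JOIN.sub(r'\1 ', joined)  # note the space after \1

# Offsets of leftover paragraphs that still need a manual look.
leftovers = [m.start() for m in LOWERCASE_START.finditer(joined)]

print(joined)
print(len(leftovers), 'paragraph(s) left to check by hand')
```

The first two paragraphs get joined, and the "and Croutons." paragraph is only flagged, because the paragraph before it ends in a closing quote.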
- - -
Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:
Code:
<p>This is a list:</p>
<p>One, Two, Three</p>
<p>This is a quote:</p>
<p>“Get over here!”</p>
These could be:
Code:
<p>This is a list: One, Two, Three</p>
<p>This is a quote: “Get over here!”</p>
(More in-depth regex might also be needed for ”</p> too, but I don't have any Saved Searches on that. Very rarely do I see those actually get split by Finereader. And usually the "lowercase regex" catches all those.)
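I don't have a Saved Search for the ':' case, but a find-only pattern along these lines could list the candidates for review. This is a rough sketch in Python, and `COLON_END` and `colon_candidates` are names I made up for the example:

```python
import re

# Find-only: a paragraph ending in ':' followed directly by another paragraph.
COLON_END = re.compile(r'<p>([^<]*:)</p>\s+<p>([^<]*)</p>')

def colon_candidates(html):
    """Return (lead-in, following text) pairs to review by hand."""
    return [(m.group(1), m.group(2)) for m in COLON_END.finditer(html)]

sample = (
    '<p>This is a list:</p>\n<p>One, Two, Three</p>\n'
    '<p>This is a quote:</p>\n<p>“Get over here!”</p>\n'
)

for lead, following in colon_candidates(sample):
    print(f'{lead!r} -> {following!r}')
```

It deliberately replaces nothing, since whether to join depends on the book/context.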
- - -
Regex #1 Note: You have to be careful, DO NOT "REPLACE ALL".
Not all hyphens are "soft hyphens". Some need to stay as an actual hyphen after the join:
Code:
The proto-</p>
<p>European model of [...]
would need to become:
Code:
The proto-European model of [...]
It's up to you when/how you want to deal with these. You can:
1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)
2. Replace all broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:
(Regex #1 alt)
Search: -</p>\s+<p>
Replace: -
This would turn:
Code:
<p>This is an ex-</p>
<p>ample.</p>
into:
Code:
<p>This is an ex-ample.</p>
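To see both choices side by side before deciding, something like this Python sketch could list the "soft" and "hard" join for every hyphenated break. The capture groups and the `hyphen_options` helper are my own additions for illustration, not part of Regex #1:

```python
import re

# Regex #1 with the word halves captured on both sides of the break.
HYPHEN_BREAK = re.compile(r'(\w+)-</p>\s+<p>(\w+)')

def hyphen_options(html):
    """Return (soft join, hard join) for every hyphenated paragraph break."""
    return [
        (m.group(1) + m.group(2), m.group(1) + '-' + m.group(2))
        for m in HYPHEN_BREAK.finditer(html)
    ]

sample = (
    '<p>This is an ex-</p>\n<p>ample.</p>\n'
    '<p>The proto-</p>\n<p>European model</p>'
)

for soft, hard in hyphen_options(sample):
    print(f'soft: {soft:<15} hard: {hard}')
```

Here "example" wants the soft join and "proto-European" wants the hard one, which is exactly why "REPLACE ALL" is risky.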
Back in 2013, I wrote how to use "Spellcheck Lists" to catch bad/inconsistent hyphenation:
2013: "How do you deal with soft hyphens in OCR texts?"
Personally, I squash everything one-by-one during cleanup.
Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!
And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the leftover hyphens there are either correct OR actual hyphenation errors that snuck into the book, this is much easier.
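If you don't have a Spellcheck List handy, the same idea can be roughed out in Python: count every hyphenated word and check whether its unhyphenated form also appears. This is only a sketch of the concept, not how Sigil does it, and `hyphen_report` is a made-up name:

```python
import re
from collections import Counter

def hyphen_report(text):
    """Map each hyphenated word to (its count, count of its unhyphenated form)."""
    words = Counter(
        w.lower() for w in re.findall(r'[A-Za-z]+(?:-[A-Za-z]+)*', text)
    )
    return {
        w: (n, words[w.replace('-', '')])
        for w, n in words.items()
        if '-' in w
    }

text = 'An ex-ample here, an example there, and a proto-European model.'
report = hyphen_report(text)

# Hyphenated words whose joined form also occurs are the inconsistent ones.
suspects = sorted(w for w, (_, joined) in report.items() if joined > 0)
print(suspects)
```

Words that show up both ways are the ones worth a second look.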
Quote:
Originally Posted by patrik
But sometimes Finereader adds table-stuff:
<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>
<td>
<p>into sound.</p>
which the regex catches and destroys the table.
Back in 2020, I partially wrote about my "12-step Finereader Cleanup" (Sigil Saved Searches).
Here are the last 5 steps of my Saved Searches dealing with Finereader tables:
Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>
Clean Bold td
Search: <td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>
Replace: <td>\1</td>
Clean Italics td
Search: <td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>
Replace: <td>\1</td>
Clean td
Search: <td>\s+<p>([^<]+)</p>\s+</td>
Replace: <td>\1</td>
Clean Table Headers
Search: <td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>
Replace: <th colspan="\1">\2</th>
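For anyone who wants to run those 5 steps outside of Sigil, here is a minimal Python sketch that applies them in the same order. The `TABLE_STEPS` list, the `clean_tables` helper, and the sample markup are assumptions for illustration. Real Finereader output will be messier.

```python
import re

# (name, search, replace) in the same order as the Saved Searches above.
TABLE_STEPS = [
    ('Remove Finereader 12 Table Alignment',
     r'<td style="vertical-align:[^"]+">', r'<td>'),
    ('Clean Bold td',
     r'<td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>', r'<td>\1</td>'),
    ('Clean Italics td',
     r'<td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>', r'<td>\1</td>'),
    ('Clean td',
     r'<td>\s+<p>([^<]+)</p>\s+</td>', r'<td>\1</td>'),
    ('Clean Table Headers',
     r'<td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>', r'<th colspan="\1">\2</th>'),
]

def clean_tables(html):
    """Run every table cleanup step, in order, over the whole string."""
    for _name, search, replace in TABLE_STEPS:
        html = re.sub(search, replace, html)
    return html

# Hypothetical Finereader-style table rows.
sample = (
    '<tr>\n'
    '<td colspan="2">\n<p>Fruit</p>\n</td>\n'
    '</tr>\n<tr>\n'
    '<td style="vertical-align:top">\n<p>Apples</p>\n</td>\n'
    '<td>\n<p><span class="bold">12</span></p>\n</td>\n'
    '</tr>'
)

print(clean_tables(sample))
```

The order matters: the alignment step has to run first so the later `<td>` patterns can match.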
* * *
For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.
That way I can open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.
Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:
Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.
Quote:
Originally Posted by patrik
Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)
Test out my 3 regexes. You'll be pleasantly surprised at how well they work.
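As a quick check against the two cases you asked about (the table and the capital "I"), here is a small Python sketch. The sample strings are rebuilt from your examples, and the "I said." paragraph is made up:

```python
import re

NO_PUNCT_JOIN = re.compile(r'([^>”\?\!\.])</p>\s+<p>')  # Regex #2
LOWERCASE_START = re.compile(r'<p>[a-z]')                # Regex #3 (find only)

table_case = (
    '<p>This is a journey</p>\n<table border="1">\n<tbody>\n<tr>\n'
    '<td></td>\n<td>\n<p>into sound.</p>'
)
capital_case = '<p>That is what</p>\n<p>I said.</p>'

# Regex #2 needs <p> right after the whitespace, so <table> blocks the join.
table_joined = NO_PUNCT_JOIN.search(table_case) is not None
# Regex #3 still flags the stray lowercase paragraph inside the table.
table_flagged = LOWERCASE_START.search(table_case) is not None
# A continuation starting with "I" is caught by Regex #2, not Regex #3.
capital_joined = NO_PUNCT_JOIN.search(capital_case) is not None
capital_flagged = LOWERCASE_START.search(capital_case) is not None

print(table_joined, table_flagged, capital_joined, capital_flagged)
```

So the table is left alone but still gets flagged for a manual look, and the "I" case is only caught when the paragraph before it lacks closing punctuation.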