After running OCR with FineReader, some paragraphs often end up split in two.
Like:
<p>This is a journey</p>
<p>into sound.</p>
which should be: <p>This is a journey into sound.</p>
A search-and-replace regex like this:
search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2
seems to work. But sometimes FineReader adds table markup in between:
<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>
<td>
<p>into sound.</p>
which the regex also matches, so the replacement destroys the table.
Is there any way to match only what is safe to replace? (And also to catch cases where the letter right before or after the break is a valid one-letter word with a capital letter, like "I"?)
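For illustration, here is one possible way to narrow the match, sketched in Python rather than in the editor's own search box. It assumes that a "safe" split is one where only whitespace sits between `</p>` and `<p>`, so `\s*` replaces `.*?` and any tag in between (such as `<table>`) blocks the match. The function name `join_split_paragraphs` and the handling of a standalone "I" are my own assumptions, not something FineReader or any editor provides.

```python
import re

# Assumption: a split is safe to join only when nothing but whitespace
# separates </p> from the next <p>. Using \s* instead of .*? means an
# intervening tag such as <table> prevents a match.
# The alternations also accept a standalone capital "I" on either side.
SPLIT_PARA = re.compile(r"([a-z]|\bI)</p>\s*<p>([a-z]|I\b)")

def join_split_paragraphs(html: str) -> str:
    """Join paragraphs that look like one sentence split in two."""
    return SPLIT_PARA.sub(r"\1 \2", html)

split = "<p>This is a journey</p>\n<p>into sound.</p>"
table = ('<p>This is a journey</p>\n<table border="1">\n<tbody>\n<tr>\n'
         "<td></td>\n<td>\n<p>into sound.</p>")

print(join_split_paragraphs(split))           # paragraphs are joined
print(join_split_paragraphs(table) == table)  # table markup is left alone
```

The same search pattern should carry over to most regex search-and-replace dialogs, though the syntax for `\b` and backreferences can differ between tools. Accepting "I" can also join paragraphs that really are separate (for example a heading followed by "I think..."), so it is worth reviewing those matches by hand.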
Last edited by patrik; 11-17-2021 at 12:32 PM.