MobileRead Forums - View Single Post

patrik · 11-17-2021, 12:23 PM

Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

Doing a regex like this:

search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2

seems to work. But sometimes Finereader adds table-stuff:

<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex also catches, destroying the table.

Is there any way to match only what is safe to replace? (And also to match cases where the last or first word is a valid one-letter word with a capital letter, like "I"?)

Last edited by patrik; 11-17-2021 at 12:32 PM.
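
For the question above, one possible approach is to allow only whitespace between the closing </p> and the next <p>, so that anything containing table markup no longer matches. The sketch below tries this in Python. The pattern, the name join_split_paragraphs and the sample strings are illustrative assumptions, not something tested on real Finereader output, and the handling of "I" is a simple guess that will also join a paragraph starting with a heading such as "I. Introduction".

```python
import re

# Hypothetical pattern: join two <p> elements only when nothing but
# whitespace sits between </p> and <p>.
# Group 1: end of the first paragraph, either a lowercase letter or the
#          standalone word "I".
# Group 2: start of the next paragraph, either a lowercase letter or the
#          standalone word "I".
SPLIT_PARAGRAPH = re.compile(r"([a-z]|\bI)</p>\s*<p>([a-z]|I\b)")


def join_split_paragraphs(html: str) -> str:
    """Merge paragraphs that look like one sentence split in two."""
    return SPLIT_PARAGRAPH.sub(r"\1 \2", html)


plain = "<p>This is a journey</p>\n\n<p>into sound.</p>"
table = (
    "<p>This is a journey</p>\n"
    '<table border="1">\n<tbody>\n<tr>\n<td></td>\n\n<td>\n'
    "<p>into sound.</p>"
)

print(join_split_paragraphs(plain))           # <p>This is a journey into sound.</p>
print(join_split_paragraphs(table) == table)  # True: the table case is left alone
```

The only change from the original search is \s* in place of .*?, plus the two alternatives for "I"; the replacement stays \1 \2. The pattern uses only basic syntax, so it should carry over to the search-and-replace box of most regex-capable editors, though that is an assumption rather than something verified here.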