Thread: Regex examples
11-17-2021, 12:23 PM   #686
patrik
Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

Doing a regex like this:

search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2

seems to work. But sometimes Finereader adds table markup:


<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex also matches, destroying the table.

Is there any way to match only what is safe to replace? (And also to match when the last or first word is a valid one-letter word with a capital letter, like "I"?)
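
One possible direction, as a rough sketch only: replace the `.*?` between the tags with `\s*`, so that only whitespace may sit between `</p>` and `<p>` and any intervening tag such as `<table>` stops the match. A standalone "I" can be allowed with a word boundary. Python's `re` module is used here purely for illustration; the names `SPLIT_PARA` and `join_split_paragraphs` are made up for this example, and the regex flavor in an ebook editor may differ in details.

```python
import re

# Hypothetical pattern name. Group 1: last character of the first paragraph,
# either a lowercase letter or a standalone "I" (\b keeps "WWII" from matching).
# \s* allows only whitespace between the tags, so <table>, <td> etc. block the match.
# Group 2: first character of the next paragraph, a lowercase letter or a
# standalone "I" ("It" does not match because of the trailing \b).
SPLIT_PARA = re.compile(r"([a-z]|\bI)</p>\s*<p>([a-z]|I\b)")


def join_split_paragraphs(html: str) -> str:
    """Join paragraphs that were split in two, using the same \\1 \\2 replacement."""
    return SPLIT_PARA.sub(r"\1 \2", html)


if __name__ == "__main__":
    split = "<p>This is a journey</p>\n\n<p>into sound.</p>"
    print(join_split_paragraphs(split))

    table = (
        '<p>This is a journey</p>\n<table border="1">\n<tbody>\n<tr>\n'
        "<td></td>\n\n<td>\n<p>into sound.</p>"
    )
    # The table markup sits between the tags, so nothing is replaced.
    print(join_split_paragraphs(table) == table)
```

This still joins only when the first paragraph ends without punctuation, so a real paragraph break after a full stop is left alone. It will not repair the table case; it just leaves it untouched for manual checking.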

Last edited by patrik; 11-17-2021 at 12:32 PM.