My GUESS is you satisfied the FIND with the first match found (no recursive +), which is why you saw the Highlight as it was
|
Quote:
Code:
<a id="Page_([xvi]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)\]"></a>
Code:
<a id="Page_([\d]+)" class="x-ebookmaker-pageno" title="\[([\d]+)\]"></a>
Code:
<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a> |
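To see how the combined alternation handles both roman and arabic page numbers, here is a minimal Python sketch. The two sample anchors are made-up test strings, not lines from a real book, and Python's `re` module stands in for the editor's regex engine, so small syntax differences are possible.

```python
import re

# The combined pattern from the post: roman numerals OR digits in both places.
page_anchor = re.compile(
    r'<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" '
    r'title="\[([xvi]+|[\d]+)\]"></a>'
)

# Hypothetical sample anchors (assumed, for illustration only).
samples = [
    '<a id="Page_xii" class="x-ebookmaker-pageno" title="[xii]"></a>',
    '<a id="Page_42" class="x-ebookmaker-pageno" title="[42]"></a>',
]

# Group 1 holds the page label taken from the id attribute.
page_ids = [page_anchor.fullmatch(s).group(1) for s in samples]
print(page_ids)
```

One pattern with `|` inside each group replaces the two separate searches, so both kinds of page anchor are found in a single pass.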
How can I transform uppercase text into lowercase text between tags with RegEx?
Example before:
Code:
<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>
Example after:
Code:
<p class="tibTrans">la ma nam dang ji dam kjil khor lha</p>
Code:
Find: <p class="tibTrans">(.*?)<\/p>
Code:
Replace: <p class="tibTrans">\L$1<\/p> |
Quote:
Code:
Replace: <p class="tibTrans">\L\1<\/p> |
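For anyone doing the same conversion in a script rather than in the editor's Find & Replace, here is a rough Python equivalent. Python's `re` module has no `\L` case-conversion escape, so this sketch does the lowercasing in a callback instead; the function name `lowercase_text` is just an illustrative choice.

```python
import re

# Sample input taken from the example in the question.
html = '<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>'

# Capture the opening tag, the text, and the closing tag separately.
pattern = re.compile(r'(<p class="tibTrans">)(.*?)(</p>)')

def lowercase_text(match):
    # Python's re has no \L escape, so the case change happens here.
    return match.group(1) + match.group(2).lower() + match.group(3)

result = pattern.sub(lowercase_text, html)
print(result)
```

Only the captured text between the tags is lowercased; the tag and its class attribute are written back unchanged.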
Quote:
|
Often after using Finereader for OCR, some paragraphs are split into two. Like:
<p>This is a journey</p>
<p>into sound.</p>
which should be:
<p>This is a journey into sound.</p>
Doing a regex like this:
search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2
seems to work. But sometimes Finereader adds table-stuff:
<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>
<td>
<p>into sound.</p>
which the regex also catches, destroying the table. Is there any way to catch only what is safe to replace? (And to catch cases where the last or first letter is a valid word with a capital letter, like "I"?) |
Try it out, for me it connects paragraphs. You can remove the characters you don't want, e.g. Polish characters.
search: ([[:alpha:],ą,ć,ę,ł,ń,ó,ś,ź,ż,,,;,:,-,–,—,“,”,])</p>\s*<p\b[^>]*>
replace: \1 |
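The difference between the two searches can be checked with a small Python sketch. Several things here are assumptions rather than part of either post: the character class is simplified to lowercase letters plus a few punctuation marks, the replacement adds a space after `\1`, a lookahead limits the join to paragraphs that continue in lowercase, and the original `.*?` search is run with dot matching newlines.

```python
import re

broken = "<p>This is a journey</p>\n<p>into sound.</p>"
with_table = (
    "<p>This is a journey</p>\n"
    '<table border="1">\n<tbody>\n<tr>\n<td></td>\n<td>\n'
    "<p>into sound.</p>"
)

# Only whitespace may sit between </p> and the next <p ...>.
# The (?=[a-z]) lookahead (an assumption) requires a lowercase continuation.
safe_join = re.compile(r"([a-z,;:])</p>\s*<p\b[^>]*>(?=[a-z])")

def join_paragraphs(text):
    # The trailing space in the replacement is an assumption.
    return safe_join.sub(r"\1 ", text)

# The original search, with '.' allowed to match newlines (assumed).
greedy_result = re.sub(
    r"([a-z])</p>.*?<p>([a-z])", r"\1 \2", with_table, flags=re.DOTALL
)

print(join_paragraphs(broken))
print(join_paragraphs(with_table) == with_table)
print(greedy_result)
```

Because `\s*` cannot cross the `<table>` markup, the table case is left untouched, whereas `.*?` swallows everything up to the next `<p>`.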
Thanks! Much better than my version.
It does still catch cases where there should be two paragraphs but a period is missing, though. I'm not sure if it's possible to differentiate between these "valid" errors...? |
Quote:
Here's a PM I wrote a few months ago with examples:

* * *

The 3 main "joins" I currently use:

Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)

1st one looks for a hyphen at the end of a paragraph:
Code:
<p>This is an ex-</p>
Code:
<p>This is an</p>
Code:
<blockquote>

- - -

Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:
Code:
<p>This is a list:</p>
Code:
<p>This is a list: One, Two, Three</p>

- - -

Regex #1 Note: You have to be careful, DO NOT "REPLACE ALL". Not all hyphens are "soft hyphens". Some need to be replaced with an actual hyphen:
Code:
The proto-</p>
Code:
The proto-European model of [...]

1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)
2. Replace all broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:

(Regex #1 alt)
Search: -</p>\s+<p>
Replace: -

This would get you:
Code:
<p>This is an ex-</p>

2013: "How do you deal with soft hyphens in OCR texts?"

Personally, I squash everything one-by-one during cleanup. Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!

And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the leftover hyphens there are correct OR actual hyphenation errors that snuck into the book, this is much easier. :D
Quote:
Here's the last 5 steps of my Saved Searches dealing with Finereader tables:

Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>

Clean Bold td
Search: <td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>
Replace: <td>\1</td>

Clean Italics td
Search: <td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>
Replace: <td>\1</td>

Clean td
Search: <td>\s+<p>([^<]+)</p>\s+</td>
Replace: <td>\1</td>

Clean Table Headers
Search: <td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>
Replace: <th colspan="\1">\2</th>

* * *

For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.

Then I could open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.

Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:

Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.
Quote:
|
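The five table searches quoted above can also be run as a batch outside Sigil. This is only a sketch: Python's `re` module stands in for Sigil's regex engine, the `clean_tables` helper and the sample markup are made up for illustration, and the patterns are copied from the post in the same order.

```python
import re

# The five search/replace pairs from the post, in the same order.
TABLE_CLEANUPS = [
    (r'<td style="vertical-align:[^"]+">', r"<td>"),
    (r'<td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>', r"<td>\1</td>"),
    (r'<td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>', r"<td>\1</td>"),
    (r"<td>\s+<p>([^<]+)</p>\s+</td>", r"<td>\1</td>"),
    (r'<td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>', r'<th colspan="\1">\2</th>'),
]

def clean_tables(html):
    # Apply each saved search in turn; order matters because the first
    # step normalizes <td style=...> so the later steps can match <td>.
    for search, replace in TABLE_CLEANUPS:
        html = re.sub(search, replace, html)
    return html

# Hypothetical Finereader-style table row (assumed, for illustration).
sample = (
    '<tr>\n<td style="vertical-align:top">\n<p>Name</p>\n</td>\n'
    '<td colspan="2">\n<p>Totals</p>\n</td>\n</tr>'
)
cleaned = clean_tables(sample)
print(cleaned)
```

Keeping the pairs in one ordered list mirrors a group of Saved Searches: the whole set runs in one call, and new steps can be appended without touching the loop.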
Tex2002ans, I'm constantly amazed by what amazing posts you write! Thank you very much! :-)
MobileRead.com is a privately owned, operated and funded community.