MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Sigil (https://www.mobileread.com/forums/forumdisplay.php?f=203)
-   -   Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

theducks 10-12-2020 09:57 PM

My GUESS is you satisfied the FIND with the first match found (no recursive +), which is why you saw the Highlight as it was

davidfor 10-13-2020 02:53 AM

Quote:

Originally Posted by hobnail (Post 4046153)
I don't understand why this isn't working; my search string is:

<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)|([\d]+)\]"></a>

When the file contains

<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>

and I click on the Find button, it highlights only

<a id="Page_i

What's wrong with my regexp?

The "|" is basically an "or". Your regex basically searches for a match to one of:

Code:

<a id="Page_([xvi]+)

Code:

([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)

Code:

([\d]+)\]"></a>

I think you want:

Code:

<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>

The two groups capture the page number and the title, in either of the formats.
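
For anyone who wants to check this outside Sigil, here is a minimal sketch using Python's built-in re module. It assumes Python's engine is a fair stand-in for Sigil's PCRE on this point (plain alternation and grouping behave the same way in both); the variable names are made up for illustration. A top-level "|" splits the whole pattern, so the original regex is satisfied by a prefix of the tag, while the grouped version matches the full tag.

```python
import re

tag = '<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>'

# Original pattern: the two top-level "|" split it into three alternatives.
broken = re.compile(
    r'<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" '
    r'title="\[([xvi]+)|([\d]+)\]"></a>'
)

# Corrected pattern: each "|" sits inside a group, so it only chooses
# between roman and arabic numerals.
fixed = re.compile(
    r'<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" '
    r'title="\[([xvi]+|[\d]+)\]"></a>'
)

broken_match = broken.search(tag)
fixed_match = fixed.search(tag)

print(broken_match.group(0))  # only the start of the tag
print(fixed_match.group(0))   # the whole tag
print(fixed_match.groups())   # captured page number and title
```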

Skydancer 01-18-2021 05:57 AM

How can I transform uppercase text into lowercase text between tags with RegEx?

Example
before:
Code:

<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>
after:
Code:

<p class="tibTrans">la ma nam dang ji dam kjil khor lha</p>
I tried this, but it doesn't work in Sigil:
Code:

Find: <p class="tibTrans">(.*?)<\/p>
Code:

Replace: <p class="tibTrans">\L$1<\/p>

Doitsu 01-18-2021 07:09 AM

Quote:

Originally Posted by Skydancer (Post 4083495)
I tried this, but it doesn't work in Sigil:
Code:

Find: <p class="tibTrans">(.*?)<\/p>
Code:

Replace: <p class="tibTrans">\L$1<\/p>

Sigil uses the PCRE regex library; you'll need to use backslashes for backreferences.

Code:

Replace: <p class="tibTrans">\L\1<\/p>
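
If you want to try the same transformation in a script, note that Python's re module has no \L case operator, so the sketch below swaps in a replacement function that lowercases the captured group. That is an assumption-laden stand-in for the Sigil replacement above, not the same mechanism; the function name is hypothetical.

```python
import re

html = '<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>'
pattern = re.compile(r'<p class="tibTrans">(.*?)</p>')


def lowercase_body(match):
    # Same effect as the replacement <p class="tibTrans">\L\1</p> in Sigil:
    # keep the tags, lowercase the captured text.
    return '<p class="tibTrans">' + match.group(1).lower() + '</p>'


result = pattern.sub(lowercase_body, html)
print(result)
```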

hobnail 01-20-2021 01:19 AM

Quote:

Originally Posted by davidfor (Post 4046366)
The "|" is basically an "or". Your regex basically searches for a match to one of:

Code:

<a id="Page_([xvi]+)

Code:

([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)

Code:

([\d]+)\]"></a>

I think you want:

Code:

<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>

The two groups capture the page number and the title, in either of the formats.

Sorry, somehow I missed your reply (or maybe I forgot that I read it, also quite likely), so this is a belated thanks.

patrik 11-17-2021 01:23 PM

Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

Doing a regex like this:

search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2

seems to work. But sometimes Finereader adds table-stuff:


<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex also catches, destroying the table.

Is there any way to catch only what is safe to replace? (And also to catch cases where the last or first word is a valid capitalized word, like "I"?)

bravosx 11-17-2021 02:06 PM

Try this one; for me it joins the paragraphs. You can remove the characters you don't want, e.g. the Polish characters.

search: ([[:alpha:],ą,ć,ę,ł,ń,ó,ś,ź,ż,,,;,:,-,–,—,“,”,])</p>\s*<p\b[^>]*>
replace: \1
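
To see why restricting the gap between the paragraphs to whitespace protects the table, here is a rough sketch in Python's re module. Assumptions: dot-matches-newline is in effect for the first pattern (which patrik's result suggests), the character class is cut down to a-z plus a few punctuation marks because Python's re has no [[:alpha:]], and a space is added after \1 in the replacement so the words don't run together. The names are hypothetical.

```python
import re

split_para = "<p>This is a journey</p>\n\n<p>into sound.</p>"
with_table = (
    "<p>This is a journey</p>\n"
    '<table border="1">\n<tbody>\n<tr>\n<td></td>\n\n<td>\n'
    "<p>into sound.</p>"
)

# First attempt: ".*?" may skip over anything, including the table markup.
loose = re.compile(r"([a-z])</p>.*?<p>([a-z])", re.DOTALL)

# Whitespace-only gap: the match fails as soon as another tag intervenes.
strict = re.compile(r"([a-z,;:])</p>\s*<p\b[^>]*>")

loose_on_table = loose.sub(r"\1 \2", with_table)
strict_on_table = strict.sub(r"\1 ", with_table)
strict_on_split = strict.sub(r"\1 ", split_para)

print(loose_on_table)   # table markup is gone
print(strict_on_table)  # unchanged
print(strict_on_split)  # paragraphs joined
```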

patrik 11-17-2021 02:42 PM

Thanks! Much better than my version.

It does, though, catch cases where there should be two paragraphs but a period is missing. I'm not sure if it's possible to tell these "valid" errors apart...?

Tex2002ans 11-17-2021 09:36 PM

Quote:

Originally Posted by patrik (Post 4173339)
Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

For ~9 years, I've been using 3 "join" regexes. They catch ~99% of broken paragraphs, but each match has to be decided on a case-by-case basis.

Here's a PM I wrote a few months ago with examples:

* * *

The 3 main "joins" I currently use:

Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)

1st one looks for a hyphen at the end of a paragraph:

Code:

<p>This is an ex-</p>
<p>ample.</p>

2nd one looks for any paragraph that does NOT end in closing punctuation:

Code:

<p>This is an</p>
<p>example.</p>

<p>This is a list of one,</p>
<p>two, and three.</p>

and 3rd one looks for any leftover paragraphs STARTING with a lowercase letter:

Code:

<blockquote>
        <p>This is a long quote.</p>
</blockquote>

<p>apples, Bananas, Pears...</p>

<p>and Croutons.</p>

Those 3 should catch 99% of the broken paragraphs.
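
As a rough illustration of how the three fit together, here is a minimal Python sketch run on a made-up sample. It replaces everything in one go purely to show the effect; in a real book each hit would be reviewed one at a time. The hyphen join runs first, because the second regex would otherwise also match the hyphen and insert a space. The constant names are hypothetical.

```python
import re

HYPHEN_JOIN = re.compile(r"-</p>\s+<p>")               # join #1
OPEN_JOIN = re.compile(r"([^>”\?\!\.])</p>\s+<p>")     # join #2
LOWERCASE_START = re.compile(r"<p>[a-z]")              # #3, find only

sample = (
    "<p>This is an ex-</p>\n<p>ample.</p>\n"
    "<p>This is a list of one,</p>\n<p>two, and three.</p>\n"
    "<p>and Croutons.</p>"
)

step1 = HYPHEN_JOIN.sub("", sample)     # drop hyphen and paragraph break
step2 = OPEN_JOIN.sub(r"\1 ", step1)    # keep last character, add a space
leftovers = LOWERCASE_START.findall(step2)  # paragraphs still to inspect

print(step2)
print(leftovers)
```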

- - -

Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:

Code:

<p>This is a list:</p>
<p>One, Two, Three</p>

<p>This is a quote:</p>
<p>“Get over here!”</p>

These could be:

Code:

<p>This is a list: One, Two, Three</p>

<p>This is a quote: “Get over here!”</p>

(More in-depth regex might also be needed for ”</p> too, but I don't have any Saved Searches on that. Very rarely do I see those actually get split by Finereader. And usually the "lowercase regex" catches all those.)

- - -

Regex #1 Note: You have to be careful, DO NOT "REPLACE ALL".

Not all hyphens are "soft hyphens". Some need to be replaced with an actual hyphen:

Code:

The proto-</p>

<p>European model of [...]

would need to become:

Code:

The proto-European model of [...]

It's up to you when/how you want to deal with these. You can:

1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)

2. Join all hyphen-broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:

(Regex #1 alt)

Search: -</p>\s+<p>
Replace: -

This would turn:

Code:

<p>This is an ex-</p>
<p>ample.</p>

into:

Code:

<p>This is an ex-ample.</p>
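
The difference between the two replacements is easy to see side by side. A small Python sketch, with made-up sample strings and hypothetical names:

```python
import re

HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

samples = [
    "<p>This is an ex-</p>\n<p>ample.</p>",        # soft hyphen
    "<p>The proto-</p>\n\n<p>European model</p>",  # real hyphen
]

# Regex #1: drop the hyphen along with the paragraph break.
soft = [HYPHEN_BREAK.sub("", s) for s in samples]

# Regex #1 alt: keep a hyphen and sort out the wrong ones later.
hard = [HYPHEN_BREAK.sub("-", s) for s in samples]

for before, s, h in zip(samples, soft, hard):
    print(repr(before), "->", s, "|", h)
```

Neither replacement is right for both samples, which is why the choice is either made hit by hit or deferred to the spellcheck stage.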

Back in 2013, I wrote about how to use "Spellcheck Lists" to catch bad/inconsistent hyphenation:

2013: "How do you deal with soft hyphens in OCR texts?"

Personally, I squash everything one-by-one during cleanup.

Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!

And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the leftover hyphens there are correct OR actual hyphenation errors that snuck into the book, this is much easier. :D

Quote:

Originally Posted by patrik (Post 4173339)
But sometimes Finereader adds table-stuff:

<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex also catches, destroying the table.

Back in 2020, I partially wrote about my "12-step Finereader Cleanup" (Sigil Saved Searches).

Here are the last 5 steps of my Saved Searches, which deal with Finereader tables:

Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>

Clean Bold td
Search: <td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>
Replace: <td>\1</td>

Clean Italics td
Search: <td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>
Replace: <td>\1</td>

Clean td
Search: <td>\s+<p>([^<]+)</p>\s+</td>
Replace: <td>\1</td>

Clean Table Headers
Search: <td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>
Replace: <th colspan="\1">\2</th>
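
For readers who would rather script these than store them as Saved Searches, the five steps can be sketched as an ordered list of search/replace pairs in Python. The sample markup below is invented to exercise each step, and the function and variable names are hypothetical; the patterns themselves are the ones listed above.

```python
import re

# (name, search, replace), in the order given above.
TABLE_CLEANUP = [
    ("Remove Finereader 12 Table Alignment",
     r'<td style="vertical-align:[^"]+">', r"<td>"),
    ("Clean Bold td",
     r'<td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>',
     r"<td>\1</td>"),
    ("Clean Italics td",
     r'<td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>',
     r"<td>\1</td>"),
    ("Clean td",
     r"<td>\s+<p>([^<]+)</p>\s+</td>", r"<td>\1</td>"),
    ("Clean Table Headers",
     r'<td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>',
     r'<th colspan="\1">\2</th>'),
]


def clean_tables(html):
    # Apply every search/replace pair once, in order.
    for _name, search, replace in TABLE_CLEANUP:
        html = re.sub(search, replace, html)
    return html


sample = (
    '<td style="vertical-align:top;">\n'
    '<p><span class="bold">Name</span></p>\n</td>\n'
    '<td>\n<p><span class="italics">Title</span></p>\n</td>\n'
    '<td>\n<p>Plain</p>\n</td>\n'
    '<td colspan="2">\n<p>Header</p>\n</td>'
)

cleaned = clean_tables(sample)
print(cleaned)
```

The order matters: the alignment step has to run first so that the later patterns, which expect a bare <td>, can match.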

* * *

For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.

Then I could open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.

Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:

Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.

Quote:

Originally Posted by patrik (Post 4173339)
Is there any way to catch only what is safe to replace? (And also to catch cases where the last or first word is a valid capitalized word, like "I"?)

Test out my 3 regexes. You'll be pleasantly surprised at how well they work. :)

patrik 11-18-2021 12:05 PM

Tex2002ans, I'm constantly amazed by the amazing posts you write! Thank you very much! :-)


All times are GMT -4.