Go Back   MobileRead Forums > E-Book Software > Sigil

Old 10-12-2020, 03:19 PM   #676
BeckyEbook
Guru
Posts: 664
Karma: 2180702
Join Date: Jan 2017
Location: Poland
Device: Kindle (Key3, PW2, PW3), Nook (ST, GLP), Kobo Touch, Tolino Vision 2
Use:
Code:
<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[([xvi]+|\d+)\]"></a>

Last edited by BeckyEbook; 10-12-2020 at 06:17 PM. Reason: Fix
Old 10-12-2020, 04:45 PM   #677
theducks
Well trained by Cats
Posts: 29,689
Karma: 54369090
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
A WAG
I would say your OR is flawed

Code:
<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"></a>
You want 1 capture for either condition

But it may also be simplified
Use the captured value as part of the second part
title=\1

Last edited by theducks; 10-12-2020 at 04:48 PM. Reason: added simplified
Old 10-12-2020, 07:36 PM   #678
hobnail
Running with scissors
Posts: 1,552
Karma: 14325282
Join Date: Nov 2019
Device: none
Quote:
Originally Posted by BeckyEbook View Post
Use:
Code:
<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[([xvi]+|\d+)\]"></a>
Thanks, that worked, as did theducks' answer (with the added square brackets for the title).


Removing the parentheses from the first part so that it's


Page_[xvi]+|\d+


also works (although no capture to reuse for the title).
(I thought it worked the first time I tried it but just now it did not.)

So why did my extra parentheses screw it up?


I added the parentheses so that it was clear, to me at least, what the OR was for.

Last edited by hobnail; 10-12-2020 at 07:56 PM.
Old 10-12-2020, 08:18 PM   #679
hobnail
Running with scissors
Posts: 1,552
Karma: 14325282
Join Date: Nov 2019
Device: none
Quote:
Originally Posted by hobnail View Post
So why did my extra parentheses screw it up?
I think I understand it; the parentheses are telling the or bar what it's working on. It's not like a regular programming language where you could say "boolean a = (b) | (c);"

So I'm guessing I could add some extra parentheses around it and it would still work, but I haven't tested it; Page_(([xvi]+)|(\d+))

And I didn't know that you could use \1 in the same regexp; I thought you could only use it in the replacement part. That's nice to know.
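
Both points can be checked outside Sigil. The sketch below uses Python's re module, which is not the PCRE engine Sigil uses but treats grouping and backreferences the same way for patterns this simple. The sample lines and variable names are made up for illustration.

```python
import re

line = '<a id="Page_12" class="x-ebookmaker-pageno" title="[12]"></a>'
mismatch = '<a id="Page_12" class="x-ebookmaker-pageno" title="[13]"></a>'

# \1 inside the search pattern must repeat exactly what group 1 captured.
backref = re.compile(
    r'<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[\1\]"></a>'
)

# Extra nested parentheses still match; they only add more capture groups.
nested = re.compile(r'Page_(([xvi]+)|(\d+))')

same_number = backref.fullmatch(line) is not None        # id and title agree
different_number = backref.fullmatch(mismatch) is not None  # id and title differ
nested_groups = nested.search(line).groups()

print(same_number, different_number, nested_groups)
```

With the nested version, the outer group holds the page number and the inner group that did not take part in the match comes back empty (None in Python).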

Last edited by hobnail; 10-12-2020 at 08:21 PM.
Old 10-12-2020, 08:19 PM   #680
DNSB
Bibliophagist
Posts: 34,517
Karma: 144552660
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by hobnail View Post
So why did my extra parentheses screw it up?

I added the parentheses so that it was clear, to me at least, what the OR was for.
In regex, parentheses are special characters, which is why a literal parenthesis has to be escaped with a \. Plain (text) is a capturing group; starting the group with ?:, i.e. (?:text), makes it non-capturing. I seem to remember a 4th variety of parentheses, but I'm not sure which flavour of regex that was in.

Basically, you can't use them as separators as you would in a mathematical expression.
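
A small sketch of the three cases, assuming Python's re module as a stand-in for Sigil's PCRE (the capturing, non-capturing and escaping rules shown here are the same in both). The sample strings are invented.

```python
import re

text = "Page_xvi"

# (…) groups the alternation AND captures what it matched.
capturing = re.search(r'Page_([xvi]+|\d+)', text)

# (?:…) groups the alternation without creating a capture.
non_capturing = re.search(r'Page_(?:[xvi]+|\d+)', text)

# \( and \) match literal parentheses in the text.
escaped = re.search(r'\(text\)', "some (text) here")

print(capturing.groups())      # one captured value
print(non_capturing.groups())  # no captures at all
print(escaped.group(0))
```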
Old 10-12-2020, 08:57 PM   #681
theducks
Well trained by Cats
Posts: 29,689
Karma: 54369090
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
My GUESS is that the FIND was satisfied by the first alternative it could match (no recursive +), which is why the highlight stopped where it did.
Old 10-13-2020, 01:53 AM   #682
davidfor
Grand Sorcerer
Posts: 24,908
Karma: 47303748
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by hobnail View Post
I don't understand why this isn't working; my search string is:

<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)|([\d]+)\]"></a>

When the file contains

<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>

and I click on the Find button, it highlights only

<a id="Page_i

What's wrong with my regexp?
The "|" is basically an "or". Your regex basically searches for a match to one of:

Code:
<a id="Page_([xvi]+)
Code:
([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)
Code:
([\d]+)\]"></a>
I think you want:

Code:
<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>
The two groups are the page number and title in either of the formats.
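
The split can be reproduced outside Sigil. This is only a sketch using Python's re module rather than Sigil's PCRE engine; alternation precedence works the same way in both, and the variable names are made up for illustration.

```python
import re

line = '<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>'

# Original search string: each top-level "|" splits the WHOLE expression,
# so this is really three separate alternatives.
ungrouped = (r'<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" '
             r'title="\[([xvi]+)|([\d]+)\]"></a>')

# Corrected search string: each "|" is confined to its own parentheses.
grouped = (r'<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" '
           r'title="\[([xvi]+|[\d]+)\]"></a>')

partial = re.search(ungrouped, line).group(0)
full = re.search(grouped, line)

print(partial)        # only the start of the tag, matched by the first alternative
print(full.group(0))  # the whole tag
print(full.groups())  # page number from the id and from the title
```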

Last edited by davidfor; 10-13-2020 at 02:07 AM. Reason: Remember to refresh before replying....
Old 01-18-2021, 04:57 AM   #683
Skydancer
Enthusiast
Posts: 30
Karma: 10
Join Date: Mar 2019
Location: Slovenia
Device: PocketBoot Inkpad 3
How can I transform uppercase text into lowercase text between tags with RegEx?

Example
before:
Code:
<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>
after:
Code:
<p class="tibTrans">la ma nam dang ji dam kjil khor lha</p>
I tried this, but it doesn't work in Sigil:
Code:
Find: <p class="tibTrans">(.*?)<\/p>
Code:
Replace: <p class="tibTrans">\L$1<\/p>

Last edited by Skydancer; 01-18-2021 at 05:30 AM.
Old 01-18-2021, 06:09 AM   #684
Doitsu
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by Skydancer View Post
I tried this, but it doesn't work in Sigil:
Code:
Find: <p class="tibTrans">(.*?)<\/p>
Code:
Replace: <p class="tibTrans">\L$1<\/p>
Sigil uses the PCRE regex library; you'll need to use backslashes for backreferences.

Code:
Replace: <p class="tibTrans">\L\1<\/p>
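
For anyone doing the same conversion in a script instead of Sigil's Find & Replace: Python's re module has no \L case operator in replacement strings, so the sketch below swaps in a replacement function to get the same result. The function name lower_body is hypothetical.

```python
import re

html = '<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>'

# Capture the opening tag, the text, and the closing tag separately
# so that only the text between the tags is changed.
pattern = re.compile(r'(<p class="tibTrans">)(.*?)(</p>)')

def lower_body(match):
    """Rebuild the paragraph with its text lowercased and its tags untouched."""
    return match.group(1) + match.group(2).lower() + match.group(3)

result = pattern.sub(lower_body, html)
print(result)
```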
Old 01-20-2021, 12:19 AM   #685
hobnail
Running with scissors
Posts: 1,552
Karma: 14325282
Join Date: Nov 2019
Device: none
Quote:
Originally Posted by davidfor View Post
The "|" is basically an "or". Your regex basically searches for a match to one of:

Code:
<a id="Page_([xvi]+)
Code:
([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)
Code:
([\d]+)\]"></a>
I think you want:

Code:
<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>
The two groups are the page number and title in either of the formats.
Sorry, somehow I missed your reply (or maybe I forgot that I read it, also quite likely), so this is a belated thanks.
Old 11-17-2021, 12:23 PM   #686
patrik
Guru
Posts: 647
Karma: 4566069
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

Doing a regex like this:

search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2

seems to work. But sometimes Finereader adds table-stuff:


<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex catches and destroys the table.

Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)

Last edited by patrik; 11-17-2021 at 12:32 PM.
Old 11-17-2021, 01:06 PM   #687
bravosx
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
Try it out, for me it connects paragraphs. You can remove the characters you don't want, e.g. Polish characters.

search: ([[:alpha:],ą,ć,ę,ł,ń,ó,ś,ź,ż,,,;,:,-,–,—,“,”,])</p>\s*<p\b[^>]*>
replace: \1
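
Why the whitespace-only gap leaves the table alone can be seen in a short Python sketch. Several assumptions apply: Python's re does not support POSIX classes such as [[:alpha:]], so a plain [a-z] stands in for the long character class; the second letter is captured and a space is put back so the words stay separated; and re.DOTALL is used to mimic a search where "." also matches line breaks. The names any_gap and whitespace_gap are made up.

```python
import re

split_para = '<p>This is a journey</p>\n\n<p>into sound.</p>'
with_table = ('<p>This is a journey</p>\n<table border="1">\n<tbody>\n<tr>\n'
              '<td></td>\n\n<td>\n<p>into sound.</p>')

# ".*?" accepts ANYTHING between the two paragraphs, including table markup.
any_gap = re.compile(r'([a-z])</p>.*?<p>([a-z])', re.DOTALL)

# "\s*" accepts only whitespace, so a <table> in between blocks the match.
whitespace_gap = re.compile(r'([a-z])</p>\s*<p\b[^>]*>([a-z])')

table_destroyed = any_gap.sub(r'\1 \2', with_table)
table_kept = whitespace_gap.sub(r'\1 \2', with_table)
joined = whitespace_gap.sub(r'\1 \2', split_para)

print(table_destroyed)
print(table_kept == with_table)
print(joined)
```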
Old 11-17-2021, 01:42 PM   #688
patrik
Guru
Posts: 647
Karma: 4566069
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
Thanks! Much better than my version.

Though, it does catch cases where there should be two paragraphs but a period is missing, not sure if it's possible to differentiate between these "valid" errors...?
Old 11-17-2021, 08:36 PM   #689
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by patrik View Post
Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>
For ~9 years, I've been using 3 "join" regexes. They catch ~99% of broken paragraphs, but each match has to be decided on a case-by-case basis.

Here's a PM I wrote a few months ago with examples:

* * *

The 3 main "joins" I currently use:

Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)

1st one looks for a hyphen at the end of a paragraph:

Code:
<p>This is an ex-</p>
<p>ample.</p>
2nd one looks for any paragraph that does NOT end in closing punctuation:

Code:
<p>This is an</p>
<p>example.</p>

<p>This is a list of one,</p>
<p>two, and three.</p>
and 3rd one looks for any leftover paragraphs STARTING with a lowercase letter:

Code:
<blockquote>
	<p>This is a long quote.</p>
</blockquote>

<p>apples, Bananas, Pears...</p>

<p>and Croutons.</p>
Those 3 should catch 99% of the broken paragraphs.
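
As a rough illustration of how the three searches behave, here they are run through Python's re module (a stand-in for Sigil's regex engine, which may differ in other details). Note that sub() replaces every match at once, which is fine for a toy sample but is not how these are meant to be used on a real book, where each hit should be checked. The constant names and the sample text are made up.

```python
import re

HYPHEN_JOIN = re.compile(r'-</p>\s+<p>')                # regex 1: hyphen at paragraph end
OPEN_JOIN = re.compile(r'([^>”\?\!\.])</p>\s+<p>')      # regex 2: no closing punctuation
LOWER_START = re.compile(r'<p>[a-z]')                   # regex 3: find only, never replace

sample = (
    '<p>This is an ex-</p>\n<p>ample.</p>\n'
    '<p>This is an</p>\n<p>example.</p>\n'
    '<p>and Croutons.</p>\n'
)

step1 = HYPHEN_JOIN.sub('', sample)       # replace with nothing
step2 = OPEN_JOIN.sub(r'\1 ', step1)      # keep the last character, add a space
leftovers = LOWER_START.findall(step2)    # paragraphs still starting in lowercase

print(step2)
print(leftovers)
```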

- - -

Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:

Code:
<p>This is a list:</p>
<p>One, Two, Three</p>

<p>This is a quote:</p>
<p>“Get over here!”</p>
These could be:

Code:
<p>This is a list: One, Two, Three</p>

<p>This is a quote: “Get over here!”</p>
(More in-depth regex might also be needed for ”</p> too, but I don't have any Saved Searches on that. Very rarely do I see those actually get split by Finereader. And usually the "lowercase regex" catches all those.)

- - -

Regex #1 Note: You have to be careful, DO NOT "REPLACE ALL".

Not all hyphens are "soft hyphens". Some need to be replaced with an actual hyphen:

Code:
The proto-</p>

<p>European model of [...]
would need to become:

Code:
The proto-European model of [...]
It's up to you when/how you want to deal with these. You can:

1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)

2. Replace all broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:

(Regex #1 alt)

Search: -</p>\s+<p>
Replace: -

This would get you:

Code:
<p>This is an ex-</p>
<p>ample.</p>

<p>This is an ex-ample.</p>
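
A minimal sketch of that second approach in Python, assuming the re module as a stand-in for Sigil's engine: keep every hyphen when joining, then list the hyphenated words so they can be reviewed later. The word-listing step is an illustration only, not the Spellcheck List feature itself, and the names are made up.

```python
import re

HARD_JOIN = re.compile(r'-</p>\s+<p>')   # regex #1 alt: join but keep the hyphen

sample = ('<p>This is an ex-</p>\n<p>ample.</p>\n'
          '<p>The proto-</p>\n\n<p>European model.</p>')

joined = HARD_JOIN.sub('-', sample)

# Collect every hyphenated word for a later manual check:
# some hyphens are real ("proto-European"), some are leftovers ("ex-ample").
hyphenated = re.findall(r'\b\w+-\w+\b', joined)

print(joined)
print(hyphenated)
```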
Back in 2013, I wrote how to use "Spellcheck Lists" to catch bad/inconsistent hyphenation:

2013: "How do you deal with soft hyphens in OCR texts?"

Personally, I squash everything one-by-one during cleanup.

Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!

And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the leftover hyphens there are correct OR actual hyphenation errors that snuck into the book, this is much easier.

Quote:
Originally Posted by patrik View Post
But sometimes Finereader adds table-stuff:

<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex catches and destroys the table.
Back in 2020, I partially wrote about my "12-step Finereader Cleanup" (Sigil Saved Searches).

Here's the last 5 steps of my Saved Searches dealing with Finereader tables:

Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>

Clean Bold td
Search: <td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>
Replace: <td>\1</td>

Clean Italics td
Search: <td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>
Replace: <td>\1</td>

Clean td
Search: <td>\s+<p>([^<]+)</p>\s+</td>
Replace: <td>\1</td>

Clean Table Headers
Search: <td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>
Replace: <th colspan="\1">\2</th>
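
The five searches are order-dependent (the alignment attribute has to be stripped before the plain td patterns can match), which a short Python sketch makes visible. The patterns are copied from the list above; running them through Python's re instead of Sigil's Saved Searches, the helper name clean_tables and the sample markup are all assumptions for illustration.

```python
import re

# (search, replace) pairs in the same order as the Saved Searches above.
TABLE_CLEANUPS = [
    (r'<td style="vertical-align:[^"]+">', r'<td>'),
    (r'<td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>', r'<td>\1</td>'),
    (r'<td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>', r'<td>\1</td>'),
    (r'<td>\s+<p>([^<]+)</p>\s+</td>', r'<td>\1</td>'),
    (r'<td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>', r'<th colspan="\1">\2</th>'),
]

def clean_tables(html):
    """Apply each cleanup in turn; later steps rely on earlier ones."""
    for search, replace in TABLE_CLEANUPS:
        html = re.sub(search, replace, html)
    return html

sample = (
    '<td style="vertical-align:top;">\n<p>into sound.</p>\n</td>\n'
    '<td colspan="2">\n<p>Heading</p>\n</td>'
)

cleaned = clean_tables(sample)
print(cleaned)
```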

* * *

For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.

Then I could open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.

Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:

Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.

Quote:
Originally Posted by patrik View Post
Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)
Test out my 3 regexes. You'll be pleasantly surprised at how well it works.

Last edited by Tex2002ans; 11-18-2021 at 02:47 PM.
Old 11-18-2021, 11:05 AM   #690
patrik
Guru
Posts: 647
Karma: 4566069
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
Tex2002ans, I'm constantly amazed by the posts you write! Thank you very much! :-)
All times are GMT -4. The time now is 10:30 AM.