Old 06-17-2022, 08:58 PM   #1
jordy1955
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
Need help with regex

Hi,
Firstly let me say that I am a very rudimentary user of regex. Most of it is beyond my comprehension.

I have some eBooks that were clearly produced by less than spectacular OCR software.

Accordingly, the formatting ranges from quite good to really bad.

One of the main problems is line breaks in the wrong places (e.g. in the middle of a sentence), making the text very difficult to follow.

In Find & Replace I have used [a-z]</p><p class="calibre_1"> (or similar) to find these instances quite successfully. The problem is that the whole match is selected, including the letter matched by [a-z], and I cannot for the life of me work out how to get the replace function to leave that letter alone. Without that, fixing all the errors can take hundreds of manual interventions.

Any assistance is gratefully accepted.

thanks

Paul

Last edited by jordy1955; 06-17-2022 at 09:02 PM.
Old 06-17-2022, 09:25 PM   #2
Sarmat89
Fanatic
Posts: 518
Karma: 2268308
Join Date: Nov 2015
Device: none
Use
(?<=\p{Ll})</p>\s*<p class="...">
Old 06-17-2022, 09:54 PM   #3
jordy1955
Quote:
Originally Posted by Sarmat89 View Post
Use
(?<=\p{Ll})</p>\s*<p class="...">
Thanks for this. I have no idea what it does, but I'm guessing that it somehow ignores the text captured BEFORE the '</p><p class="calibre_1">'.

Thank you.
Old 06-17-2022, 10:08 PM   #4
jordy1955
Quote:
Originally Posted by Sarmat89 View Post
Use
(?<=\p{Ll})</p>\s*<p class="...">
OK, I tried this (copied and pasted from this thread) in the Find field and it does not find anything, whereas [a-z]</p><p class="calibre_1"> does...

Is there something I'm missing?
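
One possible explanation, offered as a guess rather than a diagnosis: the "..." in the suggested pattern reads like a placeholder for the real class name, and in a regex three dots match exactly three characters, so pasted as-is it cannot match "calibre_1". The Python sketch below shows the lookbehind idea with the class name from the first post filled in. It uses the standard-library re module, which has no \p{Ll}, so [a-z] stands in for it; the sample HTML and the function name join_split_paragraphs are made up for illustration.

```python
import re

# Made-up sample of the OCR damage described above.
html = '<p class="calibre_1">The quick brown</p>\n<p class="calibre_1">fox jumped.</p>'

# Pasted as-is: each "." matches one character, so "..." cannot match "calibre_1".
as_posted = re.compile(r'(?<=[a-z])</p>\s*<p class="...">')

# With the class name filled in. The lookbehind (?<=[a-z]) only checks that a
# lowercase letter comes before </p>; the letter is not part of the match.
with_class = re.compile(r'(?<=[a-z])</p>\s*<p class="calibre_1">')


def join_split_paragraphs(text: str) -> str:
    """Replace the paragraph break with one space, leaving the letter alone."""
    return with_class.sub(" ", text)


print(as_posted.search(html))
print(join_split_paragraphs(html))
```

Because the letter sits inside the lookbehind, the replace field only needs a single space; nothing has to be put back.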
Old 06-17-2022, 10:12 PM   #5
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by Sarmat89 View Post
Use
(?<=\p{Ll})</p>\s*<p class="...">
That doesn't work for me and I can't work out what the lookbehind is supposed to do.

I use:

Code:
([\w,—])</p>\s*<p\s*[^>]*?>([\w])
with the replace of:

Code:
\1 \2
It doesn't catch everything (it probably should have the other types of dash) but it catches most.
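
To try these two strings outside the editor, here is a minimal Python sketch. It uses the standard-library re module (an assumption made for convenience; the editor's engine may differ in details), and the sample HTML is made up.

```python
import re

# The find and replace strings from the post above.
FIND = re.compile(r'([\w,—])</p>\s*<p\s*[^>]*?>([\w])')
REPLACE = r'\1 \2'

# Made-up sample: one sentence split across two paragraphs, then a real break.
html = (
    '<p class="calibre_1">It was a dark and stormy</p>\n'
    '<p class="calibre_1">night, and the rain fell.</p>\n'
    '<p class="calibre_1">The next morning was clear.</p>'
)

fixed = FIND.sub(REPLACE, html)
print(fixed)
```

The second break is left alone because the paragraph before it ends in a full stop, which is not in the first character class.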
Old 06-17-2022, 10:18 PM   #6
jordy1955
This is what my query returns.

I need to exclude the single character (in this case the "E"), either from the search result or in the replace function.
[Attached screenshot: Screenshot 2022-06-18 114406.jpg]

Last edited by jordy1955; 06-17-2022 at 10:20 PM. Reason: typo
Old 06-17-2022, 10:22 PM   #7
jordy1955
Quote:
Originally Posted by davidfor View Post
That doesn't work for me and I can't work out what the lookbehind is supposed to do.

I use:

Code:
([\w,—])</p>\s*<p\s*[^>]*?>([\w])
with the replace of:

Code:
\1 \2
It doesn't catch everything (it probably should have the other types of dash) but it catches most.

This works, BUT it also returns the first character of the following word (see image).

How then do I exclude the unwanted characters in the replace field? I've got no idea what the \1 \2 means.
[Attached screenshot: Screenshot 2022-06-18 115204.jpg]

Last edited by jordy1955; 06-17-2022 at 10:24 PM.
Old 06-17-2022, 10:33 PM   #8
jordy1955
Quote:
Originally Posted by jordy1955 View Post
This works, BUT it also returns the first character of the following word (see image).

How then do I exclude the unwanted characters in the replace field? I've got no idea what the \1 \2 means.
I did say that I'm pretty basic in my understanding of regex... I just realised what the \1 \2 does. Tested it and it works beautifully.

Thank you so much. You have saved me hours of manual intervention and frustration.
Old 06-17-2022, 11:03 PM   #9
davidfor
Grand Sorcerer
davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.davidfor ought to be getting tired of karma fortunes by now.
 
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by jordy1955 View Post
I did say that I'm pretty basic in my understanding of regex... I just realised what the \1 \2 does. Tested it and it works beautifully.

Thank you so much. You have saved me hours of manual intervention and frustration.
The parentheses are "capture groups". In the replacement, "\1" refers to the text matched by the first group, "\2" to the second, and so on.
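
A small Python sketch of what the groups hold, using the standard-library re module and a made-up sample string. It also shows why the whole match, which is what gets highlighted, includes the letters on either side of the break.

```python
import re

pattern = re.compile(r'([\w,—])</p>\s*<p\s*[^>]*?>([\w])')

# Made-up sample text.
html = '<p class="calibre_1">then she</p><p class="calibre_1">Emerged from the fog.</p>'

m = pattern.search(html)
print(m.group(0))  # the whole match, letters included
print(m.group(1))  # text captured by the first pair of parentheses
print(m.group(2))  # text captured by the second pair of parentheses

# Filling in the replacement template for this one match.
print(m.expand(r'\1 \2'))
```

The two letters are matched and then written straight back by the replacement, so only the tags between them disappear.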

Another I have used recently was:

Code:
([[:lower:]])\s*</p>\s*<p>\s*([[:lower:]])
That is for when the paragraph ends in a lowercase letter and the next starts with a lowercase letter, possibly with whitespace around the tags. For that, I am sure it is a paragraph that has been split. For the first one, I generally look at each match to check what is actually intended.

This one doesn't cater for the class attribute. If I am doing this amount of fixing, I first remove the class from the normal paragraphs. Any paragraphs that still have a class probably carry other formatting that I don't want to lose.
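
A rough Python sketch of that two-step approach, under stated assumptions: "calibre_1" is taken from the first post as the normal-paragraph class, "centered" is a hypothetical class standing for real formatting, and [a-z] replaces [[:lower:]] because the standard-library re module does not support POSIX classes.

```python
import re

# Made-up sample text.
html = (
    '<p class="calibre_1">The road went on</p>\n'
    '<p class="calibre_1">and on for miles.</p>\n'
    '<p class="centered">* * *</p>'
)

# Step 1: drop the class from ordinary paragraphs only.
plain = html.replace('<p class="calibre_1">', '<p>')

# Step 2: join paragraphs split between two lowercase letters.
# [a-z] is a stand-in for [[:lower:]].
split_para = re.compile(r'([a-z])\s*</p>\s*<p>\s*([a-z])')
joined = split_para.sub(r'\1 \2', plain)
print(joined)
```

The paragraph with the other class is untouched by both steps, so its formatting survives.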
Old 06-18-2022, 12:00 AM   #10
enuddleyarbl
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
Quote:
Originally Posted by davidfor View Post
...I use:

Code:
([\w,—])</p>\s*<p\s*[^>]*?>([\w])
with the replace of:

Code:
\1 \2
It doesn't catch everything (it probably should have the other types of dash) but it catches most.
It might be a dependency of the editor, but in Calibre's you'd have to escape the "/" in "</p>". So, it would be:

Code:
([\w,—])<\/p>\s*<p\s*[^>]*?>([\w])
BTW: it's an amazing thing, but I came here looking for the exact same thing just now.
Old 06-18-2022, 12:17 AM   #11
enuddleyarbl
@jordy1955: I've been using

https://regex101.com/

to try various regex things and see what they do. It's been a lot of help.

One thing to note, though, the replacement character they use there is a $ instead of the \ used in Calibre's editor. So, if you wanted to test davidfor's replacement string of:

Code:
\1 \2
there, you'd have to use:

Code:
$1 $2
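
For comparison, a short sketch of how Python's standard-library re module treats the two spellings (the sample string is made up). Only the backslash form is expanded; the dollar form is inserted as literal text.

```python
import re

pattern = r'([\w,—])</p>\s*<p\s*[^>]*?>([\w])'

# Made-up sample text.
html = '<p>split in the</p><p>middle</p>'

with_backslash = re.sub(pattern, r'\1 \2', html)  # group references expanded
with_dollar = re.sub(pattern, '$1 $2', html)      # inserted literally

print(with_backslash)
print(with_dollar)
```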
Old 06-18-2022, 01:42 AM   #12
davidfor
Quote:
Originally Posted by DaveLessnau View Post
It might be a dependency of the editor, but in Calibre's you'd have to escape the "/" in "</p>". So, it would be:

Code:
([\w,—])<\/p>\s*<p\s*[^>]*?>([\w])
No, it's a regex flavor thing. Calibre uses Python's regex, which doesn't need the forward slash to be escaped. In other setups, such as PCRE patterns written between / delimiters, it does need to be escaped.
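
As a quick check with Python's standard-library re module (a sketch on a made-up sample string), the escaped and unescaped spellings match exactly the same text, so the backslash is harmless but unnecessary there.

```python
import re

# Made-up sample text.
html = '<p>one</p>\n<p>two</p>'

unescaped = re.compile(r'([\w,—])</p>\s*<p\s*[^>]*?>([\w])')
escaped = re.compile(r'([\w,—])<\/p>\s*<p\s*[^>]*?>([\w])')

span_unescaped = unescaped.search(html).span()
span_escaped = escaped.search(html).span()

print(span_unescaped)
print(span_escaped)
```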

Quote:
Originally Posted by DaveLessnau View Post
@jordy1955: I've been using

https://regex101.com/

to try various regex things and see what they do. It's been a lot of help.

One thing to note, though, the replacement character they use there is a $ instead of the \ used in Calibre's editor. So, if you wanted to test davidfor's replacement string of:

Code:
\1 \2
there, you'd have to use:

Code:
$1 $2
Again, it is the regex flavor. You should choose Python under "Flavor" to test calibre regexes.
Old 06-18-2022, 01:59 AM   #13
jordy1955
Awesome stuff guys. Just ran it on a book and - once I got my head around it properly - I completed the editing and re-formatting in about 1hr - about 4 hours less than it usually takes me.
I'll get much quicker with practice but this is great.

Again, thanks SO MUCH.

Paul
Old 06-18-2022, 02:20 AM   #14
gbm
Wizard
Posts: 2,188
Karma: 8888888
Join Date: Jun 2010
Device: Kobo Clara HD,Hisence Sero 7 Pro RIP, Nook STR, jetbook lite
Link to the calibre Quick reference for regexp syntax.

bernie

Quote:
Originally Posted by jordy1955 View Post
Thanks Guys, Awesome!
Just fixed a book in about 1hr where it would have taken me about 5hrs previously - fixing one instance at a time...

I'll get quicker after some practice.
Again, thank you SO MUCH!

Paul
Old 06-18-2022, 09:37 AM   #15
enuddleyarbl
Quote:
Originally Posted by davidfor View Post
No, it's a regex flavor thing. Calibre uses Python's regex, which doesn't need the forward slash to be escaped. In other setups, such as PCRE patterns written between / delimiters, it does need to be escaped.

Again, it is the regex flavor. You should choose Python under "Flavor" to test calibre regexes.
I just double-checked, and you're right. I didn't know I could change the behavior of that regex101 site by changing the flavor and I think that's where I got the idea the Calibre editor didn't like those / symbols without escaping.

This has been a productive thread for me: I found a much better search/replace for fixing badly split paragraphs, I learned that I could change the behavior of the regex101 site to match Calibre's editor, and some of the search strings I use will be easier now that I won't have to escape the / character. Thanks.
All times are GMT -4.