#1 |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
Need help with regex
Hi,
Firstly, let me say that I am a very rudimentary user of regex; most of it is beyond my comprehension. I have some eBooks that were clearly produced by less than spectacular OCR software, so the formatting ranges from quite good to really bad. One of the main problems is line breaks in the wrong places (e.g. in the middle of a sentence), which make the text very difficult to follow.
In F&R I have used "[a-z]</p><p class="calibre_1">", or similar, to find these instances quite successfully. The problem is that the entire match is selected, and I cannot for the life of me work out how to get the replace function to leave the [a-z] part of the match alone. Without that, fixing all the errors can take hundreds of manual interventions.
Any assistance is gratefully accepted.
Thanks,
Paul
Last edited by jordy1955; 06-17-2022 at 09:02 PM. |
#2 |
Fanatic
Posts: 518
Karma: 2268308
Join Date: Nov 2015
Device: none
|
Use
(?<=\p{Ll})</p>\s*<p class="..."> |
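For anyone wondering what the lookbehind does: `(?<=...)` checks that a lowercase letter sits just before `</p>` without making that letter part of the match, so the letter is not selected or replaced. Here is a minimal sketch in Python. It assumes `[a-z]` as a stand-in for `\p{Ll}` (Python's built-in `re` module does not support `\p{...}`) and uses the class name `calibre_1` from the first post; the function name and sample text are made up for illustration.

```python
import re

# Assumption: [a-z] stands in for \p{Ll}, which the built-in re module lacks.
# Assumption: the paragraph class is "calibre_1", as quoted in the first post.
BROKEN_PARAGRAPH = re.compile(r'(?<=[a-z])</p>\s*<p class="calibre_1">')

def join_broken_paragraphs(html: str) -> str:
    """Replace a paragraph break that follows a lowercase letter with a space."""
    return BROKEN_PARAGRAPH.sub(" ", html)

# Hypothetical sample: one sentence split across two paragraphs.
sample = '<p class="calibre_1">The quick brown</p>\n<p class="calibre_1">fox jumped.</p>'
joined = join_broken_paragraphs(sample)
print(joined)  # <p class="calibre_1">The quick brown fox jumped.</p>
```

Because the letter is only looked at and never matched, the replace field just needs a single space.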
#3 |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
|
#4 |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
|
#5 |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
That doesn't work for me, and I can't work out what the lookbehind is supposed to do.
I use this for the search: Code:
([\w,—])</p>\s*<p\s*[^>]*?>([\w])
and this for the replace: Code:
\1 \2 |
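The two expressions above work as a pair: each set of parentheses in the search captures one character, and `\1` and `\2` in the replace put those characters back with a space between them. So the characters on either side of the break are selected, but they are written back unchanged. Below is a rough sketch of the same pair run through Python's `re` module; the sample text is invented, and Calibre's editor may differ from plain Python in small ways.

```python
import re

# Search and replace strings copied from the post above.
SEARCH = re.compile(r'([\w,—])</p>\s*<p\s*[^>]*?>([\w])')
REPLACE = r'\1 \2'  # group 1, a space, then group 2

# Hypothetical sample: a sentence broken after a comma.
sample = '<p class="calibre_1">It was a dark,</p>\n<p class="calibre_1">stormy night.</p>'

match = SEARCH.search(sample)
print(match.group(1))  # ,  (last character before the break)
print(match.group(2))  # s  (first character after the break)

fixed = SEARCH.sub(REPLACE, sample)
print(fixed)  # <p class="calibre_1">It was a dark, stormy night.</p>
```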
#6 |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
This is what my query returns.
I need to exclude the single character (in this case the "E"), either from the search result or in the replace function.
Last edited by jordy1955; 06-17-2022 at 10:20 PM. Reason: typo |
#7 | |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
Quote:
This works, BUT it also returns the first character of the following word (see image).
How then do I exclude the unwanted characters in the replace field? I've got no idea what the \1 \2 means.
Last edited by jordy1955; 06-17-2022 at 10:24 PM. |
|
#8 | |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
Quote:
Thank you so much. You have saved me hours of manual intervention and frustration. |
|
#9 | |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
Another I have used recently was: Code:
([[:lower:]])\s*</p>\s*<p>\s*([[:lower:]])
This one doesn't cater for the class. If I am doing this amount of fixing, I remove the class for the normal paragraph. If there are any left, it probably means there is other formatting that I probably don't want to lose. |
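A quick way to see what this stricter search does is to run it outside the editor. The sketch below makes two assumptions: `[a-z]` replaces `[[:lower:]]` (the built-in Python `re` module has no POSIX classes), and the replacement is the same `\1 \2` used with the earlier search, since none is given for this one. The function name and sample are hypothetical.

```python
import re

# Assumption: [a-z] stands in for the POSIX class [[:lower:]].
SEARCH = re.compile(r'([a-z])\s*</p>\s*<p>\s*([a-z])')

def join_lowercase_breaks(html: str) -> str:
    """Join paragraphs only when lowercase letters sit on both sides of the break."""
    # Assumption: same replacement string as the earlier search, \1 \2.
    return SEARCH.sub(r'\1 \2', html)

# Hypothetical sample: the paragraph class has already been removed.
sample = '<p>the cat sat on </p>\n<p>the mat.</p>\n<p>A new paragraph.</p>'
result = join_lowercase_breaks(sample)
print(result)
# <p>the cat sat on the mat.</p>
# <p>A new paragraph.</p>
```

Requiring a lowercase letter on both sides leaves the break after "mat." alone, because a full stop followed by a capital looks like a genuine paragraph boundary.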
|
#10 | |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
Code:
([\w,—])<\/p>\s*<p\s*[^>]*?>([\w]) |
|
#11 |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
@jordy1955: I've been using
https://regex101.com/ to try various regex things and see what they do. It's been a lot of help. One thing to note, though: the replacement character they use there is $ instead of the \ used in Calibre's editor. So, if you wanted to test davidfor's replacement string of: Code:
\1 \2
you would enter it there as: Code:
$1 $2 |
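If you end up moving replacement strings between the two styles a lot, the conversion is mechanical. This is only a small sketch with a made-up helper name, `to_dollar_style`; it handles plain numbered references like `\1` and nothing fancier (it does not try to cope with escaped backslashes).

```python
import re

def to_dollar_style(replacement: str) -> str:
    r"""Rewrite numbered references such as \1 into the $1 form (hypothetical helper)."""
    # Match a backslash followed by digits and emit "$" plus the same digits.
    return re.sub(r'\\(\d+)', r'$\1', replacement)

converted = to_dollar_style(r'\1 \2')
print(converted)  # $1 $2
```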
#12 | ||
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
Quote:
|
||
#13 |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2021
Device: Kindle
|
Awesome stuff guys. Just ran it on a book and - once I got my head around it properly - I completed the editing and re-formatting in about 1hr - about 4 hours less than it usually takes me.
I'll get much quicker with practice but this is great. Again, thanks SO MUCH. Paul |
#14 |
Wizard
Posts: 2,188
Karma: 8888888
Join Date: Jun 2010
Device: Kobo Clara HD,Hisence Sero 7 Pro RIP, Nook STR, jetbook lite
|
|
#15 | |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
This has been a productive thread for me: I found a much better search/replace for fixing badly split paragraphs, I learned that I could change the behavior of the regex101 site to match Calibre's editor, and some of the search strings I use will be easier now that I won't have to escape the / character. Thanks. |
|