Old 03-05-2024, 10:07 AM   #1
uhi711
Member
uhi711 began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
Help with regexp

Hi,

Can somebody help me build a regexp for the following Search and (maybe) Replace problem?
I have some PDF books I've converted to AZW3 to use on my Kindle Oasis. The conversion works fine, but the converted files have some problems that, I think, I have to correct manually: some sentences are split into several parts by the conversion module. In the HTML files I can see two strings, one constant and one dynamic, that sometimes appear in the middle of a sentence. Usually, replacing these strings with a space fixes the split. The problem is that the same strings also appear in many places in the HTML files where they are needed, so I can't replace all of them. That's why I think I have to do the Replace part manually, but I would like to find all occurrences of both strings in a single search and decide at each position whether the replacement is needed.

I'm new to Calibre and I don't know Python, but I would like to build a search expression to replace, if needed, the following strings:

</p> <p class="calibre1"> (this string is not a problem to find and replace because it is not changing)

OR

</p> <p class="calibre1"><a id="p128"></a> (this string is dynamic, the number after the "p and its number of digits change).

So, I would like to use this expression in the calibre editor, with the Find and Replace module, probably in Mode Regex, to find all occurrences of these strings and eventually replace them with a space.
Any help will be much appreciated.

Thank you in advance,
Daniel
Old 03-05-2024, 12:06 PM   #2
theducks
Well trained by Cats
 
Posts: 29,820
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Code:
</p> <p class="calibre1"><a id="p\d+"></a>
\d+ matches 1 or more digits
Anchors should not cause issues. They should be invisible.

OTOH a link to a missing anchor can be problematic (besides not working)
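To catch both strings from the first post in one search, the anchor part can be made optional. Here is a minimal sketch in Python's standard re module. I am assuming the calibre editor's Regex mode treats these constructs the same way, and the sample HTML is made up for illustration:

```python
import re

# (?:...)? makes the whole anchor optional, so one pattern matches
# the constant string and the dynamic one.
pattern = re.compile(r'</p> <p class="calibre1">(?:<a id="p\d+"></a>)?')

# Hypothetical sample text, not taken from a real book.
sample = (
    '<p class="calibre1">he went</p> <p class="calibre1">to the store</p> '
    '<p class="calibre1">and came</p> <p class="calibre1"><a id="p128"></a>back.</p>'
)

matches = [m.group(0) for m in pattern.finditer(sample)]
for found in matches:
    print(found)
```

In the editor you would paste only the pattern into the Find box and step through the hits one at a time, replacing with a space where the sentence was split.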
Old 03-05-2024, 12:22 PM   #3
retiredbiker
Addict
 
Posts: 387
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
When you go to convert a pdf, anything can happen, because a pdf can hold just about anything. If you are getting decent text results, it means the pdf has some text that somebody put there, probably from running Optical Character Recognition (OCR) on it. And you are lucky. Many pdf files won't convert at all or will give horrible results. See the sticky post here: https://www.mobileread.com/forums/sh...d.php?t=118605

You will get all sorts of repeating glitches in these conversions, so you will need a variety of Regex search and replace strings to deal with them. There is a really good Regex tutorial specifically for Calibre in the Manual: https://manual.calibre-ebook.com/reg...regexptutorial You will need to be flexible, so you really need to learn some simple Regex---but this sort of editing will mostly use very simple searches.

Just to get you started: that problem with paragraphs ending in the wrong place, or each line being a paragraph, happens because pdf has no concept of a paragraph. It is more like a picture of the page. Turn on heuristic processing during conversion and it will probably fix many or most of these.

So you will probably still find some paragraphs ending with a lower case letter and the next starting with one:
Code:
...he went</p> <p class="calibre1">to the store...
As you say, removing the </p> <p class="calibre1"> will fix this, but if you remove every paragraph end-start, you will ruin the book. So look for a lower case letter, end para, maybe some space, and a start para with a first lower case letter:
Code:
([a-z])</p>\s+<p class="calibre1">([a-z])
Explanation: () traps what regex finds. ([a-z]) finds a lower case letter and remembers the letter. </p> and <p class="calibre1"> are just constants in the search. \s is any whitespace character (space, tab or line break), and \s+ is one or more of them. The second ([a-z]) remembers the second lower case letter.
You want to replace this with
Code:
\1 \2
where \1 is the letter remembered from the first ([a-z]) and \2 is the letter from the second ([a-z]). Note the space between them!
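If you want to try the search and replacement outside the editor first, this is a rough Python equivalent using the standard re module. The sample strings are invented for the test:

```python
import re

# Same search and replacement as above.
pattern = re.compile(r'([a-z])</p>\s+<p class="calibre1">([a-z])')

# A sentence split in the middle: lower case letter on both sides of the break.
broken = '<p class="calibre1">...he went</p> <p class="calibre1">to the store...</p>'
fixed = pattern.sub(r'\1 \2', broken)
print(fixed)

# A real paragraph break (full stop, then a capital) is left untouched.
proper = '<p class="calibre1">The end.</p> <p class="calibre1">Next day...</p>'
unchanged = pattern.sub(r'\1 \2', proper)
print(unchanged == proper)
```

The two remembered letters are put back with a single space between them, so only the paragraph tags disappear.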

So set this up and carefully go into it one find/replace at a time to make sure it is working as expected. There may be exceptions to prevent you from doing a "replace all", but once you are comfortable with it, that may be possible. And of course the "calibre1" bit can change even within one book.

Depending on the book, you may also have paragraph errors where a paragraph ends with a , or a ; or a : or a —. You get the idea. The above query can be easily modified to find these.
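One possible modification, sketched in Python: add the punctuation marks to the first character class. This is only my guess at a useful set of characters; adjust it to what your book actually contains. The sample text is made up:

```python
import re

# First class now also accepts , ; : and the long dash (written as \u2014 here;
# in the editor you can type the character itself inside the brackets).
pattern = re.compile(r'([a-z,;:\u2014])</p>\s+<p class="calibre1">([a-z])')

sample = (
    'he said,</p> <p class="calibre1">and left; then</p>\n'
    '<p class="calibre1">he returned.</p> <p class="calibre1">The end'
)

joined = pattern.sub(r'\1 \2', sample)
print(joined)
```

The last break in the sample follows a full stop and a capital letter, so it is not matched and stays a paragraph break.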

On your other point, finding changing numbers: \d finds a single digit, and \d+ finds one or more digits in a row. So to find all the <a id="p128"> sorts of things, search for
Code:
<a id="p\d+">
But be very careful when mass deleting anchors or IDs or any numbered things--you may wreck footnotes, TOC entries and so on.
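Before deleting any anchors, it can help to check which ones are actually linked to. A minimal sketch, assuming the ids follow the p-plus-number shape and the links look like href="file#p12". The file name part0003.html and the sample text are hypothetical, and in a real book the links may sit in other files (the TOC, for example), so this check would need to run over all of them:

```python
import re

# Hypothetical sample: two anchors, only one of them is the target of a link.
html = (
    '<p class="calibre1"><a id="p12"></a>Chapter one</p> '
    '<p class="calibre1"><a id="p128"></a>more text '
    '<a href="part0003.html#p12">see above</a></p>'
)

# Collect every anchor id and every id that some link points to.
anchor_ids = set(re.findall(r'<a id="(p\d+)">', html))
linked_ids = set(re.findall(r'href="[^"#]*#(p\d+)"', html))

# Anchors nobody links to are the only candidates for removal.
unreferenced = sorted(anchor_ids - linked_ids)
print(unreferenced)
```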

As you get these searches working, save them for use on the next book. You can also modify a saved search on the fly...for example to find paras ending in , or ; and so on, a basic search saved will do the job for all with a couple of keystrokes.

And do work on a copy of the book while learning this...you are learning to handle dynamite.
Old 03-05-2024, 02:56 PM   #4
uhi711
Member
uhi711 began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
Quote:
Originally Posted by theducks View Post
Code:
</p> <p class="calibre1"><a id="p\d+"></a>
\d+ matches 1 or more digits
Anchors should not cause issues. They should be invisible.
Thank you very much. I have to mention that the expression ....id="p\d+".... doesn't work, but ....id="p[\d]+".... works fine.
Old 03-05-2024, 03:04 PM   #5
uhi711
Member
uhi711 began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
@retiredbiker: Thank you so much. A constructive reply that put me on track.
With the same observation about the ....id="p\d+".... which doesn't work, but ....id="p[\d]+".... works fine.
Old 03-05-2024, 04:16 PM   #6
theducks
Well trained by Cats
 
Posts: 29,820
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by uhi711 View Post
Thank you very much. I have to mention that the expression ....id="p\d+".... doesn't work, but ....id="p[\d]+".... works fine.
Very puzzled
Square brackets have a meaning that should not be needed here, since there are only digits. FWIW p[0-9]+ should also be valid

I have used this for YEARS to clean onmouseovers
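For what it is worth, here is a quick check in plain Python (standard re module). I am assuming calibre's editor behaves the same for these three forms; I have not verified which engine or version is involved in the report above:

```python
import re

text = '</p> <p class="calibre1"><a id="p128"></a>'

# Three spellings of "p followed by one or more digits".
variants = [r'id="p\d+"', r'id="p[\d]+"', r'id="p[0-9]+"']

results = [re.search(v, text).group(0) for v in variants]
for variant, found in zip(variants, results):
    print(variant, '->', found)
```

All three find the same text here. The only difference I know of is that \d can also match non-ASCII digits in Python, which does not matter for ids like p128.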
Old 03-05-2024, 05:17 PM   #7
Quoth
the rook, bossing Never.
 
Posts: 11,173
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
or p[0-9]* is similar? regex wrecks my head. Perl programming especially.
Old 03-05-2024, 05:56 PM   #8
theducks
Well trained by Cats
 
Posts: 29,820
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by Quoth View Post
or p[0-9]* is similar? regex wrecks my head. Perl programming especially.
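[0-9] is any character 0 thru 9, and * repeats it zero or more times (+ needs at least one). You can leave out a digit this way
[0-46-9] no 5 (no comma between the ranges, or the comma itself will be matched too)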
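A small Python check of the two points here, using the standard re module and made-up test strings. Inside square brackets a comma is matched literally, so a class that skips 5 is written without one, and * also accepts zero digits where + does not:

```python
import re

# Digit class without 5, with and without a comma between the ranges.
no_five = re.findall(r'[0-46-9]', '4,5,6')
with_comma = re.findall(r'[0-4,6-9]', '4,5,6')
print(no_five)
print(with_comma)

# * allows zero digits, + requires at least one.
star_matches_bare_p = re.fullmatch(r'p[0-9]*', 'p') is not None
plus_matches_bare_p = re.fullmatch(r'p[0-9]+', 'p') is not None
print(star_matches_bare_p, plus_matches_bare_p)
```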
Old 03-05-2024, 06:15 PM   #9
uhi711
Member
uhi711 began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
Quote:
Originally Posted by theducks View Post
Very puzzled
Square brackets have a meaning that should not be needed here, since there are only digits. FWIW p[0-9]+ should also be valid

I have used this for YEARS to clean onmouseovers
I understand, but I can't explain why. When your suggestion didn't work, I asked my daughter, who uses Python at work, and she suggested using the square brackets. Suddenly everything worked. I use the latest version of calibre.