03-05-2024, 10:07 AM | #1 |
Member
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
|
Help with regexp
Hi,
Can somebody help me build a regexp for the following search and (maybe) replace problem?

I have some PDF books that I've converted to AZW3 to use on my Kindle Oasis. The conversion works fine, but the converted files have some problems that I think I have to correct manually: the conversion module splits some sentences into several parts. I can see this in the HTML files: two strings, one constant and one dynamic, sometimes appear in the middle of a sentence. Usually, replacing these strings with a blank fixes the split. The problem is that these strings also appear in many places in the HTML files where they are needed, so I can't replace all of them with blanks. That's why I think I have to do the replace part manually, but I would like to find all occurrences of both strings in a single search and decide at each position whether the replacement is needed.

I'm new to Calibre and I don't know Python, but I would like to build a search expression to find and, if needed, replace the following strings:

</p> <p class="calibre1">
(this string is not a problem to find and replace because it does not change)

OR

</p> <p class="calibre1"><a id="p128"></a>
(this string is dynamic: the number after the "p", and its number of digits, change)

So I would like to use this expression in the calibre editor's Find and Replace, probably in Regex mode, to find all occurrences of these strings and, where appropriate, replace them with a space.

Any help will be much appreciated. Thank you in advance, Daniel |
03-05-2024, 12:06 PM | #2 |
Well trained by Cats
Posts: 29,985
Karma: 56143930
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Code:
</p> <p class="calibre1"><a id="p\d+"></a>
Anchors should not cause issues. They should be invisible. OTOH a link to a missing anchor can be problematic (besides not working) |
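For anyone who wants to try this outside the editor first, here is a minimal sketch in plain Python. It assumes the standard `re` module treats these simple patterns the same way the calibre editor's Regex mode does, and the sample HTML is made up for illustration. Wrapping the anchor part in an optional group lets one search find both the constant string and the dynamic one:

```python
import re

# The anchor part is wrapped in an optional non-capturing group (?: ... )?
# so the same pattern matches with or without <a id="pNNN"></a>.
PATTERN = re.compile(r'</p> <p class="calibre1">(?:<a id="p\d+"></a>)?')

# Made-up sample text, not taken from a real book.
sample = (
    '<p class="calibre1">He went</p> <p class="calibre1">to the store.</p> '
    '<p class="calibre1">She stayed</p> <p class="calibre1"><a id="p128"></a>at home.</p>'
)

# Collect every occurrence so each one can be reviewed individually.
matches = [m.group(0) for m in PATTERN.finditer(sample)]
for found in matches:
    print(repr(found))
```

Note that the legitimate paragraph break after "to the store." is found as well, which is why stepping through the hits one at a time is safer than Replace All.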
03-05-2024, 12:22 PM | #3 |
Addict
Posts: 390
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
When you go to convert a pdf, anything can happen, because a pdf can hold just about anything. If you are getting decent text results, it means the pdf has some text that somebody put there, probably from running Optical Character Recognition (OCR) on it. And you are lucky. Many pdf files won't convert at all or will give horrible results. See the sticky post here: https://www.mobileread.com/forums/sh...d.php?t=118605
You will get all sorts of repeating glitches in these conversions, so you will need a variety of Regex search and replace strings to deal with them. There is a really good Regex tutorial specifically for Calibre in the Manual: https://manual.calibre-ebook.com/reg...regexptutorial
You will need to be flexible, so you really need to learn some simple Regex, but this sort of editing will mostly use very simple searches.
Just to get you started: that problem with paragraphs ending in the wrong place, or each line being a paragraph, arises because pdf has no concept of a paragraph; it is more like a picture of the page. Turn on heuristic processing during conversion and it will probably fix many or most of these. You will probably still find some paragraphs ending with a lower case letter and the next starting with one:
Code:
...he went</p> <p class="calibre1">to the store...
To find these, search for
Code:
([a-z])</p>\s+<p class="calibre1">([a-z])
You want to replace this with
Code:
\1 \2
So set this up and carefully step through it one find/replace at a time to make sure it is working as expected. There may be exceptions that prevent you from doing a "replace all", but once you are comfortable with it, that may be possible. And of course the "calibre1" bit can change even within one book.
Depending on the book, you may also have paragraph errors where a paragraph ends with a , or a ; or a : or a —. You get the idea. The above query can be easily modified to find these.
On your other point, finding changing numbers: \d finds a digit, and \d+ finds any string of one or more digits. So to find all the <a id="p128"> sorts of things, search for
Code:
<a id="p\d+">
As you get these searches working, save them for use on the next book. You can also modify a saved search on the fly...for example, to find paras ending in , or ; and so on, one saved basic search will do the job for all of them with a couple of keystrokes.
And do work on a copy of the book while learning this...you are learning to handle dynamite. |
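The paragraph-joining search and replace above can be sketched in plain Python as follows. This assumes the standard `re` module behaves like the editor's Regex mode for these patterns. The sample strings and the `SPLIT_PUNCT` variant are illustrative assumptions, not something taken from a real book:

```python
import re

# Group 1 captures the last letter of the broken paragraph,
# group 2 the first letter of the continuation.
SPLIT = re.compile(r'([a-z])</p>\s+<p class="calibre1">([a-z])')

# Hypothetical variant: also accept a comma, semicolon or colon before the break.
SPLIT_PUNCT = re.compile(r'([a-z,;:])</p>\s+<p class="calibre1">([a-z])')

html = (
    '<p class="calibre1">...he went</p> <p class="calibre1">to the store.</p> '
    '<p class="calibre1">Then he came home.</p>'
)

# \1 and \2 put the two captured letters back, separated by a space.
fixed = SPLIT.sub(r'\1 \2', html)
print(fixed)

html2 = '<p class="calibre1">After that,</p> <p class="calibre1">he left.</p>'
fixed2 = SPLIT_PUNCT.sub(r'\1 \2', html2)
print(fixed2)
```

The break after "to the store." is left alone because the paragraph ends with a full stop and the next one starts with a capital, so neither pattern matches there.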
03-05-2024, 02:56 PM | #4 |
Member
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
|
|
03-05-2024, 03:04 PM | #5 |
Member
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
|
@retiredbiker: Thank you so much. A constructive reply that put me on track.
With the same observation about the ....id="p\d+".... which doesn't work, but ....id="p[\d]+".... works fine. |
03-05-2024, 04:16 PM | #6 | |
Well trained by Cats
Posts: 29,985
Karma: 56143930
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Square brackets have a meaning that should not be needed here, since there are only digits. FWIW p[0-9]+ should also be valid. I have used this for YEARS to clean onmouseovers. |
|
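A quick check in plain Python, assuming the standard `re` module and a made-up sample string, suggests the three spellings discussed here find the same thing for ordinary ASCII digits. It does not explain why one form failed in the editor; it only shows that the forms agree in plain Python:

```python
import re

# Made-up sample containing the dynamic anchor.
text = '</p> <p class="calibre1"><a id="p128"></a>'

# \d+, [\d]+ and [0-9]+ each mean "one or more digits" here.
patterns = [r'<a id="p\d+">', r'<a id="p[\d]+">', r'<a id="p[0-9]+">']

results = [re.search(p, text).group(0) for p in patterns]
for pattern, found in zip(patterns, results):
    print(pattern, '->', found)
```

One small difference: for Python text strings, \d also matches non-ASCII Unicode digits, while [0-9] is limited to the ten ASCII digits. For ids like p128 that makes no difference.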
03-05-2024, 05:17 PM | #7 |
the rook, bossing Never.
Posts: 11,699
Karma: 87663461
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
or p[0-9]* is similar? regex wrecks my head. Perl programming especially.
|
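On whether * is similar to +: the difference is that * also accepts zero digits. A small sketch in plain Python, assuming the standard `re` module and a made-up sample that includes an id with no number at all:

```python
import re

# Made-up sample: one id without digits, two with digits.
sample = '<a id="p"></a> <a id="p7"></a> <a id="p128"></a>'

# + requires at least one digit after the p.
plus = re.findall(r'id="p[0-9]+"', sample)

# * allows zero digits, so the bare id="p" is matched too.
star = re.findall(r'id="p[0-9]*"', sample)

print(plus)
print(star)
```

If every anchor in the book carries a number, both forms find the same hits, but + is the safer choice when a bare "p" id should be left alone.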
03-05-2024, 05:56 PM | #8 |
Well trained by Cats
Posts: 29,985
Karma: 56143930
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
03-05-2024, 06:15 PM | #9 |
Member
Posts: 22
Karma: 10
Join Date: Mar 2024
Device: Kindle Oasis
|
I understand, but I can't explain why. When your suggestion didn't work, I asked my daughter, who uses Python at work, and she suggested using the square brackets. Suddenly everything worked. I use the latest version of calibre.
|