03-05-2024, 12:22 PM   #3
retiredbiker
When you go to convert a pdf, anything can happen, because a pdf can hold just about anything. If you are getting decent text results, it means the pdf has some text that somebody put there, probably from running Optical Character Recognition (OCR) on it. And you are lucky. Many pdf files won't convert at all or will give horrible results. See the sticky post here: https://www.mobileread.com/forums/sh...d.php?t=118605

You will get all sorts of repeating glitches in these conversions, so you will need a variety of Regex search and replace strings to deal with them. There is a really good Regex tutorial specifically for Calibre in the Manual (https://manual.calibre-ebook.com/reg...regexptutorial). You will need to be flexible, so it is worth learning some simple Regex, but this sort of editing will mostly use very simple searches.

Just to get you started, that problem with paragraphs ending in the wrong place, or each line being a paragraph, happens because pdf has no concept of a paragraph; it is more like a picture of the page. Turn on heuristic processing during conversion and it will probably fix many or most of these.

Even so, you will probably still find some paragraphs ending with a lower case letter and the next starting with one:
Code:
...he went</p> <p class="calibre1">to the store...
As you say, removing the </p> <p class="calibre1"> will fix this, but if you remove every paragraph end-start, you will ruin the book. So look for a lower case letter, end para, maybe some space, and a start para with a first lower case letter:
Code:
([a-z])</p>\s+<p class="calibre1">([a-z])
Explanation: () captures what regex finds. ([a-z]) finds a lower case letter and remembers the letter. </p> and <p class="calibre1"> are just constants in the search. \s is any whitespace character (space, tab, line break), and \s+ is one or more of them. The second ([a-z]) remembers the second lower case letter.
You want to replace this with
Code:
\1 \2
where \1 is the letter remembered from the first ([a-z]) and \2 is the letter from the second ([a-z]). Note the space between them!
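If you want to try the search outside the editor first, here is a minimal Python sketch of the same find/replace. The pattern and replacement are the ones above; the function name and the sample text are made up for illustration.

```python
import re

# Pattern and replacement exactly as given in the post.
PATTERN = re.compile(r'([a-z])</p>\s+<p class="calibre1">([a-z])')
REPLACEMENT = r'\1 \2'  # note the space between the two remembered letters


def join_broken_paragraphs(html):
    """Merge paragraphs that were split in mid-sentence (hypothetical helper)."""
    return PATTERN.sub(REPLACEMENT, html)


# Made-up sample: the first break is a glitch, the second is a real paragraph end.
sample = (
    '<p class="calibre1">...he went</p> <p class="calibre1">to the store.</p>\n'
    '<p class="calibre1">A new paragraph starts here.</p>'
)
print(join_broken_paragraphs(sample))
```

Only the first break is joined; the paragraph that ends with a full stop and is followed by a capital letter is left alone.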

So set this up and carefully go through it one find/replace at a time to make sure it is working as expected. There may be exceptions that prevent you from doing a "replace all", but once you are comfortable with it, that may be possible. And of course the "calibre1" bit can change even within one book.
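One possible way to cope with a changing class name, and to look at every hit before replacing anything, is sketched below in Python. The [^"]* part (any class value) is my assumption about how the classes vary, and preview_matches is a hypothetical helper, not something Calibre provides.

```python
import re

# Assumption: only the class value varies (calibre1, calibre3, ...),
# so [^"]* accepts any class name between the quotes.
PATTERN = re.compile(r'([a-z])</p>\s+<p class="[^"]*">([a-z])')


def preview_matches(html, context=15):
    """Return each hit with a little surrounding text, for checking by eye."""
    previews = []
    for match in PATTERN.finditer(html):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(html))
        previews.append(html[start:end])
    return previews


# Made-up sample with two different class names.
sample = (
    '<p class="calibre1">he went</p>\n<p class="calibre3">to the store.</p>'
    '<p class="calibre1">Next one.</p>'
)
for snippet in preview_matches(sample):
    print(repr(snippet))
```

Nothing is changed here; the point is only to see what a "replace all" would touch before you commit to it.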

Depending on the book, you may also have paragraph errors where a paragraph ends with a , or a ; or a : or a —. You get the idea. The above query can be easily modified to find these.
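As a rough sketch of one such modification, in Python: swap the first ([a-z]) for a set of punctuation marks and drop the second group. The exact set of marks and the helper name are assumptions for illustration. A paragraph ending in a colon is sometimes correct (before a list or a quote, say), so step through these hits too.

```python
import re

# Hypothetical variation of the earlier search: the paragraph ends in , ; : or —
# instead of a lower case letter. Only the punctuation mark is remembered.
PATTERN = re.compile(r'([,;:—])</p>\s+<p class="calibre1">')
REPLACEMENT = r'\1 '  # keep the punctuation, then a single space


def join_after_punctuation(html):
    """Merge a paragraph that stops at mid-sentence punctuation with the next one."""
    return PATTERN.sub(REPLACEMENT, html)


# Made-up sample text.
sample = '<p class="calibre1">He bought eggs,</p>\n<p class="calibre1">milk and bread.</p>'
print(join_after_punctuation(sample))
```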

On your other point, finding changing numbers: \d finds a digit, and \d+ finds one or more digits in a row. So to find all the <a id="p128"> sorts of things, search for
Code:
<a id="p\d+">
But be very careful when mass deleting anchors or IDs or any numbered things--you may wreck footnotes, TOC entries and so on.
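One way to check before deleting is to list which of those ids are actually linked to. The Python sketch below does that for a single string of HTML. The link pattern (href="...#p128") is an assumption about how the book's internal links look, and in a real book the links may sit in other files (the TOC, a footnotes file), so you would need to feed it the text of all of them.

```python
import re

# The search from the post, with brackets added to remember the id itself.
ANCHOR = re.compile(r'<a id="(p\d+)">')
# Assumption: internal links look like href="#p128" or href="somefile.html#p128".
LINK = re.compile(r'href="[^"#]*#(p\d+)"')


def split_anchor_ids(html):
    """Return (ids nothing links to, ids something links to), each sorted."""
    defined = set(ANCHOR.findall(html))
    linked = set(LINK.findall(html))
    return sorted(defined - linked), sorted(defined & linked)


# Made-up sample: p128 is unused, p129 is the target of a link.
sample = '<a id="p128"></a>Text <a id="p129"></a>more <a href="#p129">see page 129</a>'
unused, used = split_anchor_ids(sample)
print("probably safe to delete:", unused)
print("leave alone:", used)
```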

As you get these searches working, save them for use on the next book. You can also modify a saved search on the fly...for example to find paras ending in , or ; and so on, a basic search saved will do the job for all with a couple of keystrokes.

And do work on a copy of the book while learning this...you are learning to handle dynamite.