MobileRead Forums - View Single Post

retiredbiker · 03-05-2024, 12:22 PM

When you go to convert a pdf, anything can happen, because a pdf can hold just about anything. If you are getting decent text results, it means the pdf has some text that somebody put there, probably from running Optical Character Recognition (OCR) on it. And you are lucky. Many pdf files won't convert at all or will give horrible results. See the sticky post here: https://www.mobileread.com/forums/sh...d.php?t=118605

You will get all sorts of repeating glitches in these conversions, so you will need a variety of Regex search and replace strings to deal with them. There is a really good Regex tutorial specifically for Calibre in the Manual: https://manual.calibre-ebook.com/reg...regexptutorial You will need to be flexible, so you really need to learn some simple Regex, but this sort of editing will mostly use very simple searches.

Just to get you started: the problem with paragraphs ending in the wrong place, or each line being its own paragraph, happens because pdf has no concept of a paragraph; it is more like a picture of the page. Turn on heuristic processing during conversion and it will probably fix many or most of these.

So you will probably still find some paragraphs ending with a lower case letter and the next starting with one:

Code:

...he went</p> <p class="calibre1">to the store...

As you say, removing the </p> <p class="calibre1"> will fix this, but if you remove every paragraph end-start, you will ruin the book. So look for a lower case letter, end para, maybe some space, and a start para with a first lower case letter:

Code:

([a-z])</p>\s+<p class="calibre1">([a-z])

Explanation: () traps what regex finds. ([a-z]) finds a lower case letter and remembers the letter. </p> and <p class="calibre1"> are just constants in the search. \s is any whitespace character (space, tab, or line break), and \s+ is one or more of them; use \s* if the two tags may have nothing between them. The second ([a-z]) remembers the second lower case letter.
You want to replace this with

Code:

\1 \2

where \1 is the letter remembered from the first ([a-z]) and \2 is the letter from the second ([a-z]). Note the space between them!
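
If you want to try the search and replace above outside the editor first, here is a minimal Python sketch using the standard re module. The sample strings and the function name join_split_paragraphs are made up for illustration, and this is only a way to experiment with the pattern, not what Calibre runs internally.

```python
import re

# Pattern from the post: lower case letter, paragraph end, whitespace,
# paragraph start, lower case letter.
PATTERN = re.compile(r'([a-z])</p>\s+<p class="calibre1">([a-z])')

def join_split_paragraphs(html: str) -> str:
    """Replace each match with the two remembered letters and a space."""
    return PATTERN.sub(r'\1 \2', html)

# Illustrative sample: one paragraph wrongly split in the middle of a sentence.
sample = '<p class="calibre1">he went</p> <p class="calibre1">to the store.</p>'
print(join_split_paragraphs(sample))

# A genuine paragraph break (full stop, then a capital) is left alone.
keep = '<p class="calibre1">He left.</p> <p class="calibre1">She stayed.</p>'
print(join_split_paragraphs(keep) == keep)
```

The second sample shows why the two [a-z] groups matter: they are what stops the search from swallowing every paragraph break in the book.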

So set this up and carefully go into it one find/replace at a time to make sure it is working as expected. There may be exceptions to prevent you from doing a "replace all", but once you are comfortable with it, that may be possible. And of course the "calibre1" bit can change even within one book.
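
The "one at a time" check can also be rehearsed in a small Python sketch that lists each match with some context and shows what the replacement would produce, without changing anything. Two assumptions here: the class name is loosened to class="[^"]*" so it matches calibre1, calibre2 and so on, and the function name preview_joins and the sample text are invented for the example.

```python
import re

# Assumption: the class name varies within the book, so accept any
# attribute value between the quotes instead of a fixed "calibre1".
PATTERN = re.compile(r'([a-z])</p>\s+<p class="[^"]*">([a-z])')

def preview_joins(html: str, context: int = 15):
    """Return (before, after) snippets for each match; html is not modified."""
    previews = []
    for m in PATTERN.finditer(html):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(html))
        before = html[start:end]
        # m.expand applies the \1 \2 replacement template to this one match.
        after = html[start:m.start()] + m.expand(r'\1 \2') + html[m.end():end]
        previews.append((before, after))
    return previews

# Illustrative sample: the split paragraph uses two different class names.
sample = ('<p class="calibre1">he went</p>\n'
          '<p class="calibre3">to the store.</p>')
for before, after in preview_joins(sample):
    print(repr(before), '->', repr(after))
```

Reading the before/after pairs is the same habit as stepping through Find and Replace by hand: you see each change before you commit to it.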

Depending on the book, you may also have paragraph errors where a paragraph ends with a , or a ; or a : or a —. You get the idea. The above query can be easily modified to find these.
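
As one possible modification, the first group can become a character class of trailing punctuation instead of [a-z]. The sketch below is an assumption about what such a variant might look like: it still requires the next paragraph to start with a lower case letter, and the function name and sample text are invented.

```python
import re

# Hypothetical variant: the first group now matches a comma, semicolon,
# colon, or dash at the end of the paragraph instead of a lower case letter.
PUNCT_PATTERN = re.compile(r'([,;:—])</p>\s+<p class="calibre1">([a-z])')

def join_after_punctuation(html: str) -> str:
    """Join paragraphs split after punctuation, keeping the punctuation mark."""
    return PUNCT_PATTERN.sub(r'\1 \2', html)

sample = '<p class="calibre1">eggs, milk,</p> <p class="calibre1">and bread.</p>'
print(join_after_punctuation(sample))
```

Check these matches even more carefully than the lower case ones, since a colon at the end of a paragraph is sometimes a real paragraph break.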

On your other point, finding changing numbers: \d finds a digit, and \d+ finds any string of one or more digits. So to find all the <a id="p128"> sorts of things, search for

Code:

<a id="p\d+">

But be very careful when mass deleting anchors or IDs or any numbered things--you may wreck footnotes, TOC entries and so on.
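
One way to be careful is to check which anchors are actually linked to before deleting any. The Python sketch below sorts the ids found by the search above into referenced and unreferenced. It assumes links look like href="...#p128" and only looks at the text you pass in; in a real book the links may sit in other files (the TOC, for instance), so treat the result as a hint. The names split_anchor_ids, ANCHOR and LINK are made up for the example.

```python
import re

# Same search as in the post, with a group added to capture the id itself.
ANCHOR = re.compile(r'<a id="(p\d+)">')
# Assumption: a link to an anchor looks like href="...#p128".
LINK = re.compile(r'href="[^"]*#(p\d+)"')

def split_anchor_ids(html: str):
    """Return (referenced, unreferenced) lists of anchor ids found in html."""
    ids = ANCHOR.findall(html)
    targets = set(LINK.findall(html))
    referenced = [i for i in ids if i in targets]
    unreferenced = [i for i in ids if i not in targets]
    return referenced, unreferenced

# Illustrative sample: p128 is linked to, p127 is not.
sample = ('<a href="#p128">see page 128</a>'
          '<a id="p127"></a>text<a id="p128"></a>more')
print(split_anchor_ids(sample))
```

Only the unreferenced ids are candidates for deletion, and even then it is worth checking the other files in the book first.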

As you get these searches working, save them for use on the next book. You can also modify a saved search on the fly...for example to find paras ending in , or ; and so on, a basic search saved will do the job for all with a couple of keystrokes.

And do work on a copy of the book while learning this...you are learning to handle dynamite.
