MobileRead Forums - View Single Post - Delete paragraphs in scanned books (S & R with regexes)

Tex2002ans · 04-27-2016, 02:45 PM

Quote:

Originally Posted by chaot

Code comparison, what kind of tool will do the job best? I got Beyond Compare, Meld and KDiff3.

I personally use Beyond Compare (I found it more accurate than the other programs I tested).

There is also Calibre's built-in compare: "File" -> "Compare to another book".

Quote:

Originally Posted by chaot

Probably you mean key parts of the code!?

The code (HTML tags) or the text (words/sentences). Both of these might get broken if you made a mistake when typing your Regex!

You might have made a typo and accidentally changed:

Code:

<p>This is a sample sentence. 192</p>

<p>This is a sample sentence too.</p>

into:

Code:

<p>This is a sample sentence.

This is a sample sentence too.</p>

or:

Code:

<p>This is a sampleThis is a sample sentence too.</p>

Sometimes it is very hard to spot the error, and you don't see it until hours later when it is too late (you already made hundreds of other changes and corrections).

I just did this a few days ago... I accidentally typed an extra period in my Regex, and the second character of some words was deleted ("Then" -> "Ten", "Suing" -> "Sing"). I didn't notice until later in the day that I had made the mistake, and I had to manually correct many of the words.
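One way to catch this kind of slip early is to diff the text before and after each Regex, which is roughly what Beyond Compare or Calibre's compare do with a nicer interface. A minimal Python sketch using the sample paragraphs from above; both patterns and the `show_diff` helper are made up for illustration, and Python's `re` only approximates the engine inside Sigil/Calibre:

```python
import difflib
import re

before = (
    "<p>This is a sample sentence. 192</p>\n"
    "\n"
    "<p>This is a sample sentence too.</p>\n"
)

# Hypothetical "intended" Regex: remove a page number sitting right before </p>.
intended = re.sub(r"\s*[0-9]+(?=</p>)", "", before)

# Hypothetical slip: the pattern also swallows the closing and opening tags,
# so the two paragraphs are merged into one.
typo = re.sub(r"\s*[0-9]+</p>\s*<p>", "\n\n", before)


def show_diff(old, new):
    """Return a unified diff so every changed line is visible at a glance."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="before",
            tofile="after",
        )
    )


print(show_diff(before, intended))
print(show_diff(before, typo))
```

In the second diff, the line for the second paragraph shows up as changed even though it had no page number, which is the hint that the Regex touched more than it should have.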

Quote:

Originally Posted by chaot

Note: Adding an Example 3a in Post #1 (same book as Example 3)

Nothing is special about Example 3a.

Search: [0-9]+\s+

All that was added was the Blue code.

Note: If it were up to me, I would strip out all the crap/useless code FIRST... then I could treat Example 3a just like Example 3.
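For reference, the Search above reads as "one or more digits, followed by one or more whitespace characters". A quick Python illustration of what that matches; the sample strings are made up (Example 3a itself is not reproduced here), and Python's `re` module is only a stand-in for the editor's engine:

```python
import re

# [0-9]+ = one or more digits
# \s+    = one or more whitespace characters (spaces, tabs, line breaks)
pattern = re.compile(r"[0-9]+\s+")

# Made-up sample, not the real Example 3a.
sample = "end of one page 192\nstart of the next page"
cleaned = pattern.sub("", sample)
print(cleaned)

# The same pattern also matches ordinary numbers in running text,
# so it is worth checking every hit before doing a Replace All.
hits = pattern.findall("In 1984 he wrote 3 books.")
print(hits)
```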

Quote:

Originally Posted by chaot

The regex here, however, should also eliminate a blank space along with [306].
Don't be angry, I'm fairly sure the solution (for eliminating the space) can be
found somewhere - I would just like a quick little sense of achievement right now.

I personally just run "Prettify Code" and that fixes the multi-space issues.

You could also just add spaces in the Regex to match your specific book.

Like Example #5 can turn into:

Search: *SPACE*\[[0-9]+\]*SPACE*
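A small Python sketch of that Search, with a literal space on each side of the bracketed number. The sample sentence is made up, and the Replace value (a single space) is my assumption, since Example #5 is not repeated in this post:

```python
import re

# Made-up sample with a bracketed page marker in the middle of a sentence.
sample = "the end of one page [306] and the start of the next"

# A literal space on each side of \[[0-9]+\], as in the Search above.
pattern = re.compile(r" \[[0-9]+\] ")

# Assumption: replace with ONE space so the neighbouring words stay separated.
cleaned = pattern.sub(" ", sample)

# Replacing with nothing removes both spaces and glues the words together.
glued = pattern.sub("", sample)

print(cleaned)
print(glued)
```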

Also, you can just do a normal Search/Replace after everything to manually fix the "lots of spaces in a row" problem:

Search: *SPACE**SPACE*
Replace: *SPACE*
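One thing to watch with the plain two-spaces-to-one replace: a single pass can still leave doubles behind when there were three or more spaces in a row, so it may need to be repeated. A short Python sketch of the difference; the sample text is made up, and the ` {2,}` Regex is an alternative I am suggesting here, not something from the earlier examples:

```python
import re

# Made-up sample: runs of 2, 3 and 4 spaces.
sample = "too" + " " * 2 + "many" + " " * 3 + "spaces" + " " * 4 + "here"

# Plain replace of two spaces with one: one pass is not always enough.
one_pass = sample.replace(" " * 2, " ")

# Repeating the plain replace until no double space is left finishes the job.
repeated = sample
while " " * 2 in repeated:
    repeated = repeated.replace(" " * 2, " ")

# Regex alternative: two or more spaces in a row become one, in a single pass.
collapsed = re.sub(r" {2,}", " ", sample)

print(repr(one_pass))
print(repr(repeated))
print(repr(collapsed))
```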

Quote:

Originally Posted by chaot

What's the difference in S&R between setting Regex and Regex-Function?

https://manual.calibre-ebook.com/function_mode.html

I have never used it before... but Regex-Function seems to let you use Python code for more powerful Search/Replace.
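Going by the manual page linked above, function mode calls a Python function named `replace` once per match and uses whatever it returns as the replacement. A rough sketch of how that could look for the bracketed page numbers; the `replace` signature follows the manual, but the `SEARCH` pattern, the function body and the `run` helper are my own untested-in-Calibre illustration:

```python
import re

# Assumed Search expression to pair with the function (not from the posts above).
SEARCH = r"(\s*)\[[0-9]+\](\s*)"


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Keep one space when the marker had whitespace on BOTH sides
    # (it sat between two words); otherwise remove it completely.
    if match.group(1) and match.group(2):
        return " "
    return ""


def run(text):
    """Try the function outside Calibre with plain re.sub; extra arguments are dummies."""
    return re.sub(
        SEARCH,
        lambda m: replace(m, 0, "sample.html", None, None, {}, None),
        text,
    )


print(run("one page [306] next page"))
print(run("<p>Last words. [306]</p>"))
```

The point is only that the replacement can depend on what was matched, which a fixed Replace string cannot do.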

Quote:

Originally Posted by chaot

Stupid question!? Do these regexes also work in calibre?

Yes, I believe Sigil/Calibre use the same Regex Engine. At least, all of the Regexes I have tested work in both Sigil and Calibre.

Quote:

Originally Posted by chaot

It might be worth creating something like a (regex examples) library out of all these examples - you know, cataloged and without bla-bla.

I don't believe there is a collection like that. Once you really learn the basics (by reading regular-expressions.info), you can come up with all the Regex by yourself. That is what I do for the most part: I just create them on-the-fly as I need them, because each book's code comes with its own problems.

As you can see, a book might have:

a bold page number
an italic page number
a page number on its own line
a page number in the middle of text
a bold+italic page number
[###]
(###)
###
[...]

It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules!
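To picture "the same basic rules": every variant is the same digit core with a different wrapper around it. A small Python sketch; the `<b>`/`<i>` tags and the sample strings are hypothetical (a real book may use `<strong>`, `<em>`, spans with classes, and so on), so treat it as an illustration, not a ready-made list:

```python
import re

CORE = r"[0-9]+"  # the part every variant shares: one or more digits

# Each variant is just CORE plus its wrapper (brackets escaped with a backslash).
variants = {
    "plain": CORE,
    "square": r"\[" + CORE + r"\]",
    "round": r"\(" + CORE + r"\)",
    "bold": r"<b>" + CORE + r"</b>",
    "italic": r"<i>" + CORE + r"</i>",
}

# Made-up samples, one per variant.
samples = {
    "plain": "192",
    "square": "[192]",
    "round": "(192)",
    "bold": "<b>192</b>",
    "italic": "<i>192</i>",
}

# Check that each pattern matches its own sample completely.
results = {
    name: bool(re.fullmatch(pattern, samples[name]))
    for name, pattern in variants.items()
}
print(results)
```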

I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from!