Old 04-27-2016, 02:45 PM   #6
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by chaot View Post
Code comparison, what kind of tool will do the job best? I got Beyond Compare, Meld and KDiff3.
I personally use Beyond Compare (I found it more accurate than the other programs I tested).

There is also Calibre's built-in compare: "File" -> "Compare to another book".

Quote:
Originally Posted by chaot View Post
Probably you mean key parts of the code!?
The code (HTML tags) or the text (words/sentences). Either of these might get broken if you make a mistake when typing your Regex!

You might have made a typo and accidentally changed:

Code:
<p>This is a sample sentence. 192</p>

<p>This is a sample sentence too.</p>
into:

Code:
<p>This is a sample sentence.

This is a sample sentence too.</p>
or:

Code:
<p>This is a sampleThis is a sample sentence too.</p>
Sometimes it is very hard to spot the error, and you don't see it until hours later when it is too late (you already made hundreds of other changes and corrections).

I just did this a few days ago... I accidentally typed an extra period in my Regex, and the second character of many words was deleted ("Then" -> "Ten", "Suing" -> "Sing"). I didn't notice the mistake until later in the day, and I had to manually correct many of the words.
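
One cheap safety net is to try a Regex on a small sample first and check which words disappear. Here is a rough Python sketch of that idea. The faulty pattern below is made up for illustration (it is not the Regex from my actual mistake), and the helper names `visible_words` and `check_replace` are hypothetical:

```python
import re

def visible_words(html):
    """Return the words left after removing tags (a rough check, not a real HTML parser)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return text.split()

def check_replace(pattern, replacement, html):
    """Apply a Search/Replace and list the words that no longer appear afterwards."""
    result = re.sub(pattern, replacement, html)
    after = visible_words(result)
    lost = [word for word in visible_words(html) if word not in after]
    return result, lost

sample = "<p>Then he said. 192</p>\n\n<p>Suing was next.</p>"

# Intended Regex: remove only the page number at the end of a paragraph.
good_result, good_lost = check_replace(r" [0-9]+(?=</p>)", "", sample)

# Made-up faulty Regex: deletes the second character of every word.
bad_result, bad_lost = check_replace(r"\b(\w)\w", r"\1", sample)

print(good_lost)  # only the page number should be listed
print(bad_lost)   # real words show up here, so something is wrong
```

If anything other than the page numbers shows up in the "lost" list, the Regex needs another look before running it on the whole book.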

Quote:
Originally Posted by chaot View Post
Note: Adding an Example 3a in Post #1 (same book as Example 3)
Nothing is special about Example 3a.

Search: <b class="calibre3">[0-9]+</b></p>\s+<p class="calibre2">

All that was added was the Blue code.

Note: If it were up to me, I would strip out all the crap/useless code FIRST... then I could treat Example 3a just like Example 3.
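
To see what the Example 3a Search actually matches, here is a small Python sketch. The Search pattern is the one above; the sample HTML and the empty Replace are my assumptions (your book may need a space in the Replace instead):

```python
import re

# Search pattern from Example 3a, unchanged.
PATTERN = r'<b class="calibre3">[0-9]+</b></p>\s+<p class="calibre2">'

# Assumed sample: a sentence split across two paragraphs by a bold page number.
sample = (
    '<p class="calibre2">This sentence is split by a <b class="calibre3">192</b></p>\n'
    '\n'
    '<p class="calibre2">page number.</p>'
)

# Assumption: Replace is empty, because the sample already has a space before <b>.
joined = re.sub(PATTERN, "", sample)
print(joined)
```

The page number, the closing `</p>`, the blank line, and the next opening `<p class="calibre2">` are all matched at once, so the two halves end up in one paragraph.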

Quote:
Originally Posted by chaot View Post
The regex here, however, should eliminate with [306] also a blank space.
Don't be angry, I'm relatively sure the solution (for the elimination of a space) is
to find anywhere - only I would like a little sense of achievement quick and now.
I personally just run "Prettify Code" and that fixes the multi-space issues.

You could also just add spaces in the Regex to match your specific book.

For instance, Example #5 can turn into:

Search: *SPACE*\[[0-9]+\]*SPACE*

Also, you can just do a normal Search/Replace after everything to manually fix the "lots of spaces in a row" problem:

Search: *SPACE**SPACE*
Replace: *SPACE*
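
The same two steps, sketched in Python with real spaces in place of *SPACE*. The sample sentence is made up, and replacing the page number with a single space is my assumption:

```python
import re

text = "He opened the door [306] and walked  in.   Nobody was there."

# Step 1: *SPACE*\[[0-9]+\]*SPACE* with literal spaces around the brackets.
# Assumption: the whole match is replaced by one space.
no_numbers = re.sub(r" \[[0-9]+\] ", " ", text)

# Step 2: Search "two spaces" / Replace "one space", repeated until nothing changes.
cleaned = no_numbers
while "  " in cleaned:
    cleaned = cleaned.replace("  ", " ")

print(cleaned)
```

The loop matters: a single pass of "two spaces -> one space" can still leave two spaces behind when there were four or more in a row.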

Quote:
Originally Posted by chaot View Post
What's the different in S&R between setting Regex and Regex-Function?
https://manual.calibre-ebook.com/function_mode.html

I have never used it before... but Regex-Function seems to let you use Python code for more powerful Search/Replace.
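
Going by that manual page, a function-mode replacement is a Python function named `replace` that receives the match and returns the new text. Below is an untested-in-Calibre sketch: the "only delete numbers up to 999" rule and the `run_outside_calibre` helper are my own inventions, added so the function can be tried in plain Python:

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Hypothetical rule: treat [###] as a page number only if it is 999 or less,
    # and leave bigger bracketed numbers (like a year) alone.
    value = int(match.group(1))
    if value <= 999:
        return ""
    return match.group(0)

def run_outside_calibre(pattern, text):
    """Hypothetical helper: call replace() from plain Python with dummy arguments."""
    count = 0

    def wrapper(match):
        nonlocal count
        count += 1
        return replace(match, count, "sample.html", None, None, {}, None)

    return re.sub(pattern, wrapper, text)

result = run_outside_calibre(r"\[([0-9]+)\]", "First part.[306] See the 1998 edition [1998].")
print(result)
```

Inside Calibre you would only paste the `replace` function and put `\[([0-9]+)\]` in the Search box; the wrapper is just for testing.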

Quote:
Originally Posted by chaot View Post
Stupid question!? Are these regaxes also fit for calibre?
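Yes, for the most part. As far as I know they are not the exact same Regex engine (Sigil is built on PCRE, Calibre uses Python's regex module), but the basic syntax is the same. At least, all of the Regexes I have tested work in both Sigil and Calibre.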

Quote:
Originally Posted by chaot View Post
Would may be worth to create out of all these examples there something like a (regax examples) library - you know, cataloged and without bla-bla.
I don't believe there is a collection like that. Once you really learn the basics (by reading regular-expressions.info), you could really come up with all the Regex by yourself. That is what I do for the most part, I just create them on-the-fly as I need them... because each book's code comes with its own problems.

As you can see, a book might have:
  • a bold page number
  • an italic page number
  • a page number on its own line
  • a page number in the middle of text
  • a bold+italic page number
  • [###]
  • (###)
  • <b class="calibre#">###</b>
  • <b class="block#">###</b>
  • <span class="pagenumber">###</span>
  • <sup>###</sup>
  • [...]

It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules!
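
To show how little changes between those cases, here is a Python sketch where every variant is just "some wrapper + [0-9]+ + some wrapper". The pattern list and the `strip_page_numbers` name are illustrative only, and the sample line is made up:

```python
import re

NUM = r"[0-9]+"

# Same number in the middle, only the wrapper around it changes.
PAGE_NUMBER_PATTERNS = [
    r"\[" + NUM + r"\]",                             # [###]
    r"\(" + NUM + r"\)",                             # (###)
    r'<b class="calibre[0-9]+">' + NUM + r"</b>",    # <b class="calibre#">###</b>
    r'<span class="pagenumber">' + NUM + r"</span>",  # <span class="pagenumber">###</span>
    r"<sup>" + NUM + r"</sup>",                      # <sup>###</sup>
]

def strip_page_numbers(html, patterns=PAGE_NUMBER_PATTERNS):
    """Remove every match of each pattern, one pattern at a time."""
    for pattern in patterns:
        html = re.sub(pattern, "", html)
    return html

sample = (
    '<p>One[12] two(13) three<sup>14</sup> '
    'four<span class="pagenumber">15</span> five<b class="calibre3">16</b>.</p>'
)
print(strip_page_numbers(sample))
```

In a real book I would NOT run all of these blindly: (###) can be a year and <sup>###</sup> can be a footnote marker, so pick the one pattern that fits the book and check the results.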

I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from!

Last edited by Tex2002ans; 04-27-2016 at 02:53 PM.