MobileRead Forums - View Single Post

DiapDealer · 08-14-2014, 07:38 AM

Quote:

Originally Posted by Papirus

Regarding the regex engine, I've noticed two main differences: \K (which I can work around using variable-length lookbehinds) and the conditional structure (?(condition)Then|Else); the latter is an important limitation compared with PCRE.

I don't really use if|then|else regex conditionals myself, but the regex module that calibre's editor uses certainly supports them. It's probably just a matter of getting the syntax right. For example:

Code:

(a)?b(?(1)c|d)

Matches both "bd" and "abc"
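If you want to check the conditional outside the editor, here is a minimal sketch. It assumes Python's standard re module, which accepts the same (?(1)yes|no) syntax for this pattern; calibre's editor uses the separate regex module, so treat this as an approximation rather than an exact reproduction of the editor's behaviour. The helper name full_match is made up for the illustration.

```python
import re

# Same pattern as above: group 1 is the optional "a". The conditional
# requires "c" if group 1 took part in the match, otherwise "d".
pattern = re.compile(r"(a)?b(?(1)c|d)")

def full_match(text):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return pattern.fullmatch(text) is not None

results = {s: full_match(s) for s in ["bd", "abc", "abd", "bc"]}
print(results)
```

fullmatch is used on purpose: a plain search would still find "bd" inside "abd" by starting at the second character.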

Quote:

Properties (\p) are well supported, but \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter) only work correctly if the "case sensitive" option is checked (I don't know if this is the expected behaviour). I've tried (?f) with no success.

This sort of hung me up for a bit too, but when you think about it, searching specifically for lower- or upper-case letters is pretty much the definition of "case sensitivity," is it not? Hence the box needs to be checked. If you need case insensitivity, uncheck the box and use \p{L} in your expression instead.
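The same effect can be sketched with an ordinary character class. This is only an analogy: Python's standard re module does not support \p{...}, so [A-Z] stands in for \p{Lu} here, and the IGNORECASE flag stands in for leaving the "case sensitive" box unchecked.

```python
import re

sample = "aB"
upper = r"[A-Z]"  # stand-in for \p{Lu}; stdlib re has no \p{...} support

# Case-sensitive search: only the real uppercase letter matches.
case_sensitive = re.findall(upper, sample)

# Case-insensitive search: the "uppercase" class now matches lowercase too.
case_insensitive = re.findall(upper, sample, re.IGNORECASE)

print(case_sensitive, case_insensitive)
```

Once case is ignored, an "uppercase only" class has nothing left to distinguish, which is why \p{L} is the sensible choice in that mode.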

Quote:

On another note, scanned text sometimes includes &shy; (soft hyphen). This is a hidden hyphen that you can't see (at least in Sigil), and the only way to remove it is a regex search for \xAD. Here the problem is not with the regex engine but with the file preview panel, where it appears as a dot. The same happens with &#8203; (zero-width space), which is also shown as a dot. It's another hidden character, used inside very, very long words so the line can break and the text doesn't run past the screen boundary on readers. Here the \x{200B} regex is not allowed.

The syntax is different for matching specific Unicode codepoints. Instead of \x{FFFF}, just use \uFFFF. So looking for your &shy; character becomes \u00AD and the search for the zero-width space becomes \u200B. PCRE was really the odd man out with the \x{FFFF} sequence.
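A quick way to try the \uFFFF form outside the editor is sketched below. It assumes Python's standard re module, which accepts \uXXXX escapes in string patterns; the sample text is made up for the illustration.

```python
import re

# Made-up sample containing a soft hyphen (U+00AD) and a zero-width space (U+200B).
text = "hy\u00adphen\u200bated"

# Raw string: the regex engine itself interprets the \uXXXX escapes.
invisible = re.compile(r"[\u00AD\u200B]")

found = len(invisible.findall(text))
cleaned = invisible.sub("", text)
print(found, cleaned)
```

In the editor's search box, the equivalent would be searching for [\u00AD\u200B] and replacing with nothing.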