#1 |
Zealot
Posts: 143
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
|
Is this as it is supposed to be? (Regexp issue?)
Hi,
I just noticed something: if I do a search & replace on the author "Weiße, Björn" (made up for the example's sake) with the regexp
Code:
(\w+), (\w+)
and the replacement
Code:
|\1|\2|
I get
Code:
Weiß|e|Bj|örn
Thanxx, Mixx
PS: On "Weisse, Bjoern" I do get "|Weisse|Bjoern|", of course. |
#2 |
creator of calibre
Posts: 45,176
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use \S+ instead
|
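A minimal sketch of the \S+ suggestion in plain Python 3. This is not calibre's own search & replace code, which may behave differently. The sample author is assumed from the result shown in post #1, and `pipe_fields` is a hypothetical helper name.

```python
import re

# Sample author assumed from post #1, written with escapes so the string
# is unambiguous: \u00df is "ß" and \u00f6 is "ö".
AUTHOR = "Wei\u00dfe, Bj\u00f6rn"


def pipe_fields(author: str) -> str:
    """Wrap the two comma-separated name parts in pipes using \\S+."""
    # \S matches any non-whitespace character, so accented letters are
    # captured whatever flags are in effect. The first group backtracks
    # by one character so that the literal ", " can still match.
    return re.sub(r"(\S+), (\S+)", r"|\1|\2|", author)


print(pipe_fields(AUTHOR))  # |Weiße|Björn|
```

Keep in mind that \S is broader than \w: it also accepts punctuation such as apostrophes and hyphens, so "O'Brien, Flann" becomes "|O'Brien|Flann|".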
#3 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Does Calibre set the locale according to the language selection? If so, adding the "(?L)" flag might help, as well.
|
#4 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
No. Search and replace does not set any flags by default. Flags such as m, u, s... all must be set by the person writing the regex using (?...) as needed.
|
#5 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
That's not what I meant. In the manual, there are references to a locale that can be set in the context of a Python program. If Calibre sets that locale to match the user-selected language, then adding the flag "(?L)" to the expression should make "\w" include special characters like "ö" and "ß" in the case of the German locale.
|
#6 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
In order to support non-ASCII characters with \w+, you need to add the Unicode flag to the regex, (?u) IIRC. Kovid's solution of \S+ works across all regex implementations though, so there's no need to teach or confuse the user about flags that are Python-specific.
Locale might make a difference, but I'm not really sure on that point. The Unicode flag is generally what's used for this issue. |
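To make the flag difference concrete, here is a sketch in plain Python 3 (an assumption: the thread itself concerns the Python 2 era, where ASCII-only \w was the default). In Python 3, str patterns are Unicode-aware unless told otherwise, so the inline (?a) flag is used here to imitate the old behavior. The sample author is assumed from the result in post #1.

```python
import re

AUTHOR = "Wei\u00dfe, Bj\u00f6rn"  # "Weiße, Björn"
PATTERN = r"(\w+), (\w+)"
REPLACEMENT = r"|\1|\2|"

# (?a) restricts \w to [a-zA-Z0-9_]; this mimics a Python 2 pattern
# compiled without the UNICODE flag. Only "e, Bj" can match.
ascii_result = re.sub("(?a)" + PATTERN, REPLACEMENT, AUTHOR)

# (?u) makes \w Unicode-aware. For str patterns in Python 3 this is
# already the default, so the flag is redundant there but harmless.
unicode_result = re.sub("(?u)" + PATTERN, REPLACEMENT, AUTHOR)

print(ascii_result)    # Weiß|e|Bj|örn
print(unicode_result)  # |Weiße|Björn|
```

The ASCII-only run reproduces the odd result from the first post, because the match can only start at the "e" after "ß" and must stop before "ö".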
#7 |
Zealot
Posts: 143
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
|
But isn't the definition of \w an alphabetic character? Or is it an ASCII alphabetic character?
Depending on the sort order, ö is either within the set [a-z] (a..oö..z) or outside it (a..o..z..ö). But I thought that was set by the LOCALE, and I was delighted to be able to set Calibre (via a tweak) to a sort order other than plain ASCII. This was a major improvement for me. I'd expect Calibre/Python to read the LOCALE and interpret \w accordingly, unless this is a Python issue rather than a Calibre issue.
I understand Kovid's response, but strictly speaking non-whitespace (\S) is not the same as alphabetic (\w). Anyway, I did not want to start a religious discussion; I just wanted to point out that this is not the expected behavior outside of English and is therefore an opportunity to improve Calibre even further.
Thanxx, Mixx |
#8 | |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
|
|
#9 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Just to get it out of the way: I'm continuing this as an academic discussion; Kovid's solution of using \S+ seems to be the best way.
Before I answered, I tried the regex in plain Python. I found that using (?u) didn't work, and I couldn't properly test (?L) because I couldn't figure out how to set the locale within the five minutes or so I spent on the problem. That's why I was asking whether Calibre sets it in its code. |
#10 |
creator of calibre
Posts: 45,176
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The locale is set in calibre.startup. It is not based on the language chosen in the calibre interface, but on whatever Python thinks is the system's locale.
|
#11 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
So, using the L flag should work. I'll have to try it in Calibre tomorrow.
|
#12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
(?u) should have worked; I just double-checked the docs. Was this what you tried?
Code:
(?u)(\w+), (\w+)
I can't say that I'm a big fan of the locale option after thinking about it. Based on the Python regex docs it would work, but only for one locale: if you had authors with non-ASCII characters from other locales, it wouldn't work, and that is a common scenario for translated works.
Last edited by ldolse; 03-08-2011 at 06:57 PM. |
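The point about mixed-language libraries can be sketched in plain Python 3 as follows. The author names are illustrative samples only, `pipe_fields` is a hypothetical helper, and the NFC normalization step is an addition of this sketch rather than something discussed in the thread.

```python
import re
import unicodedata

# Illustrative sample authors in three different languages.
AUTHORS = ["Weiße, Björn", "Dvořák, Antonín", "Толстой, Лев"]


def pipe_fields(author: str) -> str:
    """Wrap both name parts in pipes using a Unicode-aware \\w."""
    # NFC composes accented letters into single code points where a
    # precomposed form exists, so \w does not stop at a separate
    # combining mark.
    normalized = unicodedata.normalize("NFC", author)
    return re.sub(r"(?u)(\w+), (\w+)", r"|\1|\2|", normalized)


for author in AUTHORS:
    print(pipe_fields(author))
```

One Unicode-aware pattern handles German, Czech and Russian names alike, whereas a locale-based pattern would be tied to whichever single locale is active.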
#13 |
Zealot
Posts: 143
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
|
|
#14 | ||
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
Quote:
|
||