MobileRead Forums > E-Book Software > Calibre

Old 03-08-2011, 12:33 PM   #1
Mixx
Zealot
Posts: 138
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
Is this as it is supposed to be? (Regexp issue?)

Hi,

I just noticed something: if I do a search&replace on the author

Björn, Weiße (made it up for the example's sake)

with the regexp

Code:
(\w+), (\w+)
and replace it with

Code:
|\1|\2|
I get
Code:
Weiß|e|Bj|örn
Is this as it should be? My LOCALE is currently German, so shouldn't the regexp \w match ö and ß? They seem to be treated as non-alphabetic characters.

Thanxx, Mixx

PS: On "Weisse, Bjoern" I do get "|Weisse|Bjoern|", of course.
Old 03-08-2011, 01:15 PM   #2
kovidgoyal
creator of calibre
Posts: 25,400
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use \S+ instead
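
A minimal sketch of this suggestion in plain Python 3 (not calibre's own search and replace code), using the author string and replacement template from the first post. The variable names are illustrative assumptions.

```python
import re

author = "Björn, Weiße"

# \S matches any non-whitespace character, so ö and ß are covered
# without setting any regex flags.
swapped = re.sub(r"(\S+), (\S+)", r"|\1|\2|", author)
print(swapped)  # |Björn|Weiße|
```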
Old 03-08-2011, 01:18 PM   #3
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Does Calibre set the locale according to the language selection? If so, adding the "(?L)" flag might help, as well.
Old 03-08-2011, 02:20 PM   #4
user_none
Sigil & calibre developer
Posts: 2,433
Karma: 950001
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by Manichean
Does Calibre set the locale according to the language selection? If so, adding the "(?L)" flag might help, as well.
No. Search and replace does not set any flags by default. Flags such as m, u, s... all must be set by the person writing the regex using (?...) as needed.
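
To illustrate what setting flags inside the expression with (?...) looks like, here is a small sketch in plain Python 3. The sample text and variable names are made up for the example; only the inline flag syntax comes from Python's re module.

```python
import re

text = "first line\nSecond Line"

# (?i): ignore case.
ci = re.findall(r"(?i)second", text)
print(ci)  # ['Second']

# (?m): ^ matches at the start of every line, not only the start of the string.
ml = re.findall(r"(?m)^\w+", text)
print(ml)  # ['first', 'Second']

# (?s): "." also matches newlines. Without it, this pattern finds nothing.
no_s = re.search(r"first.*Line", text)
with_s = re.search(r"(?s)first.*Line", text).group()
print(no_s)          # None
print(repr(with_s))  # 'first line\nSecond Line'
```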
Old 03-08-2011, 02:29 PM   #5
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by user_none
No. Search and replace does not set any flags by default. Flags such as m, u, s... all must be set by the person writing the regex using (?...) as needed.
That's not what I meant. In the manual, there are references to a locale that can be set in the context of a Python program. If Calibre sets that locale to be the same as the user-selected language, then adding the flag "(?L)" to the expression should alter "\w" to include special characters like "ö" and "ß" in the case of the German locale.
Old 03-08-2011, 04:03 PM   #6
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
In order to support non-ASCII characters with \w+, you need to add the Unicode flag to the regex: (?u), IIRC. Kovid's solution of \S+ works across all regex implementations, though, so there is no need to teach or confuse the user about flags that are Python-specific.

Locale might make a difference, but I'm not really sure on that point... The unicode flag is generally what's used for this issue.
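
The difference the Unicode flag makes can be sketched as follows. This assumes Python 3, where Unicode-aware matching is already the default for text patterns, so the inline (?a) flag is used here to reproduce the ASCII-only behaviour of \w discussed in this thread.

```python
import re

author = "Björn, Weiße"

# (?a) restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"(?a)\w+", author)

# (?u) makes \w Unicode-aware (already the default for str patterns in Python 3).
unicode_words = re.findall(r"(?u)\w+", author)

print(ascii_words)    # ['Bj', 'rn', 'Wei', 'e']
print(unicode_words)  # ['Björn', 'Weiße']
```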
Old 03-08-2011, 04:40 PM   #7
Mixx
Zealot
Posts: 138
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
But isn't the definition of \w an alphabetic character? Or is it ASCII alphabetic character?

Depending on the sorting order, ö is within the set [a-z] (a..oö..z) or outside it (a..o..z..ö). But I thought that is set by the LOCALE, and I was delighted to be able to set Calibre (via a tweak) to a sort order other than just ASCII. This was a major improvement for me.

I'd expect that Calibre/Python all read the LOCALE and interpret \w accordingly. Unless this is a Python issue, not a Calibre issue.

I understand Kovid's response, but strictly speaking, non-whitespace (\S) is not equal to alphabetic (\w).

Anyway, did not want to start a religious discussion, just wanted to point out that this is not the expected behavior outside of English and therefore an opportunity to improve Calibre even further.

Thanxx, Mixx
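
The semantic difference between \S and \w raised in this post can be seen with a name that contains punctuation. This is a sketch in plain Python 3 (where \w is Unicode-aware by default); the author string is invented for the example.

```python
import re

author = "O'Brien, Björn"

# \w stops at the apostrophe and the comma; \S only stops at whitespace.
word_runs = re.findall(r"\w+", author)
non_space_runs = re.findall(r"\S+", author)

print(word_runs)       # ['O', 'Brien', 'Björn']
print(non_space_runs)  # ["O'Brien,", 'Björn']
```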
Old 03-08-2011, 04:48 PM   #8
user_none
Sigil & calibre developer
Posts: 2,433
Karma: 950001
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by Mixx
But isn't the definition of \w an alphabetic character? Or is it ASCII alphabetic character?
Depending on the sorting order, ö is within the set [a-z] (a..oö..z) or outside it (a..o..z..ö). But I thought that is set by the LOCALE, and I was delighted to be able to set Calibre (via a tweak) to a sort order other than just ASCII. This was a major improvement for me.
I'd expect that Calibre/Python all read the LOCALE and interpret \w accordingly. Unless this is a Python issue, not a Calibre issue.
All regex handling is via Python's re module. Again, you need to specify the proper flags, such as u. Otherwise the expected (Python) behavior is to treat only a-z and A-Z as alphabetic characters. See the Python re documentation for more information on this topic.
Old 03-08-2011, 05:28 PM   #9
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Just to get it out of the way: I'm continuing this as an academic discussion; Kovid's solution to use \S+ seems to be the best way.

Before I answered, I tried the regex in plain Python. I found that using (?u) didn't work, and I couldn't properly test (?L), because I couldn't figure out how to set the locale within the five minutes or so I spent on the problem. That's why I was asking if Calibre sets that in its code.
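
The thread does not establish why (?u) had no effect in this test. One possible explanation, offered here purely as an assumption, is that the text was matched as encoded bytes rather than as decoded Unicode text. The sketch below shows that difference in Python 3, where a bytes pattern treats \w as ASCII-only.

```python
import re

author = "Björn, Weiße"

# Matching UTF-8 encoded bytes: \w in a bytes pattern only knows ASCII,
# so the bytes that encode ö and ß break the words apart.
as_bytes = re.sub(rb"(\w+), (\w+)", rb"|\1|\2|", author.encode("utf-8"))

# Matching decoded text: \w is Unicode-aware.
as_text = re.sub(r"(\w+), (\w+)", r"|\1|\2|", author)

print(as_bytes.decode("utf-8"))  # Bjö|rn|Wei|ße
print(as_text)                   # |Björn|Weiße|
```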
Old 03-08-2011, 05:36 PM   #10
kovidgoyal
creator of calibre
Posts: 25,400
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The locale is set in calibre.startup, based not on the language in the calibre interface but on whatever Python thinks is the system's locale.
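
For readers who, like Manichean, were unsure how a Python program picks up the system locale, here is a generic sketch using the standard locale module. It is not calibre's actual startup code, which is not shown in this thread.

```python
import locale

# An empty string asks for the user's default locale
# (typically taken from environment variables such as LANG).
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    # The environment names a locale that is not installed; keep the current one.
    pass

# Calling setlocale() without a second argument only queries the setting.
current = locale.setlocale(locale.LC_CTYPE)
print(current)
```

The printed value depends entirely on the machine the code runs on.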
Old 03-08-2011, 05:48 PM   #11
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
So, using the L flag should work. I'll have to try it in Calibre tomorrow.
Old 03-08-2011, 06:54 PM   #12
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
(?u) should have worked, just doublechecked the docs - was this what you tried?:
Code:
(?u)(\w+), (\w+)
I'm not sure I would call \S+ the 'best' solution. It's a good solution given this specific problem; \S+? might be a bit better in case you were dealing with strings that had multiple commas. And Mixx is also correct that semantically \S and \w are quite different. The Unicode flag is probably the most 'accurate' option.

I can't say that I'm a big fan of the Locale option after thinking about it - based on the Python regex docs that would work, but it would only work for one locale - if you had authors with non-ascii characters from other locales it wouldn't work - a common scenario for translated works.

Last edited by ldolse; 03-08-2011 at 06:57 PM.
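
The point about authors from several locales can be sketched like this: Unicode-aware \w (the default for text patterns in Python 3) accepts letters from any script at once, which a single locale setting could not do. The author names are examples chosen for illustration.

```python
import re

authors = ["Weiße", "Достоевский", "Çelik", "Bjørn"]

# Each name consists only of letters, so \w+ should cover it completely,
# whichever alphabet the letters come from.
fully_matched = [bool(re.fullmatch(r"\w+", name)) for name in authors]
print(fully_matched)  # [True, True, True, True]
```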
Old 03-09-2011, 02:59 AM   #13
Mixx
Zealot
Posts: 138
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
Quote:
Originally Posted by ldolse
(?u) should have worked, just doublechecked the docs - was this what you tried?:
Code:
(?u)(\w+), (\w+)
This does not work for me, unfortunately.

Regards, Mixx
Old 03-09-2011, 03:30 AM   #14
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by ldolse
(?u) should have worked, just doublechecked the docs - was this what you tried?:
Code:
(?u)(\w+), (\w+)
I tried that in plain Python, and the unicode flag didn't make any difference.

Quote:
I can't say that I'm a big fan of the Locale option after thinking about it - based on the Python regex docs that would work, but it would only work for one locale - if you had authors with non-ascii characters from other locales it wouldn't work - a common scenario for translated works.
You're right, I didn't think of that.

All times are GMT -4.