Old 11-30-2011, 05:01 PM   #1
capnm
Groupie
capnm knows the difference between 'who' and 'whom'
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
RegEx & Unicode

I've been using the following regex to abbreviate series names as initialisms:
Code:
\s*([a-zA-Z]|\d+\.?\d*)[a-z\']*\.?\s*

\1
Now that more & more of my series include unicode characters, I'm wondering if there is an easy way to either modify the [a-zA-Z] and [a-z'] terms to include appropriate accented characters, or to transliterate (transcode?) the string before regex processing.

Or is my best bet just to manually transcode my series? (yuck)
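The transliterate-first idea can be sketched in plain Python. This is only an illustration, assuming Python 3 and the standard `unicodedata` module; `strip_accents` and `abbreviate` are hypothetical helper names, not calibre functions. It decomposes accented letters, drops the combining marks, and then applies the regex from this post unchanged.

```python
import re
import unicodedata

# The initialism regex from the post above, unchanged apart from quoting.
INITIALISM = re.compile(r"\s*([a-zA-Z]|\d+\.?\d*)[a-z']*\.?\s*")


def strip_accents(text):
    """Decompose accented letters and drop the combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def abbreviate(series):
    """Transliterate first, then reduce each word to its initial (hypothetical helper)."""
    return INITIALISM.sub(r'\1', strip_accents(series))


print(abbreviate('Föô bár'))  # Fb
```

Letters with no decomposition (ø or ß, for example) pass through untouched, so this only covers part of the problem, and the abbreviation loses the original accents.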

Last edited by capnm; 11-30-2011 at 11:04 PM. Reason: fixing typo I made while removing parentheses
Old 11-30-2011, 09:20 PM   #2
Serpentine
Evangelist
Serpentine ought to be getting tired of karma fortunes by now.
 
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Supply a sample(s) and expected result(s), make life easy.
Old 11-30-2011, 09:21 PM   #3
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
 
Posts: 9,897
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Also where are you using this and why?
Old 11-30-2011, 10:58 PM   #4
capnm
Groupie
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
Quote:
Originally Posted by Serpentine View Post
Supply a sample(s) and expected result(s), make life easy.
Föô bár
Fb

Though that's pretty irrelevant. I'm not looking to debug this particular regex, or to start adding tons of individual unicode characters to it.

I'm wondering if calibre's flavor of regex is/can be unicode aware, since I suspect some flavors of regex are, but I've never had occasion to explore the issue before.

Alternatively I thought there might be some calibre template functions that would transliterate a unicode string (though that would have other side effects).

Quote:
Originally Posted by dwanthny View Post
Also where are you using this and why?
At the moment -- in custom columns and plugboards to abbreviate long series names.

But again, it's more of a general question, since at various times, for various reasons, authors, titles, series, etc., get plugged into regexps, and they all have the occasional unicode character which doesn't fall into the standard [a-zA-Z] or \w range.

Last edited by capnm; 11-30-2011 at 11:13 PM.
Old 11-30-2011, 11:24 PM   #5
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
See http://docs.python.org/library/re.html#re.UNICODE
Old 11-30-2011, 11:52 PM   #6
capnm
Groupie
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
Quote:
Originally Posted by kovidgoyal View Post
Thanks, I couldn't figure out how to invoke that, so I wasn't sure if it was applicable, but I finally found how to use the (?u) flag.
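For anyone wondering what the flag actually changes, here is a small sketch, assuming Python 3 rather than the Python 2 interpreter in use when this thread was written. In Python 3, string patterns are Unicode-aware by default, so `re.ASCII` stands in for the old flag-less behaviour and `(?u)` is accepted but redundant.

```python
import re

sample = 'Föô bár'

# ASCII-only \w, comparable to Python 2 without (?u): accented letters split the words.
ascii_words = re.findall(r'\w+', sample, re.ASCII)

# Unicode-aware \w, the effect of (?u) / re.UNICODE.
unicode_words = re.findall(r'(?u)\w+', sample)

print(ascii_words)    # ['F', 'b', 'r']
print(unicode_words)  # ['Föô', 'bár']
```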

So I think I'm most of the way there ... but I could still use a little help

I should be able to replace [a-zA-Z] with (?u)[\w\D] (if I ignore _ for now), right?
[edit: of course this doesn't work -- I'm always trying to stick exclusions in a group and it's never worked yet]

But is there an easy equivalent to [a-z] ?

Last edited by capnm; 12-01-2011 at 01:50 AM.
Old 11-30-2011, 11:56 PM   #7
kovidgoyal
creator of calibre
 
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If you mean only lowercase letters, then no. Though you can use unicode character ranges, like this [\u0028-\u0046] if you know the character ranges you want.
Old 12-01-2011, 01:44 AM   #8
capnm
Groupie
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
Well, I'm stumped again.

I mean -- [\u0000-\uFFFF]* should match anything, including punctuation and two-part characters, right?

But not only does it not grab accented characters, it doesn't grab v,w,x,y, or z.
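A likely explanation, assuming the pattern reached a regex engine that does not understand \u escapes (as far as I know this was the case for Python 2's `re` module): each `\u` is read as a plain letter u, so the class collapses to a few literals plus the range from `0` to `u`. That range stops just before v and contains no accented characters, which matches the symptom. The sketch below reproduces the misreading in Python 3 by writing the collapsed class out by hand, then shows the same pattern working where \u escapes are supported (Python 3.3 and later).

```python
import re

# What an engine without \u support effectively sees: the range '0' to 'u'
# (the stray literals 0, u and F all fall inside that range anyway).
misread = re.compile(r'[0-u]*')

# The pattern as intended, on an engine that parses \uXXXX inside the regex.
intended = re.compile(r'[\u0000-\uFFFF]*')

for text in ('abcu', 'vwxyz', 'Föô'):
    print(text, bool(misread.fullmatch(text)), bool(intended.fullmatch(text)))
```

Only 'abcu' is matched in full by the misread class, while the intended class matches all three strings.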

Old 12-01-2011, 04:23 AM   #9
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123457
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I think re.UNICODE only causes \w & \W to match non-ascii characters, at least practically speaking. Which would be okay except that \w also includes numbers - if you're okay with matching numbers then \w+ should be ok.

I've always wished it would make [a-zA-Z] work the way capnm wants. I suppose you might be able to mix it with a digit-exclusion lookahead:

(?u)(?=[^\d]+)(\w+)

But it's going to get tricky.
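One reason it gets tricky, shown as a rough Python 3 sketch: the lookahead is tested only once, in front of the first character, so digits later in the word still get through. Repeating the check before every character is one possible variant (my own assumption, not something proposed in the thread); note that it still lets underscores through.

```python
import re

sample = 'a1b 22 c3'

# The pattern above: the lookahead guards only the first character of each match.
checked_once = re.findall(r'(?u)(?=[^\d]+)(\w+)', sample)

# Hypothetical variant: test "not a digit" in front of every character.
checked_each = re.findall(r'(?u)(?:(?!\d)\w)+', sample)

print(checked_once)  # ['a1b', 'c3']
print(checked_each)  # ['a', 'b', 'c']
```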

Last edited by ldolse; 12-01-2011 at 04:26 AM.
Old 12-01-2011, 04:31 AM   #10
kovidgoyal
creator of calibre
 
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Do a bit of googling on how to use unicode char ranges in python regexps. I haven't ever used them myself, so I cannot comment.
Old 12-01-2011, 11:57 AM   #11
Serpentine
Evangelist
 
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
You can most likely use something like :
Code:
(?i)(?:^|\s+)(\d+\.?\d*?|[\D])
To grab all of the interesting first characters/numbers

Code:
string = r'Föô bár  šjohka'
>>> regex.findall(string)
[u'F', u'b', u'\xe1']
I'm sure you can work it into a replacement without too much of a problem.
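The session above does not show how `regex` was built, so here is a self-contained guess at it, assuming Python 3 and that `regex` was simply the compiled pattern from this post. `FIRST_CHARS` and `sample` are names picked for the sketch. Under those assumptions the third initial comes out as š, which may differ from the interpreter output quoted above.

```python
import re

# The pattern from this post, compiled as-is.
FIRST_CHARS = re.compile(r'(?i)(?:^|\s+)(\d+\.?\d*?|[\D])')

sample = 'Föô bár  šjohka'
initials = FIRST_CHARS.findall(sample)

print(initials)           # ['F', 'b', 'š']
print(''.join(initials))  # Fbš
```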
Old 12-01-2011, 06:33 PM   #12
capnm
Groupie
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
Quote:
Originally Posted by ldolse View Post
I think re.UNICODE only causes \w & \W to match non-ascii characters, at least practically speaking. Which would be okay except that \w also includes numbers - if you're okay with matching numbers then \w+ should be ok.

I've always wished it would make [a-zA-Z] work the way capnm wants. I suppose you might be able to mix it with a digit-exclusion lookahead:

(?u)(?=[^\d]+)(\w+)

But it's going to get tricky.
And what I really want is some form of (?u)[a-z] or (?u)[A-Z] to work, but I think I'm out of luck on that one.

I played/poked around a bit and here's what I found (which may even be correct):

This flavor of python regex supports (?u), which makes \w, \d, \b unicode aware.
It doesn't support \unnnn or \Unnnnnnnn.
It doesn't support upper/lower properties or character classes.


Revising your lookahead idea, I think this will emulate a unicode aware [a-zA-Z]

(?u)\w(?!(?<=[\d_]))
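A quick check of that pattern in Python 3 (an assumption; the thread predates it), next to the negated class `[^\W\d_]`, which is a common shorthand for "word character that is neither a digit nor an underscore". On this sample both pick out the same letters.

```python
import re

sample = 'Föô_9 bár'

# The lookaround version from this post.
with_lookaround = re.findall(r'(?u)\w(?!(?<=[\d_]))', sample)

# Negated-class shorthand: not a non-word character, not a digit, not an underscore.
with_negated_class = re.findall(r'[^\W\d_]', sample)

print(with_lookaround)     # ['F', 'ö', 'ô', 'b', 'á', 'r']
print(with_negated_class)  # ['F', 'ö', 'ô', 'b', 'á', 'r']
```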

but that doesn't solve my wish ...

Oh, well. This was supposed to be a quick exercise in tweaking some template code. Now I'm just being stubborn

Since I don't foresee any great inspiration on how to make a unicode [a-z], I'll probably settle for adding [à-ÿ] to at least make it Latin-1 aware ...
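One caveat with the Latin-1 range, sketched in Python 3: the span from à (U+00E0) to ÿ (U+00FF) also contains the division sign ÷ (U+00F7), and it stops short of letters such as š (U+0161). Splitting the range around ÷ is one way to tighten it; `LATIN1_LETTERS` is a name made up for this example.

```python
import re

# The range as proposed: U+00E0 (à) through U+00FF (ÿ).
LATIN1_RANGE = re.compile('[\u00e0-\u00ff]')

# Tightened variant (an assumption): skip U+00F7, the division sign.
LATIN1_LETTERS = re.compile('[\u00e0-\u00f6\u00f8-\u00ff]')

print(bool(LATIN1_RANGE.fullmatch('\u00f7')))    # True: the division sign is inside the range
print(bool(LATIN1_LETTERS.fullmatch('\u00f7')))  # False
print(bool(LATIN1_RANGE.fullmatch('\u0161')))    # False: š lies outside Latin-1
```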
Old 12-01-2011, 06:47 PM   #13
capnm
Groupie
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
Quote:
Originally Posted by Serpentine View Post
You can most likely use something like :
Code:
(?i)(?:^|\s+)(\d+\.?\d*?|[\D])
To grab all of the interesting first characters/numbers
Yes ... that's very like one of my variations:
Code:
\s*(\d+\.?\d*\w?|\w)[a-z_\']*\.?\s*
But I'm curious -- why the leading (?:^|\s+) instead of \s*? Is there a functional difference?

Thanks!
Old 12-01-2011, 07:17 PM   #14
Serpentine
Evangelist
 
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by capnm View Post
But I'm curious -- why the leading (?:^|\s+) instead of \s* is there a functional difference?
If you're not using the unicode support or don't have the locale flag set, some non-whitespace characters (accented letters, and also punctuation you want to avoid) are not matched as part of a word, so they act as a break in the middle of it. Because \s* can match zero characters, the next letter after such a break, which may well be in the middle of a word, would then be used as an initial.

By specifying that the match has to start either at the start of the string (careful of multiline issues) or after one or more whitespace characters, this situation is avoided, as words can only be separated by whitespace.
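A small Python 3 sketch of that difference, using a deliberately ASCII-only letter class so the accented characters behave like breaks. The two patterns are trimmed-down stand-ins for the ones discussed in this thread, not the full versions.

```python
import re

sample = 'Föô bár'

# Optional whitespace: a match may start anywhere, including mid-word after 'á'.
loose = re.findall(r'\s*([a-zA-Z])', sample)

# Anchored: a match must begin at the start of the string or after whitespace.
anchored = re.findall(r'(?:^|\s+)([a-zA-Z])', sample)

print(loose)     # ['F', 'b', 'r']
print(anchored)  # ['F', 'b']
```

The stray 'r' in the first result is the mid-word letter being picked up as an initial.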

If you want to use it for replacement - as you wanted, the pattern would be :
Code:
find: (?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+
replace: \1
Tho it then uses the unicode flag, a trade off between being robust and easily matching things.
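For reference, this is how the find/replace pair behaves when run through `re.sub`, assuming Python 3 (where the `u` flag is accepted but redundant for string patterns). `FIND` and `abbreviated` are names chosen for the sketch, and only the sample from earlier in the thread is covered; numeric series names are not exercised here.

```python
import re

# The find pattern and replacement from this post.
FIND = re.compile(r'(?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+')

abbreviated = FIND.sub(r'\1', 'Föô bár  šjohka')

print(abbreviated)  # Fbš
```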

Last edited by Serpentine; 12-01-2011 at 07:43 PM.
Old 12-01-2011, 08:23 PM   #15
capnm
Groupie
 
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
@Serpentine,

Yes -- that makes sense ... and might be a good way to address what led to my latest round of tweaking -- accented chars from the middle of a word popping up in my abbreviations.
Of course it will complicate the other tweaking I've done over time to make the abbreviations more readable/pertinent, like including most punctuation, but not periods and quotes, and including numeric strings and all capital letters, and ...

Hmmm ... if I abandon including all capital letters, the rest will probably fall together -- that's probably the unicode sticking point ...

After several tweaks, these regexps are probably best rewritten from scratch as they've accumulated redundancies and idiosyncrasies, but sometimes I'm lazy

Maybe I'll focus on redoing my {author_sort}{series}-->{author} plugboard template for the Sony, since someone else might find it useful ...

Thanks!

Last edited by capnm; 12-01-2011 at 08:26 PM.
All times are GMT -4.