| 
			
			 | 
		#1 | 
| 
			
			
			
			 Groupie 
			
			Posts: 156 
				Karma: 10001 
				Join Date: Feb 2011 
				
				
				
				Device: sony 
				
				
				 | 
	
	
	
		
		
			
			 
				
				RegEx & Unicode
			 
			
			
I've been using the following regex to abbreviate series names as initialisms:

Code:
	\s*([a-zA-Z]|\d+\.?\d*)[a-z\']*\.?\s*
	\1

Or is my best bet just to manually transcode my series? (yuck)

Last edited by capnm; 12-01-2011 at 12:04 AM. Reason: fixing typo I made while removing parentheses  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 Evangelist 
			
			Posts: 416 
				Karma: 1045911 
				Join Date: Sep 2011 
				Location: Cape Town, South Africa 
				
				
				Device: Kindle 3 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Supply a sample (or several) and the expected result(s); it makes life easy.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 US Navy, Retired 
			
			Posts: 9,897 
				Karma: 13806776 
				Join Date: Feb 2009 
				Location: North Carolina 
				
				
				Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Also, where are you using this, and why?
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 Groupie 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
Sample: Föô bár

Expected: Fb

Though that's pretty irrelevant. I'm not looking to debug this particular regex, or to start adding tons of individual unicode characters to it. I'm wondering if calibre's flavor of regex is/can be unicode aware, since I suspect some flavors of regex are, but I've never had occasion to explore the issue before. Alternatively, I thought there might be some calibre template functions that would transliterate a unicode string (though that would have other side effects).

At the moment -- in custom columns and plugboards to abbreviate long series names. But again, it's more of a general question, since at various times, for various reasons, authors, titles, series, etc., get plugged into regexps, and they all have the occasional unicode character which doesn't fall into the standard [a-zA-Z] or \w range.

Last edited by capnm; 12-01-2011 at 12:13 AM.  | 
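To make the symptom concrete, here is a minimal sketch that runs the regex from the first post through Python's `re.sub`. It assumes calibre's search-and-replace behaves like plain `re.sub` with `\1` as the replacement; the helper name `abbreviate` is made up for illustration.

```python
import re

# Pattern and replacement copied from the first post.
PATTERN = re.compile(r"\s*([a-zA-Z]|\d+\.?\d*)[a-z\']*\.?\s*")


def abbreviate(series):
    """Hypothetical helper: keep only the captured lead character of each match."""
    return PATTERN.sub(r"\1", series)


if __name__ == "__main__":
    print(abbreviate("Foo bar"))                 # plain ASCII input
    print(abbreviate("F\u00f6\u00f4 b\u00e1r"))  # "Föô bár" from the sample
```

With ASCII input, `[a-z\']*` swallows the rest of each word. With the accented sample it stops at the first non-ASCII letter, so the accented letters are left behind in the output instead of being consumed.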
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | |
| 
			
			
			
			 Groupie 
			
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
So I think I'm most of the way there ... but I could still use a little help.

I should be able to replace [a-zA-Z] with (?u)[/w/D] (if I ignore _ for now), right? [edit: of course this doesn't work -- I'm always trying to stick exclusions in a group and it's never worked yet]

But is there an easy equivalent to [a-z]?

Last edited by capnm; 12-01-2011 at 02:50 AM.  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			If you mean only lowercase letters, then no. You can, however, use unicode character ranges, like [\u0028-\u0046], if you know the ranges you want.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Groupie 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
Well, I'm stumped again.

I mean -- [\u0000-\uFFFF]* should match anything, including punctuation and two-part characters, right? But not only does it not grab accented characters, it doesn't grab v, w, x, y, or z.
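One possible explanation for the missing v to z, offered as an assumption rather than something confirmed in this thread: if the engine does not understand `\u` and reads it as a literal `u`, then `[\u0000-\uFFFF]` collapses to `[u0000-uFFFF]`, whose only real range is `0-u`. The sketch below builds that collapsed class by hand in Python 3 (whose `re` module does parse `\uXXXX`) and compares the two. The helper `matches` is illustrative only.

```python
import re

# Assumption: the older engine treats "\u" as a plain "u", so the class
# becomes u, 0, 0, 0, the range 0-u (U+0030..U+0075), then F, F, F, F.
collapsed = re.compile(r"[u0000-uFFFF]")

# Python 3.3+ parses \uXXXX inside the pattern itself, so this class
# really spans U+0000..U+FFFF.
intended = re.compile(r"[\u0000-\uFFFF]")


def matches(pattern, chars):
    """Return the characters in `chars` that the one-character class accepts."""
    return [c for c in chars if pattern.fullmatch(c)]


if __name__ == "__main__":
    sample = "stuvwxyz\u00e9"  # ends with "é"
    print(matches(collapsed, sample))
    print(matches(intended, sample))
```

Under that assumption the collapsed class stops at `u`, which would line up with v, w, x, y, z and accented letters all being skipped.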
		 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | 
| 
			
			
			
			 Wizard 
			
			Posts: 1,337 
				Karma: 123457 
				Join Date: Apr 2009 
				Location: Malaysia 
				
				
				Device: PRS-650, iPhone 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
I think re.UNICODE only causes \w & \W to match non-ascii characters, at least practically speaking. Which would be okay except that \w also includes numbers - if you're okay with matching numbers then \w+ should be ok.

I've always wished it would make [a-zA-Z] work the way capnm wants. I suppose you might be able to mix it with a digit-exclusion lookahead:

	(?u)(?=[^\d]+)(\w+)

But it's going to get tricky.

Last edited by ldolse; 12-01-2011 at 05:26 AM.  | 
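The lookahead in this post only checks the first character, so digits can still slip into the match. A commonly used alternative in Python's `re` (not taken from this thread) is the double-negative class `[^\W\d_]`, which reads as "a word character that is neither a digit nor an underscore". A small Python 3 sketch comparing the two, with a made-up test string:

```python
import re

# Lookahead variant quoted in the post: only the first character is
# guaranteed to be a non-digit.
lookahead = re.compile(r"(?u)(?=[^\d]+)(\w+)")

# Double negative: word characters minus digits minus underscore.
letters_only = re.compile(r"[^\W\d_]+")

if __name__ == "__main__":
    text = "abc1 2nd F\u00f6\u00f4"  # last word is "Föô"
    print(lookahead.findall(text))
    print(letters_only.findall(text))
```

Whether calibre's template engine accepts the same class is an assumption worth testing on a real column before relying on it.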
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#10 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Do a bit of googling on how to use unicode char ranges in python regexps. I haven't ever used them myself, so I cannot comment.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#11 | 
| 
			
			
			
			 Evangelist 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
You can most likely use something like:

Code:
	(?i)(?:^|\s+)(\d+\.?\d*?|[\D])

Code:
	string = r'Föô bár šjohka'
	>>> regex.findall(string)
	[u'F', u'b', u'\xe1']
 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#12 | |
| 
			
			
			
			 Groupie 
			
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
I played/poked around a bit and here's what I found (which may even be correct):

This flavor of python regex supports (?u), which makes \w, \d, \b unicode aware.
It doesn't support \unnnn or \Unnnnnnnn.
It doesn't support upper/lower properties or character classes.

Revising your lookahead idea, I think this will emulate a unicode aware [a-zA-Z]:

	(?u)\w(?!(?<=[\d_]))

but that doesn't solve my wish ... Oh, well. This was supposed to be a quick exercise in tweaking some template code. Now I'm just being stubborn.

Since I don't foresee any great inspiration on how to make a unicode [a-z], I'll probably settle for adding [à-ÿ] to at least make it Latin-1 aware ...  | 
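A quick way to sanity-check the lookaround trick is to run it over a mixed string. The sketch below uses Python 3's `re`, which is an assumption about the engine, and the test string is made up.

```python
import re

# Pattern from the post: match a word character, then reject it if the
# character just consumed was a digit or an underscore.
letter = re.compile(r"(?u)\w(?!(?<=[\d_]))")

if __name__ == "__main__":
    # Input is "Föô_bár 42"; the underscore, space and digits should drop out.
    print(letter.findall("F\u00f6\u00f4_b\u00e1r 42"))
```

The lookbehind sits inside the negative lookahead, so it inspects the character `\w` has just consumed rather than the one that follows.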
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#13 | |
| 
			
			
			
			 Groupie 
			
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
Code:
	\s*(\d+\.?\d*\w?|\w)[a-z_\']*\.?\s*

is there a functional difference? Thanks!  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#14 | |
| 
			
			
			
			 Evangelist 
			
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
By specifying that the starting point has to be either the start of the string (careful of multiline issues) or whitespace, this situation is removed, as words can only be separated by one or more spaces.

If you want to use it for replacement, as you wanted, the pattern would be:

Code:
	find: (?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+
	replace: \1

Last edited by Serpentine; 12-01-2011 at 08:43 PM.  | 
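As a rough check, the find/replace pair can be tried with `re.sub` in Python 3. This assumes calibre applies the replacement the way `re.sub` does; the function name `initials` is only illustrative.

```python
import re

# Find pattern from the post; the replacement is the first captured group.
FIND = re.compile(r"(?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+")


def initials(text):
    """Hypothetical helper: collapse each word to its captured first character."""
    return FIND.sub(r"\1", text)


if __name__ == "__main__":
    print(initials("F\u00f6\u00f4 b\u00e1r \u0161johka"))  # "Föô bár šjohka"
```

Note that `[\w]+` needs at least one more word character after the captured one, so one-letter words are left untouched by this pattern.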
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#15 | 
| 
			
			
			
			 Groupie 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
@Serpentine,

Yes -- that makes sense ... and might be a good way to address what led to my latest round of tweaking -- accented chars from the middle of a word popping up in my abbreviations. Of course it will complicate the other tweaking I've done over time to make the abbreviations more readable/pertinent, like including most punctuation, but not periods and quotes, and including numeric strings and all capital letters, and ...

Hmmm ... if I abandon including all capital letters, the rest will probably fall together -- that's probably the unicode sticking point ...

After several tweaks, these regexps are probably best rewritten from scratch as they've accumulated redundancies and idiosyncrasies, but sometimes I'm lazy.

Maybe I'll focus on redoing my {author_sort}{series}-->{author} plugboard template for the Sony, since someone else might find it useful ...

Thanks!

Last edited by capnm; 12-01-2011 at 09:26 PM.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 