\b matches accented characters

ElMiko · 06-13-2012, 11:45 AM

I was trying to catch instances where a blank space had been inserted in place of an apostrophe, rendering strings such as "John s " or "we ve " or "don t ", etc. So, I came up with:

Code:

\s(?=([st]|re|ve|ll)\b)

Which worked very nicely, as I replaced the highlighted blank space with an apostrophe... until I realized that the "\b" was also matching in front of accented characters (e.g. "a séance" or " töten").

Any ideas?

DiapDealer · 06-13-2012, 02:54 PM

Turn on the unicode properties (*UCP) so \b becomes unicode-aware. Otherwise it treats those accented characters as non-word characters, so it sees a word boundary right after the "s" or the "t".

Code:

(*UCP)\s(?=([st]|re|ve|ll)\b)

I used this text as a test case:

Code:

<p>a séance töten don t</p>
<p>don tyou see sheriff s</p>
<p>we ll I'll be a mönkey s uncle</p>
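For anyone who wants to try the same thing outside a PCRE-based editor, here is a rough sketch in Python. The assumptions are mine, not from the posts above: Python's re module has no (*UCP) verb and its \b is already unicode-aware for ordinary strings, so the re.ASCII flag is used only to imitate the behaviour without (*UCP). The helper name fix_apostrophes is made up for the example.

```python
import re

# Pattern from the thread; Python's re has no (*UCP) verb, so it is left out.
PATTERN = r"\s(?=([st]|re|ve|ll)\b)"

def fix_apostrophes(text, ascii_only=False):
    """Replace the stray space with an apostrophe (hypothetical helper)."""
    flags = re.ASCII if ascii_only else 0
    return re.sub(PATTERN, "'", text, flags=flags)

sample = "a s\u00e9ance t\u00f6ten don t"   # "a séance töten don t"

# ASCII-only \b: é and ö count as non-word characters, so "s" and "t" look like whole words.
print(fix_apostrophes(sample, ascii_only=True))

# Unicode-aware \b: only the space in "don t" is replaced.
print(fix_apostrophes(sample))
```

The first call mangles "séance" and "töten" the way the original search did; the second only touches "don t".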

ElMiko · 06-13-2012, 05:29 PM

Thanks, as always, DD... works like a charm

tangentially, does using the (*UCP) mean that [a-z] would match the same results as \p{Ll}?

DiapDealer · 06-13-2012, 06:21 PM

Quote:

Originally Posted by ElMiko

Thanks, as always, DD... works like a charm

You're quite welcome.

Quote:

tangentially, does using the (*UCP) mean that [a-z] would match the same results as \p{Ll}?

Sadly, no. Using (*UCP) just means that the behavior of \b \B \w \W \d \D and \s \S and some of the POSIX classes is changed. [a-z] can never be anything other than [a-z].
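A small Python sketch of the difference, with one caveat: Python's built-in re module does not support \p{Ll}, so the unicodedata category check below is only a stand-in for it (my assumption, not something from the editor's regex engine).

```python
import re
import unicodedata

text = "s\u00e9ance t\u00f6ten"   # "séance töten"

# [a-z] is a plain code-point range, so é and ö are never part of it.
ascii_lower = "".join(re.findall(r"[a-z]", text))

# Stand-in for \p{Ll}: keep characters whose Unicode category is "Ll".
unicode_lower = "".join(ch for ch in text if unicodedata.category(ch) == "Ll")

print(ascii_lower)
print(unicode_lower)
```

The range silently drops the accented letters, while the category check keeps them.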

Jellby · 06-14-2012, 04:12 AM

Besides, there are languages where A and Z are not the first and last letters of the alphabet. For instance, in Danish and Norwegian the alphabet is A ... Z Æ Ø Å (yes, dictionaries are from A to Å). What would [a-z] mean in these cases if it were to be extended to non-ascii characters?

elibrarian · 06-14-2012, 07:40 AM

Quote:

Originally Posted by Jellby

Besides, there are languages where A and Z are not the first and last letters of the alphabet. For instance, in Danish and Norwegian the alphabet is A ... Z Æ Ø Å (yes, dictionaries are from A to Å). What would [a-z] mean in these cases if it were to be extended to non-ascii characters?

[a-z] means exactly that, as far as I know - the small letters a through z. When I use regex to search for the full Danish alphabet, I usually use [a-zæøå] or [A-ZÆØÅ]. Which of course doesn't find any other characters, accented or not, but they would not be part of the Danish alphabet anyway ...

Regards,

Kim

DiapDealer · 06-14-2012, 08:32 AM

Quote:

Originally Posted by elibrarian

When I use regex to search for the full Danish alphabet, I usually use [a-zæøå] or [A-ZÆØÅ]. Which of course doesn't find any other characters, accented or not, but they would not be part of the Danish alphabet anyway ...

I find characters in English-language books that are not from the English alphabet all the time... does this never happen in Danish books?

Why not just use \p{L} and catch all potential unicode letters? That's more than likely what people are intending to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.

I just know I've found that when using "letters" for search criteria in a regexp on an English-language text... thinking strictly in terms of "English letters" will often produce results I didn't really intend. The original topic of this thread is a perfect example of this. So I've learned to approach Regex Find & Replace from a "unicode first" frame of mind when it comes to ebooks.
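To illustrate with the words mentioned above, here is a short Python sketch. Python's re module has no \p{L}, so [^\W\d_] (word characters minus digits and underscore) is used as an approximation; that substitution is my assumption, and it is not strictly identical to \p{L} because it can also admit a few numeric symbols.

```python
import re

text = "caf\u00e9 fa\u00e7ade na\u00efve"   # "café façade naïve"

# ASCII-only letter class: each word is cut at its accented character.
ascii_words = re.findall(r"[A-Za-z]+", text)

# Approximation of \p{L}+ in Python's re: word characters minus digits and underscore.
unicode_words = re.findall(r"[^\W\d_]+", text)

print(ascii_words)
print(unicode_words)
```

The ASCII class returns fragments such as "caf" and "ade", while the unicode-aware class returns the three words whole.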

ElMiko · 06-14-2012, 08:39 AM

related question:

why does \b match letters that come after an apostrophe?
e.g. a search for ’\b matches the apostrophes in "there’s" "it’s" "Bob’s", etc...

elibrarian · 06-14-2012, 08:59 AM

Quote:

Originally Posted by DiapDealer

I find characters in English-language books that are not from the English alphabet all the time... does this never happen in Danish books?

Generally speaking, no, not often. In some books I find quotations from other languages (mostly French or German), but as I never let a book loose without actually proofreading it from one end to the other, that is rarely a problem.

Quote:

Originally Posted by DiapDealer

Why not just use \p{L} and catch all potential unicode letters? That's more than likely what people are intending to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.

Café can be spelled both café and cafe in Danish, facade is spelled without the cedilla, naïve is naiv and so on - just to take your examples.

But I'll definitely try \p{L}

One learns new tricks every day

Regards,

Kim

DiapDealer · 06-14-2012, 09:15 AM

Quote:

Originally Posted by ElMiko

related question:

why does \b match letters that come after an apostrophe?
e.g. a search for ’\b matches the apostrophes in "there’s" "it’s" "Bob’s", etc...

\b doesn't really "match" any characters—or more technically, its match is zero-length. It matches word boundaries. Which can be:

* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.

A word character—without the (*UCP) flag—is [a-zA-Z0-9_] or \w

"There's"—for better or worse—is not one word in the eyes of regex. Because an apostrophe is not a word character. "There" would be one word and "s" would be another.

What are you wishing ’\b would find?
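The three boundary cases above can be checked with a small Python sketch. It uses Python's re module rather than the editor's PCRE engine (an assumption on my part), but the word-boundary rules are the same for this example.

```python
import re

word = "there\u2019s"   # "there’s" with a typographic apostrophe

# \b is zero-length, so list the positions where it matches instead of any text.
boundaries = [m.start() for m in re.finditer(r"\b", word)]
print(boundaries)

# \w+ shows the separate "words" the regex engine sees.
words = re.findall(r"\w+", word)
print(words)
```

There is a boundary on each side of the apostrophe (positions 5 and 6) as well as at the start and end of the string, which is why ’\b matches the apostrophe in "there’s".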

ElMiko · 06-14-2012, 12:16 PM

Quote:

Originally Posted by DiapDealer

\b doesn't really "match" any characters—or more technically, its match is zero-length. It matches word boundaries. Which can be:

* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.

A word character—without the (*UCP) flag—is [a-zA-Z0-9_] or \w

"There's"—for better or worse—is not one word in the eyes of regex. Because an apostrophe is not a word character. "There" would be one word and "s" would be another.

What are you wishing ’\b would find?

Again, thanks for the tutorial. Why is it that when an MR poster explains something it makes complete sense, but when I try to read an official regex tutorial I actually feel my brain cells dying and my life expectancy withering?

The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or to denote an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.

DiapDealer · 06-14-2012, 12:50 PM

Quote:

Originally Posted by ElMiko

The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or to denote an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.

Could be tough to differentiate possessive apostrophes or contractions from a closing single quote with any accuracy. But you might be able to narrow it down enough to feasibly inspect each occurrence.

A lot of times (but certainly not always) in a closing quote situation, the previous character is going to be punctuation of some kind. Quotes within quotes will probably foul things up, though.
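As a rough illustration of that heuristic, here is a Python sketch that flags a ’ only when the character before it is sentence punctuation. The sample text and the punctuation set [.,!?] are invented for the example, and the results would still need to be inspected one by one.

```python
import re

# Hypothetical OCR output: the closing double quote after "Stop." came out as ’.
text = "\u201cStop.\u2019 It\u2019s the boys\u2019 turn."

# Flag a ’ only when the previous character is sentence punctuation.
SUSPECT = re.compile(r"(?<=[.,!?])\u2019")

hits = [m.start() for m in SUSPECT.finditer(text)]
print(hits)
```

Only the ’ after "Stop." is flagged; the contraction in "It’s" and the plural possessive in "boys’" are left alone.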
