MobileRead Forums > E-Book Software > Sigil

Old 06-13-2012, 11:45 AM   #1
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
\b matches accented characters

I was trying to catch instances where a blank space had been inserted in place of an apostrophe, rendering strings such as "John s " or "we ve " or "don t ", etc. So, I came up with:

Code:
\s(?=([st]|re|ve|ll)\b)
Which worked very nicely, as I replaced the highlighted blank space with an apostrophe... until I realized that the "\b" was also matching in front of accented characters (e.g. "a séance" or " töten").

Any ideas?
Old 06-13-2012, 02:54 PM   #2
DiapDealer
Grand Sorcerer
Posts: 27,547
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Turn on the Unicode properties with (*UCP) so \b becomes Unicode-aware. Otherwise it treats those accented characters as non-word characters, so it finds a word boundary right in front of them.

Code:
(*UCP)\s(?=([st]|re|ve|ll)\b)
I used this text as a test case:
Code:
<p>a séance töten don t</p>
<p>don tyou see sheriff s</p>
<p>we ll I'll be a mönkey s uncle</p>
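For anyone who wants to see the effect outside a PCRE-based editor, here is a rough sketch in Python. It is only an analogy: Python's built-in re module has no (*UCP) verb, but its re.ASCII flag gives ASCII-only \b and \w, while the default mode for str patterns is Unicode-aware. The names PATTERN, LINES and fix are made up for this illustration.

```python
import re

# The pattern from the thread: a space followed by a dangling contraction ending.
PATTERN = r"\s(?=([st]|re|ve|ll)\b)"

# The three test lines from the post above. Accented letters are written as
# escapes so they stay precomposed (\u00e9 = é, \u00f6 = ö).
LINES = [
    "<p>a s\u00e9ance t\u00f6ten don t</p>",
    "<p>don tyou see sheriff s</p>",
    "<p>we ll I'll be a m\u00f6nkey s uncle</p>",
]


def fix(line, flags=0):
    """Replace each matched space with a straight apostrophe."""
    return re.sub(PATTERN, "'", line, flags=flags)


# re.ASCII stands in for "no (*UCP)": é and ö count as non-word characters.
ascii_fixed = [fix(line, re.ASCII) for line in LINES]
# Python's default str mode stands in for "(*UCP) on".
unicode_fixed = [fix(line) for line in LINES]

for before, a, u in zip(LINES, ascii_fixed, unicode_fixed):
    print(before)
    print("  ASCII-only \\b :", a)
    print("  Unicode \\b    :", u)
```

In ASCII-only mode the first line picks up two unwanted apostrophes (before "séance" and "töten"); in Unicode-aware mode only "don t" is changed.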
Old 06-13-2012, 05:29 PM   #3
ElMiko
Thanks, as always, DD... works like a charm

tangentially, does using the (*UCP) mean that [a-z] would match the same results as \p{Ll}?
Old 06-13-2012, 06:21 PM   #4
DiapDealer
Quote:
Originally Posted by ElMiko View Post
Thanks, as always, DD... works like a charm
You're quite welcome.

Quote:
tangentially, does using the (*UCP) mean that [a-z] would match the same results as \p{Ll}?
Sadly, no. Using (*UCP) just changes the behavior of \b, \B, \w, \W, \d, \D, \s, \S, and some of the POSIX classes. [a-z] can never be anything other than [a-z].
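A small sketch of that difference, in Python rather than PCRE. Python's standard re module does not support \p{Ll}, so the helper is_lowercase_letter below (a made-up name) uses unicodedata.category as a stand-in for the "Ll" property. Treat it as an illustration of the idea, not of how any particular editor implements it.

```python
import re
import unicodedata


def is_lowercase_letter(ch):
    """Stand-in for \\p{Ll}: true if the Unicode general category is 'Ll'."""
    return unicodedata.category(ch) == "Ll"


# a, é, ø, z (escapes keep the accented letters precomposed)
sample = "a\u00e9\u00f8z"

# [a-z] is a plain code point range, even in Python's Unicode-aware mode.
in_ascii_range = [ch for ch in sample if re.fullmatch(r"[a-z]", ch)]
# The Ll category covers every lowercase letter, accented or not.
in_lowercase_category = [ch for ch in sample if is_lowercase_letter(ch)]

print(in_ascii_range)
print(in_lowercase_category)
```

Only "a" and "z" fall inside [a-z], while all four characters are lowercase letters by category.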
Old 06-14-2012, 04:12 AM   #5
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Besides, there are languages where A and Z are not the first and last letters of the alphabet. For instance, in Danish and Norwegian the alphabet is A ... Z Æ Ø Å (yes, dictionaries are from A to Å). What would [a-z] mean in these cases if it were to be extended to non-ASCII characters?
Old 06-14-2012, 07:40 AM   #6
elibrarian
Imperfect Perfectionist
Posts: 464
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
Quote:
Originally Posted by Jellby View Post
Besides, there are languages where A and Z are not the first and last letters of the alphabet. For instance, in Danish and Norwegian the alphabet is A ... Z Æ Ø Å (yes, dictionaries are from A to Å). What would [a-z] mean in these cases if it were to be extended to non-ASCII characters?
[a-z] means exactly that, as far as I know: the small letters a through z. When I use regex to search for the full Danish alphabet, I usually use [a-zæøå] or [A-ZÆØÅ]. Which of course doesn't find any other characters, accented or not, but they would not be part of the Danish alphabet anyway ...

Regards,

Kim
Old 06-14-2012, 08:32 AM   #7
DiapDealer
Quote:
Originally Posted by elibrarian View Post
When I use regex to search for the full Danish alphabet, I usually use [a-zæøå] or [A-ZÆØÅ]. Which of course doesn't find any other characters, accented or not, but they would not be part of the Danish alphabet anyway ...
I find characters in English-language books that are not from the English alphabet all the time... does this never happen in Danish books?

Why not just use \p{L} and catch all potential Unicode letters? That's more than likely what people are intending to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.

I just know I've found that when using "letters" as search criteria in a regexp on an English-language text... thinking strictly in terms of "English letters" will often produce results I didn't really intend. The original topic of this thread is a perfect example of this. So I've learned to approach Regex Find & Replace from a "Unicode first" frame of mind when it comes to ebooks.
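To make the "Unicode first" point concrete, here is a hedged sketch in Python. Python's standard re module has no \p{L}, so the class [^\W\d_] is used as an approximation (word characters minus digits and underscore). It is close to \p{L} but not identical, since it can also let through a few numeric characters such as superscript digits. The names ASCII_LETTERS and UNICODE_LETTERS are invented for the example.

```python
import re

# What people usually type when they mean "letters".
ASCII_LETTERS = re.compile(r"[A-Za-z]+")
# Approximation of \p{L}+ in Python's re: word chars that are not digits or "_".
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")

# café façade naïve (escapes keep the accented letters precomposed)
text = "caf\u00e9 fa\u00e7ade na\u00efve"

ascii_words = ASCII_LETTERS.findall(text)
unicode_words = UNICODE_LETTERS.findall(text)

print(ascii_words)    # the accented letters split the words apart
print(unicode_words)  # three whole words
```

The ASCII class chops each word at its accented letter, which is rarely what the search was meant to do.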

Last edited by DiapDealer; 06-14-2012 at 08:36 AM.
Old 06-14-2012, 08:39 AM   #8
ElMiko
related question:

why does \b match letters that come after an apostrophe?
e.g. a search for ’\b matches the apostrophes in "there’s", "it’s", "Bob’s", etc...
Old 06-14-2012, 08:59 AM   #9
elibrarian
Quote:
Originally Posted by DiapDealer View Post
I find characters in English-language books that are not from the English alphabet all the time... does this never happen in Danish books?
Generally speaking, no, not often. In some books I find quotations from other languages (mostly French or German), but as I never let a book loose without actually proofreading it from one end to the other, that is rarely a problem.

Quote:
Originally Posted by DiapDealer View Post
Why not just use \p{L} and catch all potential Unicode letters? That's more than likely what people are intending to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.
Café can be spelled both café and cafe in Danish, facade is spelled without the cedilla, naïve is naiv, and so on, just to take your examples.

But I'll definitely try \p{L}

One learns new tricks every day

Regards,

Kim
Old 06-14-2012, 09:15 AM   #10
DiapDealer
Quote:
Originally Posted by ElMiko View Post
related question:

why does \b match letters that come after an apostrophe?
e.g. a search for ’\b matches the apostrophes in "there’s", "it’s", "Bob’s", etc...
\b doesn't really "match" any characters; more technically, its match is zero-length. It matches word boundaries, which can be:

* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.

A word character (without the (*UCP) flag) is anything \w matches, i.e. [a-zA-Z0-9_].

"There's" (for better or worse) is not one word in the eyes of regex, because an apostrophe is not a word character. "There" would be one word and "s" would be another.
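A quick way to see those boundaries is to list the positions where \b matches. The sketch below uses Python's re module purely as an illustration. The variable names are arbitrary, and the typographic apostrophe is written as \u2019.

```python
import re

word = "There's"

# \b consumes nothing, so each match is an empty string at a position.
boundaries = [m.start() for m in re.finditer(r"\b", word)]
# \w+ shows the two "words" the engine sees on either side of the apostrophe.
pieces = re.findall(r"\w+", word)

print(boundaries)  # start, before ', after ', end
print(pieces)

# Apostrophe followed by a letter: there is a boundary after the apostrophe.
inside_word = re.search("\u2019\\b", "there\u2019s")
# Apostrophe followed by a space: both neighbours are non-word characters,
# so there is no boundary and no match.
before_space = re.search("\u2019\\b", "dogs\u2019 bowls")

print(inside_word is not None, before_space is not None)
```

So ’\b only finds apostrophes that are immediately followed by a word character, which is why it lands on contractions and possessives.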

What are you wishing ’\b would find?

Last edited by DiapDealer; 06-14-2012 at 09:45 AM.
Old 06-14-2012, 12:16 PM   #11
ElMiko
Quote:
Originally Posted by DiapDealer View Post
\b doesn't really "match" any characters; more technically, its match is zero-length. It matches word boundaries, which can be:

* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.

A word character (without the (*UCP) flag) is anything \w matches, i.e. [a-zA-Z0-9_].

"There's" (for better or worse) is not one word in the eyes of regex, because an apostrophe is not a word character. "There" would be one word and "s" would be another.

What are you wishing ’\b would find?
Again, thanks for the tutorial. Why is it that when an MR poster explains something it makes complete sense, but when I try to read an official regex tutorial I actually feel my brain cells dying and my life expectancy withering?

The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’ . My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or to denote an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.

Last edited by ElMiko; 06-14-2012 at 12:18 PM.
Old 06-14-2012, 12:50 PM   #12
DiapDealer
Quote:
Originally Posted by ElMiko View Post
The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’ . My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or to denote an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.
Could be tough to differentiate possessive apostrophes or contractions from closing single quotes with any accuracy. But you might be able to narrow it down enough to feasibly inspect each occurrence.

A lot of times (but certainly not always) in a closing quote situation, the previous character is going to be punctuation of some kind. Quotes within quotes will probably foul things up, though.
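One way to try that punctuation heuristic is sketched below in Python. It is an assumption-laden illustration, not a tested recipe: the punctuation set in CANDIDATE and the function name closing_quote_candidates are invented here, and every hit would still need to be checked by eye, since nested quotes and unusual punctuation will slip through or be misflagged.

```python
import re

# A ’ (U+2019) that directly follows sentence punctuation is more likely to be
# a closing quote than a possessive or a contraction. The punctuation set is
# a guess and can be widened or narrowed.
CANDIDATE = re.compile(r"(?<=[.,!?;:])\u2019")


def closing_quote_candidates(text):
    """Return the index of every ’ worth inspecting by hand."""
    return [m.start() for m in CANDIDATE.finditer(text)]


sample = "I didn\u2019t go,\u2019 she said. The dogs\u2019 bowls."
hits = closing_quote_candidates(sample)

for i in hits:
    # Print a little context around each candidate for manual review.
    print(i, repr(sample[max(0, i - 8):i + 8]))
```

In the sample, the apostrophes in "didn’t" and "dogs’" are skipped and only the one after the comma is flagged.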

Last edited by DiapDealer; 06-14-2012 at 01:03 PM.