11-28-2014, 02:45 PM | #1 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
|
Regex Function - Split unknown word
I've just started playing with the Regex Functions and am loving them so far.
I am completely useless at this sort of thing, but I wondered whether a function could be written that identifies words not in the dictionary and checks whether each could be split into two known words.* The hyphen function seems to do something similar, so I thought I would consult the more 'function-minded' here for ideas. * Yes, I am aware of the possible pitfalls. |
11-28-2014, 03:15 PM | #2 |
null operator (he/him)
Posts: 20,625
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@Paulie_D - the spell checker will usually offer a sensible correction to misspellings caused by two words joinedtogether.
BR |
11-28-2014, 03:35 PM | #3 | |
Connoisseur
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
|
Quote:
Yes, I know, but some of my books have literally hundreds of joined words, and fixing them one by one is extremely laborious. I'm just hoping. |
|
11-28-2014, 04:09 PM | #4 | |
null operator (he/him)
Posts: 20,625
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
I've had a few books originating from PDF scans that were infested with hundreds of joined-up words. Usually they've involved a limited number of common words (often proper nouns), so LondonBridge, ofLondon, Londonstreets, leaveLondon etc. With a few simple regexes I was able to deal with 80% quite quickly. Then I used the spell checker for the remaining 20%, which might have 3 or even 4 words joined together. BR |
|
11-28-2014, 04:33 PM | #5 | |
Connoisseur
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
|
Quote:
Unfortunately, I have many that are just two proper lowercase words joined together, often starting with 'the' or 'some' or somesuch. I could cycle through a dozen regex s&r (checking each 'find' and confirming them individually) but I was hoping there might be an easier way. |
|
11-28-2014, 05:19 PM | #6 |
null operator (he/him)
Posts: 20,625
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
If you have, or have access to, the original PDF (if that's what it was), then you could rescan it using the ABBYY FineReader software - most of the aficionados seem to think it's the best of breed.
Have you looked at this ==>> Function Mode for Search & Replace in the Editor. I suspect it could be used to do what you want (maybe wrapped in an editor PI). But as it was only released last week, I would guess there's not a very large body of expertise in its usage as yet. BR |
11-28-2014, 05:56 PM | #7 | ||
Connoisseur
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
|
Quote:
Quote:
|
||
11-28-2014, 06:57 PM | #8 |
null operator (he/him)
Posts: 20,625
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
As I said the body of knowledge is sparse, it was only when I was thinking... but a regex engine can't access a dictionary... that I remembered seeing the reference to dictionaries in the Function Mode doco last week.
Good luck. BR |
11-28-2014, 09:39 PM | #9 |
creator of calibre
Posts: 43,930
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Here you go, I haven't really tested it, so you might have to adjust it a little:
Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def fix_word(m):
        word = m.group()
        # Leave words the dictionary already recognizes untouched
        if dictionaries.recognized(word):
            return word
        # Try each split point; accept the first one that yields two known words
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word

    # Decode entities, fix each word, then re-escape the text for XML
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'

Use it with the find expression >([^<]+)< |
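For anyone who wants to see what the splitting loop does before running it on a book, here is a minimal standalone sketch of the same idea. It is not the editor's API: `KNOWN_WORDS`, `is_known`, `split_joined` and `fix_text` are hypothetical names, a plain set stands in for the spelling dictionaries, and the standard `re` module replaces `regex`.

```python
import re

# Hypothetical stand-in for the editor's spelling dictionaries (assumption).
KNOWN_WORDS = {"the", "some", "of", "london", "bridge", "streets", "leave"}


def is_known(word):
    """Case-insensitive lookup in the toy word list."""
    return word.lower() in KNOWN_WORDS


def split_joined(word):
    """Return 'a b' if word is unknown but splits into two known words."""
    if is_known(word):
        return word
    # Same range as the posted function: the first part has at least
    # one character and the second part at least two.
    for i in range(1, len(word) - 1):
        a, b = word[:i], word[i:]
        if is_known(a) and is_known(b):
            return a + " " + b
    return word


def fix_text(text):
    """Apply split_joined to every word-like run in the text."""
    return re.sub(r"\b\w+\b", lambda m: split_joined(m.group()), text)


print(fix_text("leaveLondon for LondonBridge"))
```

Note that the loop takes the first split point that works, so with a large real dictionary a short leading fragment can win over the split a human would choose.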
11-29-2014, 02:57 AM | #10 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
|
Wow!
I'll have a good play with this....thank you so very much Kovid. |
11-29-2014, 05:55 AM | #11 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Hi
This kind of thing (two known words stuck together) happens quite often, most probably, as BetterRed said, as the result of a botched scan. I tried to make the above function work - I use a French dictionary with the Calibre spellchecker - but I failed (it reported that it found nothing when I had a glaring example right under my nose). I probably missed something obvious. I use Linux Mint 17 and I have some Python installed on it... Could a good soul provide a basic example of this function that we could replicate, and maybe a screenshot? Last edited by roger64; 11-29-2014 at 05:58 AM. |
12-05-2014, 08:41 AM | #12 |
Age improves with wine.
Posts: 558
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
|
I just tried it and it worked fine... although (with an English book) it split the name "Tula" into "Tu" and "la", as if it was using a French dictionary!
Incidentally, when a function like this doesn't work, is there any way to debug it? |
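One way to investigate a surprise like the "Tula" split outside the editor is to exercise the word-fixing logic against a stub. The sketch below is an assumption-laden illustration, not calibre code: `StubDictionaries` is a hypothetical class that mimics only the `recognized()` call used in the posted function, and the `min_len` and `log` parameters are additions of this sketch. Requiring each half to be at least `min_len` characters is one possible guard against two-letter fragments such as "Tu" and "la", at the cost of missing genuine short words like "of".

```python
class StubDictionaries:
    """Hypothetical stand-in exposing only recognized(), as used in the thread."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self._words


def fix_word(word, dictionaries, min_len=3, log=None):
    """Split an unknown word into two known words of at least min_len chars."""
    if dictionaries.recognized(word):
        return word
    # Both halves must be at least min_len characters long.
    for i in range(min_len, len(word) - min_len + 1):
        a, b = word[:i], word[i:]
        if dictionaries.recognized(a) and dictionaries.recognized(b):
            if log is not None:
                # Record every split so it can be reviewed afterwards.
                log.append((word, a, b))
            return a + " " + b
    return word


stub = StubDictionaries(["tu", "la", "the", "house"])
splits = []
print(fix_word("Tula", stub, log=splits))             # guarded: left alone
print(fix_word("Tula", stub, min_len=1, log=splits))  # unguarded: gets split
print(fix_word("thehouse", stub, log=splits))
print(splits)
```

Collecting the splits in a list makes it easy to review what would change before applying anything to the book.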
12-05-2014, 08:58 AM | #13 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
|
12-05-2014, 11:51 AM | #14 |
creator of calibre
Posts: 43,930
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
12-05-2014, 05:19 PM | #15 |
null operator (he/him)
Posts: 20,625
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
|
|