|
|
#1 |
|
Connoisseur
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
|
Regex Function - Split unknown word
I've just started playing with the Regex Functions and am loving them so far.
I am completely useless at this sort of thing, but I wondered whether a function could be written that identifies words not in the dictionary and checks whether each could be split into two known words.* The hyphen function seems to do something similar, so I thought I would consult the more 'function-minded' here for ideas.
* Yes, I am aware of the possible pitfalls.
|
|
|
|
|
#2 |
|
null operator (he/him)
Posts: 22,024
Karma: 30277294
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@Paulie_D - the spell checker will usually offer a sensible correction to misspellings caused by two words joinedtogether.
BR |
|
|
|
|
|
|
|
#3 | |
|
Connoisseur
|
Yes, I know, but some of my books have literally hundreds of joined words, and fixing them one by one is extremely laborious. I'm just hoping.
|
|
|
|
|
|
|
#4 | |
|
null operator (he/him)
|
I've had a few books originating from PDF scans that were infested with hundreds of joined-up words. Usually they involved a limited number of common words (often proper nouns), for example LondonBridge, ofLondon, Londonstreets, leaveLondon and so on. With a few simple regexes I was able to deal with 80% of them quite quickly. I then used the spell checker for the remaining 20%, which might have three or even four words joined together.
BR
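For anyone who wants to try the "few simple regexes" approach described above, here is a minimal sketch in plain Python. The helper name `split_around` is made up for illustration and is not part of calibre. It assumes ASCII letters only, and all it does is add a space wherever a chosen word is glued to a neighbouring letter.

```python
import re

def split_around(word, text):
    """Insert a space wherever `word` is glued to an adjacent ASCII letter."""
    w = re.escape(word)
    # A letter directly before the word: "ofLondon" -> "of London"
    text = re.sub(r'(?<=[A-Za-z])(?=%s)' % w, ' ', text)
    # A letter directly after the word: "LondonBridge" -> "London Bridge"
    text = re.sub(r'(?<=%s)(?=[A-Za-z])' % w, ' ', text)
    return text

print(split_around('London', 'ofLondon, LondonBridge, Londonstreets'))
```

Note that this would also turn a legitimate word such as "Londoner" into "London er", so it is probably safest to step through the matches rather than use Replace All.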
|
|
|
|
|
|
#5 | |
|
Connoisseur
|
Unfortunately, I have many that are just two ordinary lowercase words joined together, often starting with 'the' or 'some' or suchlike. I could cycle through a dozen regex search-and-replace passes (checking each 'find' and confirming it individually), but I was hoping there might be an easier way.
|
|
|
|
|
|
|
|
#6 |
|
null operator (he/him)
|
If you have, or have access to, the original PDF (if that's what it was), then you could rescan it using the ABBYY FineReader software; most of the aficionados seem to think it's the best of breed.
Have you looked at Function Mode for Search & Replace in the Editor? I suspect it could be used to do what you want (maybe wrapped in an editor plugin). But as it was only released last week, I would guess there isn't a very large body of expertise in its usage as yet.
BR
|
|
|
|
|
#7 | ||
|
Connoisseur
|
|
||
|
|
|
|
|
#8 |
|
null operator (he/him)
|
As I said, the body of knowledge is sparse. It was only when I was thinking "but a regex engine can't access a dictionary" that I remembered seeing the reference to dictionaries in the Function Mode documentation last week.
Good luck.
BR
|
|
|
|
|
#9 |
|
creator of calibre
Posts: 45,609
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Here you go, I haven't really tested it, so you might have to adjust it a little:
Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def fix_word(m):
        word = m.group()
        # Leave words the spelling dictionary already recognizes
        if dictionaries.recognized(word):
            return word
        # Try each split point and use the first that yields two known words
        for i in range(1, len(word) - 1):  # range rather than xrange, for Python 3
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word

    text = replace_entities(match.group(1))  # decode HTML entities such as &amp;
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)  # escape the text again for XML
    return '>' + text + '<'

Use it with the find expression >([^<]+)<
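To try the same splitting logic outside the editor, a rough standalone sketch follows. Everything calibre-specific is replaced by stand-ins, which are assumptions for illustration only: `KNOWN_WORDS` is a tiny made-up word list in place of the spelling dictionaries, the standard `re` module replaces `regex`, and `html.unescape` / `html.escape` stand in for `replace_entities` / `prepare_string_for_xml`.

```python
import re
from html import escape, unescape

# Hypothetical word list standing in for calibre's spelling dictionaries
KNOWN_WORDS = {'the', 'some', 'cat', 'day', 'sat', 'on', 'mat'}

def recognized(word):
    return word.lower() in KNOWN_WORDS

def fix_word(m):
    word = m.group()
    if recognized(word):
        return word
    # Same loop as the function above: first split into two known words wins
    for i in range(1, len(word) - 1):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

def replace(match):
    text = unescape(match.group(1))            # decode entities
    text = re.sub(r'\b\w+\b', fix_word, text)  # fix each word in turn
    return '>' + escape(text, quote=False) + '<'

def fix_html(html):
    # Same find expression as above: the text between a '>' and the next '<'
    return re.sub(r'>([^<]+)<', replace, html)

print(fix_html('<p>thecat sat on themat someday</p>'))
```

With the stand-in word list, the sample paragraph should come back with "thecat", "themat" and "someday" each split in two. Real results depend entirely on which dictionaries are active in the editor.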
|
|
|
|
|
#10 |
|
Connoisseur
|
Wow!
I'll have a good play with this. Thank you so very much, Kovid.
|
|
|
|
|
#11 |
|
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Hi
This kind of thing (two known words stuck together) happens quite often, most probably, as BetterRed said, as the result of a botched scan. I tried to get the above function to work (I use a French dictionary with the calibre spell checker) but I failed: it reported that it found nothing, even though I had a glaring example right under my nose. I probably missed something obvious. I use Linux Mint 17, and Python is installed on it. Could some kind soul provide a basic example of this function that we could replicate, and maybe a screenshot?
Last edited by roger64; 11-29-2014 at 06:58 AM.
|
|
|
|
|
#12 |
|
Age improves with wine.
Posts: 596
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
|
I just tried it and it worked fine, although (with an English book) it split the name "Tula" into "Tu" and "la", as if it were using a French dictionary!
Incidentally, when a function like this doesn't work, is there any way to debug it?
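On both points, a hedged sketch: the calibre manual's Function Mode section describes debugging with the ordinary Python `print()` function, with the output shown in a popup once the find/replace has finished. The helper below is hypothetical (the names `split_joined`, `min_len` and `trace` are mine, not calibre's). It adds a minimum fragment length, which should stop very short splits such as "Tu" + "la", and can print every split it tries.

```python
def split_joined(word, recognized, min_len=3, trace=False):
    """Split `word` into two recognized parts of at least `min_len` letters."""
    if recognized(word):
        return word
    for i in range(min_len, len(word) - min_len + 1):
        a, b = word[:i], word[i:]
        ok = recognized(a) and recognized(b)
        if trace:
            # Show each attempted split so a missed word can be diagnosed
            print('%s -> %r + %r: %s' % (word, a, b, 'split' if ok else 'skip'))
        if ok:
            return a + ' ' + b
    return word

# Stand-in recognizer; inside the editor this would be dictionaries.recognized
words = {'tu', 'la', 'the', 'cat'}
rec = lambda w: w.lower() in words

print(split_joined('Tula', rec))            # too short to split with min_len=3
print(split_joined('thecat', rec, trace=True))
```

The trade-off is that genuine joins involving short words ("ofLondon", "tobe") are no longer fixed, so the right `min_len` probably depends on the book.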
|
|
|
|
|