Regex Function - Split unknown word

Paulie_D · 11-28-2014, 03:45 PM

I've just started playing with the Regex Functions and am loving them so far.

I am completely useless at this sort of thing but wondered whether a function could be written that could identify words not in the dictionary and check to see if they could be split into two known words.*

The hyphen function seems to do something similar so I thought I would consult the more 'function-minded' here for ideas.

* Yes I am aware of the possible pitfalls.

BetterRed · 11-28-2014, 04:15 PM

@Paulie_D - the spell checker will usually offer a sensible correction to misspellings caused by two words joinedtogether.

BR

Paulie_D · 11-28-2014, 04:35 PM

Quote:

Originally Posted by BetterRed

@Paulie_D - the spell checker will usually offer a sensible correction to misspellings caused by two words joinedtogether.

I see what you did there.

Yes, I know, but some of my books have literally hundreds of joined words, and fixing them one by one is extremely laborious.

I'm just hoping.

BetterRed · 11-28-2014, 05:09 PM

Quote:

Originally Posted by Paulie_D

I see what you did there.

Yes, I know, but some of my books have literally hundreds of joined words, and fixing them one by one is extremely laborious.

I'm just hoping.

I suspect your epub originated from a scanned document or PDF. A search of the Workshop forum might yield something.

I've had a few books originating from PDF scans that were infested with hundreds of joined-up words. Usually they've involved a limited number of common words (often proper nouns), e.g. LondonBridge, ofLondon, Londonstreets, leaveLondon, etc. With a few simple regexes I was able to deal with 80% quite quickly. I then used the spell checker for the remaining 20%, which might have three or even four words joined together.

BR

Paulie_D · 11-28-2014, 05:33 PM

Quote:

Originally Posted by BetterRed

Usually they've involved a limited number of common words (often proper nouns), e.g. LondonBridge, ofLondon, Londonstreets, leaveLondon, etc. With a few simple regexes I was able to deal with 80% quite quickly. I then used the spell checker for the remaining 20%, which might have three or even four words joined together.

BR

Yes... I can handle most of those with a simple regex search for an [A-Z] immediately after an [a-z].

Unfortunately, I have many that are just two proper lowercase words joined together, often starting with 'the' or 'some' or somesuch.

I could cycle through a dozen regex s&r (checking each 'find' and confirming them individually) but I was hoping there might be an easier way.
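For readers following along, the lowercase-then-uppercase case mentioned above can be sketched in plain Python. This is only an illustration of the idea, not something posted in the thread; the names `CAMEL_JOIN` and `split_camel_joins` are made up for the example.

```python
import re

# Zero-width match at every position where a lowercase ASCII letter is
# immediately followed by an uppercase one, e.g. between "f" and "L" in "ofLondon".
CAMEL_JOIN = re.compile(r'(?<=[a-z])(?=[A-Z])')

def split_camel_joins(text):
    """Insert a space at each lowercase/uppercase boundary."""
    return CAMEL_JOIN.sub(' ', text)

print(split_camel_joins('leaveLondon by LondonBridge'))
```

Note that this would also split legitimate names such as "McDonald" into "Mc Donald", and it does nothing for two lowercase words joined together, which is the harder case described above.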

BetterRed · 11-28-2014, 06:19 PM

If you have, or have access to, the original PDF (if that's what it was), then you could rescan it using the ABBYY FineReader software; most of the aficionados seem to think it's the best of breed.

Have you looked at this ==>> Function Mode for Search & Replace in the Editor. I suspect it could be used to do what you want (maybe wrapped in an editor PI). But as it was only released last week, I would guess there's not a very large body of expertise in its usage as yet.

BR

Paulie_D · 11-28-2014, 06:56 PM

Quote:

Originally Posted by BetterRed

Have you looked at this ==>> Function Mode for Search & Replace in the Editor. I suspect it could be used to do what you want (maybe wrapped in an editor PI). But as it was only released last week, I would guess there's not a very large body of expertise in its usage as yet.

BR

Which is what I asked for in the first place....

Quote:

Originally Posted by Paulie_D

I .... wondered whether a function could be written that could identify words not in the dictionary and check to see if they could be split into two known words.*

The hyphen function seems to do something similar so I thought I would consult the more 'function-minded' here for ideas.

BetterRed · 11-28-2014, 07:57 PM

Quote:

Originally Posted by Paulie_D

Which is what I asked for in the first place....

As I said, the body of knowledge is sparse. It was only when I was thinking "but a regex engine can't access a dictionary" that I remembered seeing the reference to dictionaries in the Function Mode doco last week.

Good luck.

BR

kovidgoyal · 11-28-2014, 10:39 PM

Here you go, I haven't really tested it, so you might have to adjust it a little:

Code:

import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def fix_word(m):
        word = m.group()
        # Leave words the spell-check dictionaries already know
        if dictionaries.recognized(word):
            return word
        # Try each split point; the second half is always at least two characters
        for i in range(1, len(word) - 1):  # range works on Python 2 and 3
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word
    # Work on plain text: decode entities, fix words, then re-escape for XML
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'

Use it with the find expression

>([^<]+)<
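To see what the function and find expression above do without opening the editor, the same logic can be tried in plain Python. This is a rough sketch under stated assumptions: `FakeDictionaries` is a hypothetical stand-in for the `dictionaries` object calibre passes in (only its `recognized()` method is imitated), the word list is invented, and entity handling is left out.

```python
import re

class FakeDictionaries:
    """Hypothetical stand-in for calibre's `dictionaries` argument."""
    def __init__(self, words):
        self.words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self.words

def split_joined_words(text, dictionaries):
    def fix_word(m):
        word = m.group()
        if dictionaries.recognized(word):
            return word
        # Same loop as the function above: the first split that works wins
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word
    return re.sub(r'\b\w+\b', fix_word, text)

def process_markup(html, dictionaries):
    # Same find expression as above: only text between '>' and '<' is touched,
    # so tag names and attribute values are left alone.
    return re.sub(r'>([^<]+)<',
                  lambda m: '>' + split_joined_words(m.group(1), dictionaries) + '<',
                  html)

known = FakeDictionaries(['the', 'cat', 'sat', 'on', 'mat'])
print(process_markup('<p class="thecat">thecat sat on themat</p>', known))
```

The class attribute `thecat` is not changed because it sits inside the tag, while the two joined words in the paragraph text are split.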

Paulie_D · 11-29-2014, 03:57 AM

Wow!

I'll have a good play with this....thank you so very much Kovid.

roger64 · 11-29-2014, 06:55 AM

Hi

This kind of thing (two known words stuck together) happens quite often, most probably, as BetterRed said, as the result of a botched scan.

I tried to make the above function work (I use a French dictionary with the Calibre spellchecker) but I failed: it reported that it found nothing when I had a glaring example right under my nose.

I probably missed something obvious. I use Linux Mint 17, which has Python installed...

Could some kind soul provide a basic example of this function that we could replicate, and maybe a screenshot?

Phssthpok · 12-05-2014, 09:41 AM

I just tried it and it worked fine... although (with an English book) it split the name "Tula" into "Tu" and "la", as if it was using a French dictionary!

Incidentally, when a function like this doesn't work, is there any way to debug it?
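One possible way to reduce splits like "Tu la" is to require both halves to have a minimum length. The sketch below is an assumption on my part, not part of the function posted above: `split_with_min_length`, `min_len` and the tiny word set are all hypothetical, and `recognized` stands for any callable that says whether a word is known.

```python
def split_with_min_length(word, recognized, min_len=3):
    """Split `word` into two known parts of at least `min_len` characters each,
    or return it unchanged if no such split exists."""
    if recognized(word):
        return word
    for i in range(min_len, len(word) - min_len + 1):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

# Pretend the dictionary accepts "tu" and "la", as it apparently did above
known = {'tu', 'la', 'the', 'cat'}

def recognized(word):
    return word.lower() in known

print(split_with_min_length('Tula', recognized))    # left as 'Tula'
print(split_with_min_length('thecat', recognized))  # 'the cat'
```

The trade-off is that joins involving short words ("of", "to", "a") would no longer be caught, so a smaller `min_len` or a list of allowed short words might be needed in practice.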

roger64 · 12-05-2014, 09:58 AM

Quote:

Originally Posted by Phssthpok

I just tried it and it worked fine.../...

I created a new function using the text above and tried it with the 'find' expression

>([^<]+)<

and it found nothing.

Please, could you tell me exactly what you wrote in the 'Find' field?

kovidgoyal · 12-05-2014, 12:51 PM

http://manual.calibre-ebook.com/func...your-functions
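As I understand the manual section linked above, functions can be debugged with Python's standard print(), with the output shown in a popup once the Find/Replace run has finished. A minimal sketch of a function that only logs its matches and changes nothing (the last lines exercise it by hand outside calibre, with placeholder arguments and an invented file name):

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Report what the find expression captured, then return the match unchanged
    print('match %d in %s: %r' % (number, file_name, match.group(1)))
    return match.group()

# Outside calibre: call the function directly with placeholder arguments
m = re.search(r'>([^<]+)<', '<p>thecat</p>')
result = replace(m, 1, 'chapter1.html', None, None, {}, {})
```

If nothing at all is printed when this runs in the editor, the find expression never matched, which would point at the search settings rather than at the function itself.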

BetterRed · 12-05-2014, 06:19 PM

Quote:

Originally Posted by Phssthpok

I just tried it and it worked fine... although (with an English book) it split the name "Tula" into "Tu" and "la", as if it was using a French dictionary!

"Tu la" sounds more like Singaporean Latin to me

BR




