Old 11-28-2014, 02:45 PM   #1
Paulie_D
Connoisseur
Paulie_D began at the beginning.
 
Posts: 67
Karma: 10
Join Date: Apr 2011
Device: Kindle 3, Samsung Tab 4
Regex Function - Split unknown word

I've just started playing with the Regex Functions and am loving them so far.

I am completely useless at this sort of thing but wondered whether a function could be written that could identify words not in the dictionary and check to see if they could be split into two known words.*

The hyphen function seems to do something similar so I thought I would consult the more 'function-minded' here for ideas.

* Yes I am aware of the possible pitfalls.
Old 11-28-2014, 03:15 PM   #2
BetterRed
null operator (he/him)
BetterRed ought to be getting tired of karma fortunes by now.
 
Posts: 20,625
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@Paulie_D - the spell checker will usually offer a sensible correction to misspellings caused by two words joinedtogether.

BR
Attached image: Capture.JPG
Old 11-28-2014, 03:35 PM   #3
Paulie_D
Quote:
Originally Posted by BetterRed View Post
@Paulie_D - the spell checker will usually offer a sensible correction to misspellings caused by two words joinedtogether.
I see what you did there.

Yes, I know, but some of my books have literally hundreds of joined words, and fixing them one by one is extremely laborious.

I'm just hoping.
Old 11-28-2014, 04:09 PM   #4
BetterRed
Quote:
Originally Posted by Paulie_D View Post
I see what you did there.

Yes, I know, but some of my books have literally hundreds of joined words, and fixing them one by one is extremely laborious.

I'm just hoping.
I suspect your epub originated from a scanned document or PDF. A search of the Workshop forum might yield something.

I've had a few books originating from PDF scans that were infested with hundreds of joined-up words. Usually they've involved a limited number of common words (often proper nouns), so LondonBridge, ofLondon, Londonstreets, leaveLondon etc. With a few simple regexes I was able to deal with 80% quite quickly. I then used the spell checker for the remaining 20%, which might have 3 or even 4 words joined together.

BR
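
For a handful of known words like these, a minimal sketch of the kind of regex involved might look like the following. It is plain Python, run outside calibre, and the helper name split_around is made up for illustration. It blindly inserts a space wherever the word touches another letter, so it would also break "Londoner" into "London er"; each hit still needs checking.

```python
import re

def split_around(word, text):
    """Insert a space wherever `word` is glued to a letter on either side."""
    w = re.escape(word)
    # First alternative: a letter sits directly before the word.
    # Second alternative: a letter sits directly after the word.
    pattern = rf'(?<=[A-Za-z])(?={w})|(?<={w})(?=[A-Za-z])'
    return re.sub(pattern, ' ', text)

print(split_around('London', 'LondonBridge, ofLondon, Londonstreets, leaveLondon'))
```

The same two lookaround alternatives could be typed straight into a regex search, one word at a time.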
Old 11-28-2014, 04:33 PM   #5
Paulie_D
Quote:
Originally Posted by BetterRed View Post
Usually they've involved a limited number of common words (often proper nouns), so LondonBridge, ofLondon, Londonstreets, leaveLondon etc. With a few simple regexes I was able to deal with 80% quite quickly. I then used the spell checker for the remaining 20%, which might have 3 or even 4 words joined together.

BR
Yes... I can handle most of the simple cases with a regex search for an [A-Z] immediately after an [a-z].

Unfortunately, I have many that are just two proper lowercase words joined together, often starting with 'the' or 'some' or somesuch.

I could cycle through a dozen regex s&r (checking each 'find' and confirming them individually) but I was hoping there might be an easier way.
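
For reference, the "[A-Z] immediately after an [a-z]" case can be sketched in a couple of lines of plain Python (outside calibre; the function name is hypothetical). It only catches joins that happen to fall on a case boundary, which is exactly why the all-lowercase joins slip through, and it would also split legitimate names such as "McDonald".

```python
import re

def split_case_boundary(text):
    """Insert a space between a lowercase ASCII letter and a following uppercase one."""
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)

print(split_case_boundary('leaveLondon and thecat'))
```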
Old 11-28-2014, 05:19 PM   #6
BetterRed
If you have, or have access to, the original PDF (if that's what it was) then you could redo the OCR using the ABBYY FineReader software - most of the aficionados seem to think it's the best of breed.

Have you looked at this ==>> Function Mode for Search & Replace in the Editor. I suspect it could be used to do what you want (maybe wrapped in an editor plugin). But as it was only released last week, I would guess there's not a very large body of expertise in its usage as yet.

BR
Old 11-28-2014, 05:56 PM   #7
Paulie_D
Quote:
Originally Posted by BetterRed View Post
Have you looked at this ==>> Function Mode for Search & Replace in the Editor. I suspect it could be used to do what you want (maybe wrapped in an editor plugin). But as it was only released last week, I would guess there's not a very large body of expertise in its usage as yet.

BR
Which is what I asked for in the first place....

Quote:
Originally Posted by Paulie_D View Post
I .... wondered whether a function could be written that could identify words not in the dictionary and check to see if they could be split into two known words.*

The hyphen function seems to do something similar so I thought I would consult the more 'function-minded' here for ideas.
Old 11-28-2014, 06:57 PM   #8
BetterRed
Quote:
Originally Posted by Paulie_D View Post
Which is what I asked for in the first place....
As I said, the body of knowledge is sparse. It was only when I was thinking "but a regex engine can't access a dictionary" that I remembered seeing the reference to dictionaries in the Function Mode documentation last week.

Good luck.

BR
Old 11-28-2014, 09:39 PM   #9
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,930
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Here you go, I haven't really tested it, so you might have to adjust it a little:

Code:
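import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def fix_word(m):
        word = m.group()
        # Words the spell-check dictionary already knows are left alone
        if dictionaries.recognized(word):
            return word
        # Try each split point in turn; range() replaces the Python 2-only xrange().
        # With this upper bound the second part always keeps at least two letters.
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        # No split into two known words was found
        return word
    # Decode entities, fix each word, then re-escape the text for XML
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'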

Use it with the find expression

>([^<]+)<
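
To see what the splitting logic in the function above does without opening calibre, here is a rough standalone sketch. The word set and the recognized() helper are stand-ins I made up for calibre's dictionaries.recognized(), and entity handling is left out, so treat it as an illustration of the idea rather than the function itself.

```python
import re

# Hypothetical toy dictionary standing in for calibre's spell-check dictionaries
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "some", "thing"}

def recognized(word):
    """Stand-in for dictionaries.recognized(): case-insensitive set lookup."""
    return word.lower() in KNOWN_WORDS

def split_joined(word):
    """Return the word unchanged if known, else the first split into two known words."""
    if recognized(word):
        return word
    for i in range(1, len(word) - 1):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

def fix_text(text):
    """Apply split_joined to every word in a run of text."""
    return re.sub(r'\b\w+\b', lambda m: split_joined(m.group()), text)

print(fix_text("thecat sat onthe mat"))
print(fix_text("something"))
```

The second call shows one of the pitfalls: "something" is not in the toy dictionary, so it gets split into "some thing" even though it was a perfectly good word.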
Old 11-29-2014, 02:57 AM   #10
Paulie_D
Wow!

I'll have a good play with this....thank you so very much Kovid.
Old 11-29-2014, 05:55 AM   #11
roger64
Wizard
roger64 ought to be getting tired of karma fortunes by now.
 
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Hi

This kind of thing (two known words stuck together) happens quite often, most probably, as BetterRed said, as the result of a botched scan.

I tried to get the above function to work (I use a French dictionary with the calibre spell checker) but I failed: it reported that it found nothing when I had a glaring example right under my nose.

I probably missed something obvious. I use Linux Mint 17 and I have some Python inside it...

Could a good soul provide a basic example of this function that we could replicate and maybe a screenshot?

Last edited by roger64; 11-29-2014 at 05:58 AM.
Old 12-05-2014, 08:41 AM   #12
Phssthpok
Age improves with wine.
Phssthpok knows how to set a laser printer to stun.
 
Posts: 558
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
I just tried it and it worked fine... although (with an English book) it split the name "Tula" into "Tu" and "la", as if it was using a French dictionary!

Incidentally, when a function like this doesn't work, is there any way to debug it?
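
One possible way to cut down on splits like "Tu" + "la" is to require both halves to have a minimum length. This is only a sketch of that idea in plain Python: the min_len parameter and the recognized callable are my own assumptions and not part of the function posted earlier, and a minimum length will of course also stop genuine joins with short words such as "of" from being fixed.

```python
def split_joined(word, recognized, min_len=3):
    """Split an unknown word into two known words, each at least min_len letters long."""
    if recognized(word):
        return word
    # Only consider split points that leave min_len letters on both sides
    for i in range(min_len, len(word) - min_len + 1):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

# Hypothetical toy dictionary for trying it out
toy_words = {"tu", "la", "some", "thing"}
known = lambda w: w.lower() in toy_words

print(split_joined("Tula", known))
print(split_joined("something", known))
```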
Old 12-05-2014, 08:58 AM   #13
roger64
Quote:
Originally Posted by Phssthpok View Post
I just tried it and it worked fine.../...
I created a new function using the text above, and I tried with the 'find' expression
Quote:
>([^<]+)<
, and it found nothing.

Please, could you tell me exactly what you wrote in the 'Find' field?
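
For anyone wanting to see what that find expression is supposed to pick up, here is a small check in plain Python, outside calibre, on a made-up snippet of markup. It captures each run of text sitting between a ">" and the next "<", which is what the function then receives as group 1. If the same pattern finds nothing in the editor, it may be worth double-checking the search mode and scope, though that is only a guess on my part.

```python
import re

# Made-up sample markup, just for illustration
sample = '<p>thecat sat <i>onthe</i> mat</p>'

# Each match is one run of text between two tags
text_runs = re.findall(r'>([^<]+)<', sample)
print(text_runs)
```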
Old 12-05-2014, 11:51 AM   #14
kovidgoyal
http://manual.calibre-ebook.com/func...your-functions
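
As far as I recall, that section of the manual suggests ordinary print() calls inside the function, with the output shown once the find/replace has finished. Treat that as my assumption and check the linked page. A minimal sketch of a function that only reports what it was given and changes nothing:

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Report each match so it is visible what the find expression actually caught
    print('match %s in %s: %r' % (number, file_name, match.group()))
    # Return the matched text unchanged so the book is not modified
    return match.group()

# Trying it outside calibre with a hand-made match object
m = re.search(r'>([^<]+)<', '<p>thecat</p>')
result = replace(m, 1, 'chapter1.html', None, None, None, None)
```

If nothing is printed at all, the find expression never matched, which narrows things down to the search settings and not the function body.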
Old 12-05-2014, 05:19 PM   #15
BetterRed
Quote:
Originally Posted by Phssthpok View Post
I just tried it and it worked fine... although (with an English book) it split the name "Tula" into "Tu" and "la", as if it was using a French dictionary!
"Tu la" sounds more like Singaporean Latin to me

BR