Old 09-19-2016, 03:29 PM   #1
roger64
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
adjusting a function

Hi

Some months ago, you gave us a nice function for splitting words that had been "glued" together. After a mistake of mine, I had the opportunity to use this function on a lot of words in a French EPUB. I had, of course, installed a French dictionary. Please read on...
The results were amazingly good and quick.

Spoiler:

Code:
>([^<]+)<
Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def fix_word(m):
        word = m.group()
        if dictionaries.recognized(word):
            return word
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'


Running the Calibre Editor spell checker before and after using this function, I could see that the number of words unknown to the dictionary went down from 1167 to 261. Probably two thirds of the remaining ones were proper nouns ("noms propres"), but I nevertheless noticed that a few glued words had not been split (probably 50 to 70).

The cause is related to elided forms. Here are some examples. They all share the same pattern: a word followed by one letter and a curly apostrophe, these last two elements being characteristic of elided forms in French.

accompagnentn’auront
àl’origine
dansl’entrée
dem’expliquer
des’opposer
Etj’écrasai
ils’attendait
manueld’algèbre

What makes me hope that the function could be improved to handle elided forms is that, for all of them, the first suggestion of the dictionary in the Calibre editor is to split them correctly.
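
A minimal sketch of why the function above probably misses these words. It assumes a tiny stand-in word list: `KNOWN` and `recognized` are hypothetical replacements for calibre's `dictionaries.recognized`, and the standard `re` module stands in for `regex`. The `\b\w+\b` pattern stops at the apostrophe, so the function only sees `dansl` and `entrée` separately, and the loop never tries a second half that is a single letter.

Code:
```python
import re

# Hypothetical stand-in for calibre's dictionaries.recognized()
KNOWN = {"dans", "la", "l", "entrée"}

def recognized(word):
    return word.lower() in KNOWN

def fix_word(m):
    word = m.group()
    if recognized(word):
        return word
    # Same bounds as the function above: the second half always has >= 2 letters
    for i in range(1, len(word) - 1):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

def fix_text(text):
    # The curly apostrophe is not a word character, so it ends the token
    return re.sub(r'\b\w+\b', fix_word, text)

print(re.findall(r'\b\w+\b', "dansl’entrée"))  # ['dansl', 'entrée']
print(fix_text("dansla"))                      # dans la
print(fix_text("dansl’entrée"))                # dansl’entrée (unchanged)
```

Even with `l` in the word list, `dansl` is left alone, because the split `dans` + `l` is outside the loop's range.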

Last edited by roger64; 09-19-2016 at 03:39 PM.
Old 09-20-2016, 04:11 AM   #2
kovidgoyal
creator of calibre
Posts: 45,251
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@roger64: Sorry, I'm a little swamped at the moment, so I don't have time to look at this. Hopefully someone else will be able to help you.
Old 09-20-2016, 05:52 AM   #3
roger64
Wizard
Thanks for your reply. I hope so too.
Old 09-20-2016, 01:40 PM   #4
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
@roger64 - can you link to the post that had the function?

My search didn't turn up anything.
Old 09-20-2016, 03:19 PM   #5
roger64
Wizard
Here is the thread.
https://www.mobileread.com/forums/sho...d.php?t=251941
Old 09-21-2016, 03:33 AM   #6
roger64
Wizard
Hi

Here is a new version of this function that takes elided forms into account (at least for French), thanks to Olivier, the author of the open-source Grammalecte.

Spoiler:

Code:
import regex
from calibre import replace_entities, prepare_string_for_xml
 
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def fix_word(m):
        word = m.group()
        if dictionaries.recognized(word):
            return word
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        m = regex.match(r"(\w+)((?:[dlnmts]|qu(?:oi|el)qu|puisqu|lorsqu|jusqu|qu)[’'`]\w+)", word)
        if m:
            return m.group(1) + " " + m.group(2)
        return word
    text = replace_entities(match.group(1))
    text = regex.sub(r"\b\w(?:[\w’'`-]*\w|\w+)\b", fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'

or here: http://pastebin.com/quGQQzcN
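
A quick way to check the elision fallback against the sample words from the first post, outside calibre. This is only a sketch: `split_elided` is a hypothetical helper, the standard `re` module is used in place of `regex`, and the dictionary loop is left out, so it shows what the regular expression does on its own.

Code:
```python
import re

# Pattern copied from the function above
ELISION = re.compile(
    r"(\w+)((?:[dlnmts]|qu(?:oi|el)qu|puisqu|lorsqu|jusqu|qu)[’'`]\w+)"
)

def split_elided(word):
    # Hypothetical helper: applies only the regex fallback, no dictionary
    m = ELISION.match(word)
    if m:
        return m.group(1) + " " + m.group(2)
    return word

samples = [
    "accompagnentn’auront", "àl’origine", "dansl’entrée", "dem’expliquer",
    "des’opposer", "Etj’écrasai", "ils’attendait", "manueld’algèbre",
]
for word in samples:
    print(word, "->", split_elided(word))
```

As far as I can tell, every sample is split except `Etj’écrasai`: `j` is not in the `[dlnmts]` class (neither is `c`), so in the full function that word would only be split if the dictionary loop recognizes both `Et` and `j’écrasai`.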

Last edited by roger64; 09-21-2016 at 06:10 AM. Reason: pastebin
Old 09-21-2016, 11:34 AM   #7
phossler
Wizard
Thanks.

Small issue:

If there's a <style> element in the HTML file, [Replace All] will process its contents as text and generate errors.

For example, I had a jacket.xhtml file in an EPUB, and the regular expression split some CSS keywords (these are not typos or misspellings), which generated errors:


Spoiler:

<style type="text/css">
.cbj_banner {
background: #eee;
col or: black;
border: thin solid black;
margin: 1 em;
padding: 1 em;
}
table.cbj_header td.cbj_title {
font-size: 1.5 em;
font-style: italic;
text-align: c enter;
}
table.cbj_header td.cbj_series {
text-align: c enter;
}
table.cbj_header td.cbj_author {
text-align: c enter;
}
table.cbj_header td.cbj_pubdata {
text-align: c enter;
}
table.cbj_header {
width: 100%;
}
table.cbj_header td.cbj_label {
text-align: right;
width: 33%;
}
table.cbj_header td.cbj_content {
text-align: left;
width: 67%;
}
hr.metadata_divider {
width: 90%;
margin-left: 5%;
border-top: solid white 0;
border-right: solid white 0;
border-bottom: solid black 1px;
border-left: solid white 0;
}
hr {
border-top: 0 solid white;
border-right: 0 solid white;
border-bottom: 2px solid black;
border-left: 0 solid white;
margin-left: 10%;
width: 80%;
}
.cbj_footer {
font-size: 0.8 em;
margin-top: 8px;
text-align: c enter;
}
</style>


Is there any way to make the function a little smarter?

I can always regenerate the jacket.xhtml file, but any other files that contain <style> would probably be changed too.
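
One possible approach, sketched outside calibre and untested in the editor itself: extend the search expression so that a whole <style> element is matched first, and return it unchanged. The names `SEARCH`, `fix_text` and `process` are hypothetical. `fix_text` stands in for the word-splitting logic and only upper-cases the text, to make visible what was touched, and the standard `re` module is used in place of `regex`.

Code:
```python
import re

# First alternative: a complete <style> element (group 1).
# Second alternative: the thread's original >text< pattern (now group 2).
SEARCH = re.compile(r'(?is)(<style\b[^>]*>.*?</style>)|>([^<]+)<')

def fix_text(text):
    # Stand-in for the real word-splitting code
    return text.upper()

def replace(match):
    if match.group(1) is not None:
        # Matched a <style> element: hand it back untouched
        return match.group(0)
    return '>' + fix_text(match.group(2)) + '<'

def process(html):
    return SEARCH.sub(replace, html)

html = '<style type="text/css">\n.x { color: black; }\n</style><p>bonjour</p>'
print(process(html))
```

In the calibre function, the same idea would mean using the extended search expression and adding the `match.group(1)` check at the top of `replace`; the text is then in group 2 instead of group 1. Text that directly follows `</style>` before the next tag is skipped by this pattern, which in practice is usually just whitespace.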

Last edited by phossler; 09-21-2016 at 11:45 AM.
All times are GMT -4.