adjusting a function

roger64 · 09-19-2016, 03:29 PM

Hi

Some months ago, you gave us a nice function that splits words "glued" together. After a mistake of mine, I had the opportunity to use this function on a lot of words in a French EPUB. I have of course installed a French dictionary. Please read on...

The results were amazingly good and quick.

Spoiler:

Using the Calibre Editor spell checker before and after running this function, I could see that the number of words unknown to the dictionary went down from 1167 to 261. Even allowing for the fact that probably 2/3 of the remaining ones were "noms propres" (proper nouns), I realized that a few words had not been split (probably 50 to 70).

The cause was related to a kind of elided form. Here are some of them. One can easily discern the same pattern: a word followed by one letter and one curved apostrophe (highlighted in red in the original post); these last two elements are characteristic of elided forms in French.

accompagnentn’auront
àl’origine
dansl’entrée
dem’expliquer
des’opposer
Etj’écrasai
ils’attendait
manueld’algèbre

What makes me hope that the function could be improved so as to take care of elided forms is that for all of them, the first suggestion of the dictionary of the Calibre editor is to split them correctly.
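The pattern described above (a known word, then one elision letter, then an apostrophe) can be sketched in plain Python. This is only an illustration under stated assumptions, not the revised function posted later in the thread: it uses the standard `re` module instead of `regex`, the names `split_glued`, `ELISION_LETTERS`, and `KNOWN` are made up for the example, and `recognized` is a hypothetical stand-in for calibre's `dictionaries.recognized`.

```python
import re

# Single letters that can precede an apostrophe in French elisions (l', d', j', ...).
ELISION_LETTERS = "cdjlmnst"

def split_glued(text, recognized):
    """Split glued words; `recognized` is a stand-in for dictionaries.recognized."""

    def fix_word(m):
        word, apostrophe = m.group(1), m.group(2) or ""
        if apostrophe:
            # "dansl" + apostrophe: peel the elided letter off if the rest is a known word.
            head, letter = word[:-1], word[-1]
            if head and letter.lower() in ELISION_LETTERS and recognized(head):
                return head + " " + letter + apostrophe
            return word + apostrophe
        if recognized(word):
            return word
        # Same split search as the function quoted in this thread.
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if recognized(a) and recognized(b):
                return a + " " + b
        return word

    # A word, optionally followed by a curved or straight apostrophe.
    return re.sub(r"(\w+)([’'])?", fix_word, text)


# Toy dictionary, for illustration only.
KNOWN = {"dans", "entrée", "il", "attendait", "manuel", "algèbre"}

def recognized(word):
    return word.lower() in KNOWN

print(split_glued("dansl’entrée", recognized))   # dans l’entrée
print(split_glued("ils’attendait", recognized))  # il s’attendait
```

The sketch only handles single-letter elisions, so forms such as "qu’" are left untouched, and it does not try to split the part before the elided letter any further.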

kovidgoyal · 09-20-2016, 04:11 AM

@roger64: Sorry, I'm a little swamped at the moment, so I don't have time to look at this. Hopefully someone else will be able to help you.

roger64 · 09-20-2016, 05:52 AM

Thanks for your reply. I hope so too.

phossler · 09-20-2016, 01:40 PM

@roger64 - what is a link to the post that had the function?

My search didn't turn up anything.

roger64 · 09-20-2016, 03:19 PM

Here is the thread.
https://www.mobileread.com/forums/sho...d.php?t=251941

roger64 · 09-21-2016, 03:33 AM

Hi

And here is a new version of this function that takes elided forms into account (at least for French), thanks to Olivier, the author of the open-source Grammalecte.

Spoiler:

or here: http://pastebin.com/quGQQzcN

phossler · 09-21-2016, 11:34 AM

Thanks.

Small issue

If there's a <style> element in the HTML file, [Replace All] will process its text too and generate errors.

For example, I had a jacket.xhtml file in an epub, and the RE split some things that generated errors (not typos or mis-spellings)

Spoiler:

Any way to make the function a little smarter?

I can always regenerate the jacket.xhtml file, but any other files that contain <style> would probably be changed too.
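One possible way to make the function skip stylesheet text is to look at the tag that opens just before each match of `>([^<]+)<` and return the match unchanged when that tag is `<style>` or `<script>`. The sketch below is standalone Python using the standard `re` module; `inside_skipped_tag`, `process_text_nodes`, and `SKIP_TAGS` are hypothetical names, and it assumes the match object exposes the whole searched text through `match.string`, as Python match objects do. Whether calibre hands the function a match over the full file in the same way is an assumption to verify.

```python
import re

# Elements whose text content should never be spell-split (assumed list).
SKIP_TAGS = ("style", "script")

def inside_skipped_tag(match):
    """True if the matched text directly follows an opening <style> or <script> tag."""
    source = match.string
    # match.start() is the ">" that closes the preceding tag; find where that tag opens.
    tag_start = source.rfind("<", 0, match.start())
    if tag_start == -1:
        return False
    tag = re.match(r"<\s*([A-Za-z][\w:-]*)", source[tag_start:])
    return bool(tag) and tag.group(1).lower() in SKIP_TAGS

def process_text_nodes(html, fix_text):
    """Apply fix_text to text between tags, leaving skipped elements untouched."""
    def replace(match):
        if inside_skipped_tag(match):
            return match.group()  # return the original text unchanged
        return ">" + fix_text(match.group(1)) + "<"
    return re.sub(r">([^<]+)<", replace, html)
```

In the calibre function, the same check would go at the top of `replace()`, returning `match.group()` early. The sketch assumes attribute values contain no `>` and that the stylesheet itself contains no `<`; markup that breaks either assumption would need a real parser.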

Spoiler contents from post #1 (roger64, 09-19-2016, 03:29 PM):

Code:
>([^<]+)<

Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, args, kwargs):

    def fix_word(m):
        word = m.group()
        if dictionaries.recognized(word):
            return word
        for i in xrange(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word

    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'

Last edited by roger64; 09-19-2016 at 03:39 PM.

Spoiler contents from post #7 (phossler, 09-21-2016, 11:34 AM):

<style type="text/css">
.cbj_banner { background: #eee; col or: black; border: thin solid black; margin: 1 em; padding: 1 em; }
table.cbj_header td.cbj_title { font-size: 1.5 em; font-style: italic; text-align: c enter; }
table.cbj_header td.cbj_series { text-align: c enter; }
table.cbj_header td.cbj_author { text-align: c enter; }
table.cbj_header td.cbj_pubdata { text-align: c enter; }
table.cbj_header { width: 100%; }
table.cbj_header td.cbj_label { text-align: right; width: 33%; }
table.cbj_header td.cbj_content { text-align: left; width: 67%; }
hr.metadata_divider { width: 90%; margin-left: 5%; border-top: solid white 0; border-right: solid white 0; border-bottom: solid black 1px; border-left: solid white 0; }
hr { border-top: 0 solid white; border-right: 0 solid white; border-bottom: 2px solid black; border-left: 0 solid white; margin-left: 10%; width: 80%; }
.cbj_footer { font-size: 0.8 em; margin-top: 8px; text-align: c enter; }
</style>

Last edited by phossler; 09-21-2016 at 11:45 AM.


