01-21-2023, 07:43 PM | #1 |
Guru
Posts: 734
Karma: 1077122
Join Date: Sep 2013
Device: Kobo Forma
|
Search for All French Words in English Book?
Does anyone know of a reasonable way to search an English EPUB for French words? Right now, I'm working on some of Christie's "Poirot" novels, and it would be nice to wrap those little phrases in a span with lang="fr". I'm sure I can catch most of them by searching for some of Poirot's more common French expressions, but I was wondering if there was a more sure-fire way to do it.
EDIT: I should have thought of this earlier. If I'm lucky, either the publisher or Calibre will have enclosed the French stuff in italics. I'll search for <i> and, where appropriate, replace it with <i lang="fr">.
Last edited by enuddleyarbl; 01-21-2023 at 07:51 PM. |
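For anyone who would rather review the italic spans in one pass than step through them in the editor's search box, here is a minimal Python sketch. It assumes the italics are bare <i>...</i> tags with no attributes and no nested markup; the function name and sample text are made up for illustration, and this is not a feature of Calibre or Sigil.

```python
import re
from collections import Counter

# Matches a bare <i>...</i> span: no attributes, no nested tags (an assumption).
ITALIC_RE = re.compile(r"<i>([^<]+)</i>")

def italic_candidates(xhtml):
    """Count the text of every bare <i> span so it can be reviewed by hand."""
    return Counter(m.group(1).strip() for m in ITALIC_RE.finditer(xhtml))

# Hypothetical sample text standing in for one chapter file.
sample = (
    "<p><i>Mon ami</i>, you are wrong.</p>"
    "<p>He read <i>The Times</i>. <i>Mon ami</i>!</p>"
)
candidates = italic_candidates(sample)
for text, count in candidates.most_common():
    print(count, text)
```

The list will still contain book titles, emphasis, and other false positives, so it only narrows down what needs checking.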
01-21-2023, 07:56 PM | #2 |
Wizard
Posts: 1,178
Karma: 4949904
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Probably better to use...
PHP Code:
|
01-21-2023, 08:30 PM | #3 |
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
|
Are you expecting to find many French words together, or are you looking for loanwords? It seems like loanword detection is an open research problem, but if you're up for writing a bit of code, I found a few options for doing general language detection:
If you are comfortable with Python, there is langdetect (a port of this Java library, if you prefer Java); spacy-langdetect, a similar option built on the spaCy NLP framework; and textblob (which appears to farm out the language detection to the Google Translate API). Langdetect seems nice and simple, but you still need to figure out how to walk over the words in your book, so spaCy might be a better choice for that, as it comes with sentence segmentation. Good luck!
Last edited by isarl; 01-21-2023 at 08:33 PM. Reason: added mention of sentence segmentation |
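To give a feel for what "walking over the text and scoring it" could look like, here is a rough standard-library sketch. To be clear, it is not langdetect or spaCy: it swaps in a crude word-list heuristic, and the word list, function names, and threshold are all made up for illustration. A real detector would be far more reliable, though short exclamations are hard for any of them.

```python
import re

# Hypothetical, deliberately tiny list of common French words (an assumption).
FRENCH_HINTS = {
    "le", "la", "les", "un", "une", "des", "du", "de", "et", "est",
    "je", "tu", "il", "nous", "vous", "mon", "ma", "mes", "ce", "que",
    "pas", "ne", "oui", "non", "bien", "très", "mais", "ami",
}

# Runs of letters only (accented letters included, digits and underscores excluded).
WORD_RE = re.compile(r"[^\W\d_]+")

def french_score(text):
    """Fraction of the words in `text` that appear in FRENCH_HINTS."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(w in FRENCH_HINTS for w in words) / len(words)

def likely_french(chunks, threshold=0.5):
    """Keep the chunks whose score reaches the (arbitrary) threshold."""
    return [c for c in chunks if french_score(c) >= threshold]

chunks = ["Mon ami", "The Times", "Mais oui", "Je ne sais pas", "He left at once"]
flagged = likely_french(chunks)
print(flagged)
```

Anything flagged still needs a human look, since a short list like this will miss most vocabulary and can collide with English words.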
01-21-2023, 10:41 PM | #4 | |
Guru
Posts: 734
Karma: 1077122
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
Unfortunately, I'd previously accepted a lot of those French words as OK in the spellchecker, so unless I clear that dictionary, I'd be missing them. Plus, already having the <i> selected makes that much easier (though searching through all the false positives does take some time).
@isarl: those tools are good ideas, but much more work than I'm willing to do. Thanks for the suggestions, though. |
|
01-21-2023, 11:47 PM | #5 | |
Wizard
Posts: 1,178
Karma: 4949904
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Quote:
I leave the main reference dictionary untouched. Instead I add to the "temp" dictionary. Every now and then I delete the temp dictionary and start again. |
|
01-22-2023, 01:52 PM | #6 |
Guru
Posts: 734
Karma: 1077122
Join Date: Sep 2013
Device: Kobo Forma
|
On a quasi-related note, is there some construct to translate little blurbs of text for the reader? If these were full-fledged untranslated paragraphs of French, I'd run it through Google Translate and stick the result in a footnote. But, for these little Poirot exclamations, most of them are trivial and only some use words I don't recognize.
I was thinking of commandeering <abbr title="...translation...">french phrase</abbr>, but though it works in Calibre, it doesn't as a kepub on my Forma. A pure <aside>...translation...</aside> just breaks the paragraph and dumps it right there. <ruby>french phrase<rt>the translation</rt></ruby> looked interesting, but it puts the individual translated words over their corresponding untranslated words (instead of just putting the whole translation over the untranslated phrase). Right now, the best I can come up with is a full-fledged footnote to properly set the translation off from the paragraph. |
01-22-2023, 03:18 PM | #7 | |||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I described how to use Sigil's/Calibre's Spellcheck Lists in order to tag each Spanish/"foreign word" with an HTML language. You could then use some regex to merge everything together.
- - -
I even wrote another tutorial showing you how you can use 2 dictionaries to quickly spot "foreign words" too:
In that case, I used the trick to quickly find all British<->American spellings.
Quote:
Code:
<i lang="fr" xml:lang="fr">
Code:
<i class="french" lang="fr" xml:lang="fr">
Then if you want all your French words to be red? Very simple to understand CSS:
Code:
.french { color: red; }
Side Note: If you want even more on proper HTML language markup, type this into your favorite search engine:
Code:
xml:lang Tex2002ans site:mobileread.com
- - -
Quote:
allows you to Auto-Translate text, inline, similar to Google Translate on a webpage. You could also press+hold, then send the highlighted text to a translation site too. (In PocketBook, you can also choose which engine you want to use, like DeepL, Google Translate, Bing Translate, etc.)
Quote:
But if it's an ebook for actual sale, DO NOT use those hackish <abbr> or <ruby> methods. If your device doesn't have the Auto-Translate stuff, you could also do something like shoving the translation right after + in a different font:
Code:
« Je parle français! » (I speak French!)
Code:
<span class="french" lang="fr" xml:lang="fr">« Je parle français! »</span> <span class="translated">(I speak French!)</span>
Code:
span.translated { font-weight: bold; }
Quote:
Similar to when I run across Greek or Japanese or Chinese in my books. I just nod my head... then continue reading. (But not until I properly tag the language, of course!!!)
Last edited by Tex2002ans; 01-22-2023 at 03:48 PM. |
|||||
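Once a list of phrases has been confirmed as French, applying the class="french" lang="fr" xml:lang="fr" markup from the post above could be scripted. Here is a minimal sketch, assuming the italics are bare <i>...</i> tags with no attributes or nested markup; the function name and the phrase list are hypothetical, and this is not the regex from the linked tutorials.

```python
import re

# Opening tag as suggested in the post above.
FRENCH_TAG = '<i class="french" lang="fr" xml:lang="fr">'

def tag_french(xhtml, phrases):
    """Rewrite bare <i>phrase</i> spans whose text is in `phrases`."""
    def repl(match):
        text = match.group(1)
        if text.strip() in phrases:
            return FRENCH_TAG + text + "</i>"
        return match.group(0)  # leave every other italic span untouched
    return re.sub(r"<i>([^<]+)</i>", repl, xhtml)

# Hypothetical list of phrases already checked by hand.
confirmed = {"Mon ami", "Mais oui"}
before = "<p><i>Mon ami</i>, I read <i>The Times</i>.</p>"
after = tag_french(before, confirmed)
print(after)
```

Matching against an explicit list rather than tagging every <i> keeps titles and plain emphasis from being marked as French by accident.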
01-22-2023, 05:01 PM | #8 |
Guru
Posts: 734
Karma: 1077122
Join Date: Sep 2013
Device: Kobo Forma
|
Code:
<i class="french" lang="fr" xml:lang="fr">
https://developer.mozilla.org/en-US/docs/Web/CSS/:lang
The CSS syntax is:
Code:
:lang(languagecode) { css declarations; } |
01-22-2023, 08:01 PM | #9 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Yes, but .french = a simple CSS selector.
The :lang selector is much more advanced CSS3/CSS4... and older renderers might not be able to handle that.
- - -
It also helps make the code much more human-readable. Would you know what:
Code:
lang="et"
But if you saw:
Code:
class="estonian" lang="et"
Last edited by Tex2002ans; 01-22-2023 at 08:05 PM. |
|