#1 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
[Plugin] ForeignWords – Marks words/phrases in a foreign language with the span tag
Updated: April 3, 2025
Current Version: "0.0.107"
Status: BETA5
Plugin type: edit
License/Copying: GNU LGPL Version 2 or Version 3, your choice. Any other license terms are only available directly from the author in writing.
Warning: Because the plugin is in beta and may behave unpredictably, be sure to back up your EPUB files and/or use checkpoints.
Introduction: The first versions of the plugin were created in 2020, but the work was abandoned. I returned to the idea in 2024 and used the working version privately. In 2025, I finally decided to register on MR and present the plugin to the world.
How it works: The plugin is Sigil-centric, so it uses Sigil's user dictionaries (selected by file name, without extension). Sigil maintains these dictionaries itself, so the plugin does not have to worry about duplicates, for example. The plugin searches the selected (or all) xhtml files for words from the chosen user dictionary. When it finds a word, it also checks whether the words immediately before and after it are in the dictionary. This way, the plugin finds whole phrases.
Once all the words and phrases have been determined, they are wrapped in a SPAN tag with the selected class, lang and xml:lang attributes. Each attribute is handled individually, so you can select all of them or none. In the latter case, a "naked" SPAN is applied. If the word/phrase is already wrapped in a SPAN tag then [NOTE! THIS IS IMPORTANT!] the existing attributes will be updated or removed, according to the current plugin configuration.
The plugin can also work without a dictionary. In that mode it searches the text for all existing span tags whose "lang" or "xml:lang" attribute declares the selected language. You can then replace the language attributes in those spans and add or remove a class. The plugin can work on one, several or all xhtml files.
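The phrase search described above can be illustrated with a small Python sketch. This is not the plugin's actual code: the function names, the word regex, the exact (case-sensitive) matching and the rule that only whitespace may separate words in a phrase are all assumptions made for illustration. It also works on plain text only, not on real XHTML markup.

```python
import re

# A "word" is a run of letters, optionally joined by an apostrophe or hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def find_phrases(text, dictionary):
    """Return (start, end) offsets of maximal runs of dictionary words."""
    known = set(dictionary)  # assumption: exact, case-sensitive matching
    spans = []
    run_start = run_end = None
    for match in WORD_RE.finditer(text):
        is_known = match.group() in known
        # Extend the current run only if nothing but whitespace separates the words.
        if is_known and run_end is not None and text[run_end:match.start()].isspace():
            run_end = match.end()
            continue
        if run_start is not None:
            spans.append((run_start, run_end))
            run_start = run_end = None
        if is_known:
            run_start, run_end = match.start(), match.end()
    if run_start is not None:
        spans.append((run_start, run_end))
    return spans


def wrap_phrases(text, dictionary, open_tag='<span lang="fr" xml:lang="fr">'):
    """Wrap every phrase found by find_phrases in the given opening tag."""
    pieces, pos = [], 0
    for start, end in find_phrases(text, dictionary):
        pieces.append(text[pos:start])
        pieces.append(f"{open_tag}{text[start:end]}</span>")
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)
```

Calling wrap_phrases("He said je ne sais quoi, then left.", {"je", "ne", "sais", "quoi"}) wraps the four French words as one phrase, because only spaces separate them. A comma or an unknown word ends the run.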
The configuration window icon also indicates the status of the plugin (beta or stable).
Languages: The user can freely edit the list of languages available in the drop-down list. If the main text is in English (or any other Latin-script language) but contains inserts in Russian, Georgian, Hebrew, Arabic, Korean, Chinese, or Japanese, it is relatively easy to detect them in the SpellCheck window and add them to the corresponding user dictionary. Languages that share identical words are sometimes problematic, because such a word can be attributed to the foreign language even though it also occurs in the main language of the book.
UI translation: The plugin is prepared for interface translation, but I think willing translators should wait until the stable version is published.
Installation:
1. Select Manage Plugins from the Plugins menu. In the Manage Plugins dialog box, select Use Bundled Python, if it isn't already selected. (If your Sigil version doesn't have a Use Bundled Python option, click one of the Auto buttons to detect the path or Set to manually select the Python interpreter path.)
2. Click Add Plugin and select ForeignWords_v0.X.Y.zip. This will install the ForeignWords plugin, which you can select via Plugins > Edit > ForeignWords.
Issues: Please note that the plugin MAY have a destructive effect on existing SPAN tags that contain class, lang and xml:lang attributes. While the plugin is running, if a matching word is found, the existing attributes are replaced (or removed!) according to the configuration. Existing attributes in SPAN tags are also sorted alphabetically, which makes it easy to merge adjacent spans with identical attributes into longer phrases. If you want to keep the existing "class" attribute, enable the "Ignore CSS Class" option.
OS Requirements: Windows/Linux/macOS. I tested this plugin on Windows 10 and Windows 11 with Sigil 2.3.0, 2.4.0, 2.4.2, pre-2.5.0, but it should work the same on Linux and macOS.
*** Linux users will have to make sure that the PyQt5 graphical Python module (or the PySide6 module starting with Sigil 2.0) is present if it's not already. ***
Please feel free to give me feedback.
Sigil Requirements: I have set the minimum version of Sigil at 1.0.0, but I still need to check it more carefully. I would appreciate feedback on whether the plugin works properly in Sigil 1.X.
Last edited by Haudek; 04-03-2025 at 05:40 AM. Reason: Jump version or edit description
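To make the Issues note above concrete, here is a hedged Python sketch of how replacing, removing and sorting span attributes could work. It is not taken from the plugin: the function name and parameters are hypothetical, and keeping unrelated attributes such as id untouched is an assumption.

```python
def update_span_attributes(existing, *, class_name=None, lang=None,
                           xml_lang=None, ignore_class=False):
    """Return a new attribute dict, updated per the settings and sorted by name.

    A value of None means "this attribute is switched off", so an existing
    attribute of that name is removed. With ignore_class=True the existing
    class attribute is left alone (modelled on the "Ignore CSS Class" option).
    """
    managed = {"lang": lang, "xml:lang": xml_lang}
    if not ignore_class:
        managed["class"] = class_name
    result = dict(existing)  # assumption: unrelated attributes (e.g. id) are kept
    for name, value in managed.items():
        if value is None:
            result.pop(name, None)   # switched off: remove if present
        else:
            result[name] = value     # switched on: add or overwrite
    # Alphabetical order makes spans with equal attributes textually identical.
    return dict(sorted(result.items()))
```

For example, a span carrying class="old" and xml:lang="de" that is processed with lang="fr", xml_lang="fr" and no class loses its class and ends up with lang and xml:lang set to "fr". With ignore_class=True the class would survive, which is why that option matters for existing markup.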
#2 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
Foreign Words Plugin for Sigil – Short Description
Overview
The Foreign Words plugin for Sigil automatically identifies and marks foreign words or phrases within EPUB files. It uses user-defined dictionaries to detect foreign expressions and applies customizable HTML markup (<span> tags) to these expressions, facilitating consistent styling and language identification.
Main Features
Configuration Options
The plugin provides a graphical interface for configuring its options. The interface dynamically updates a preview of the resulting <span> markup based on the current settings. Users can customize the following settings:
Use Dictionary File: If enabled, words/phrases are searched for based on the user dictionary. If disabled, the plugin looks for existing <span> tags in the document that match the language selected in the "Look For This Language Code" drop-down list.
User Dictionary File: Select the dictionary file containing foreign words (filenames only, without extension).
Ignore CSS Class: Ignore existing classes in <span> tags.
Use CSS Class: Enable/disable adding a CSS class to marked expressions.
CSS Class Name: Name of the CSS class applied to marked expressions.
Use Lang Attribute: Enable/disable adding the lang attribute.
Use xml:lang Attribute: Enable/disable adding the xml:lang attribute.
Language Code: Language code used in the lang and xml:lang attributes.
Combine Words: Combine adjacent foreign words into phrases.
Merge Spans: Merge adjacent <span> tags with identical attributes.
Run Without Confirmation: Execute the plugin without confirmation dialogs.
Debug Mode: Enable detailed logging for troubleshooting. Useful information, but more for developers.
What exactly does the plugin do?
Code:
<span class="english" lang="en" xml:lang="en">foreign phrase</span>
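As a rough illustration of how the configuration options and the sample markup above relate, the following Python sketch builds the preview string from a settings object. The class and field names are hypothetical and merely mirror the option names; the plugin's real implementation may differ.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class SpanSettings:
    # Hypothetical fields modelled on the options listed above.
    use_class: bool = True
    class_name: str = "english"
    use_lang: bool = True
    use_xml_lang: bool = True
    language_code: str = "en"


def preview_span(settings, text="foreign phrase"):
    """Build the <span> markup that the current settings would produce."""
    attrs = []
    if settings.use_class and settings.class_name:
        attrs.append(("class", settings.class_name))
    if settings.use_lang:
        attrs.append(("lang", settings.language_code))
    if settings.use_xml_lang:
        attrs.append(("xml:lang", settings.language_code))
    # Each attribute is independent; with none enabled a "naked" span remains.
    attr_text = "".join(
        f' {name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    return f"<span{attr_text}>{escape(text, quote=False)}</span>"
```

With the default settings, preview_span(SpanSettings()) reproduces the sample line shown above. Switching all three attribute options off yields a bare <span> around the text.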
Applies to plugin version: 0.0.107 (beta6)
Last edited by Haudek; 04-03-2025 at 05:40 AM. Reason: Editing the description
#3 |
Sigil Developer
Posts: 9,069
Karma: 6361556
Join Date: Nov 2009
Device: many
|
Thank you for contributing your plugin to help others! I am travelling with only a phone now but upon my return, I will happily test and report back.
#4 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
Sure.
I got a sample file, and I already know there are situations in which the plugin will not be very useful, namely when we want to mark languages that share the same words ("la" in Italian and French). The plugin is most useful when the whole book is in one language but contains phrases in another, for example when one of the characters keeps inserting German or French sentences. Even so, it is necessary to check that the attributes were added correctly. Checkpoints are perfect for reviewing the differences.
#5 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
Sorry for the double post, but I can't edit my posts yet (I'm new on MR).
I did quite a bit of experimentation today, and it turns out that the plugin handles languages very different from English well. That is, if the main text is in English but contains inserts in Russian, Georgian, Hebrew, Arabic, Korean, Chinese, or Japanese, it is relatively easy to detect them in the SpellCheck window and add them to the corresponding user dictionary. The plugin will take care of the rest. The problem arises mainly when the two languages share part of their vocabulary, because the plugin cannot tell whether a given word belongs to the publication's default language or to the other language in which an identical word exists.
#6 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
Version 0.0.82 (Beta 2).
Users are invited to test the Beta2 version.
Last edited by Haudek; 03-19-2025 at 08:15 PM.
#7 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
Version 0.0.83 (Beta 3).
See the second post of this thread for a description of the plugin.
#8 |
Addict
Posts: 304
Karma: 516
Join Date: Nov 2015
Location: Europe EEC
Device: Kindle Fire HD6 & HD8
|
I've just given your plugin a test drive on one xhtml file which has some parts in French. The Plugin Runner did not give a clear indication that the job was done. At the top it showed "Status: success", but the last entry in the main part of the window said "INFO: Saving file ...."
I waited a bit, expecting the runner to close, but it didn't. The plugin appeared to work as designed, but the result was a solid block of HTML in Code View, very difficult to read with no white space. Before running the plugin, Sigil had prettified Code View, but the plugin undid that. Not all of the <span>s were merged, but some were, so the plugin did work as designed. Some words were not recognised as French. The problem undoubtedly lies with my dictionaries. I confess that my treatment of spellcheck over the years has not been ideal, and my dictionaries are probably in a deplorable state. I write non-fiction, so I have a load of jargon words and abbreviations particular to the subject matter. That has led me to ignore or mistreat the spellcheck over the years. Before trying again, I think I'll have to 'fix' my dictionaries.
#9 |
Grand Sorcerer
Posts: 28,848
Karma: 207000000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
They're probably just leaving the Runner visible during the beta to make it easier to see/report debug info. Unless the autoclose value is set to true in a plugin's plugin.xml file, they will all behave that way. We didn't even have an autostarting/autoclosing Plugin Runner feature in the early plugin days. Had to start them with a button and close them with a button.
#10 |
Sigil Developer
Posts: 9,069
Karma: 6361556
Join Date: Nov 2009
Device: many
|
And if anyone is interested, we have added an xhtml pretty routine to our sigil_bs4 fork.
Code:
def prettyprint_xhtml(self, indent_level=0, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal", indent_chars=" ")
#11 | |
Addict
Posts: 304
Karma: 516
Join Date: Nov 2015
Location: Europe EEC
Device: Kindle Fire HD6 & HD8
|
Quote:
I tried to clean up my default and French dictionaries. I ran spellcheck and, in the French half of the xhtml file, added every word underlined by the spellcheck to the French user dictionary. When spellcheck no longer underlined any word, I looked at the word list for the French user dictionary (Preferences), and the words appeared there OK. Then I ran the plugin, but it failed to merge all the <span>s. Its merge action gets interrupted by four types of occurrence (so far):
1. the occasional word which is not identified as French and not identified as misspelt by the spellchecker
2. words which have the same spelling in en and fr (although the meaning is different). This includes number digits.
3. words within <i> tags
4. words within <b> tags.
There's not much that can be done for words which might be in either language, but it takes too much cleaning up for variants such as plurals, words starting with an upper-case letter, and occasions where a letter is replaced with ' such as d'exercices. For my use case, where I'm trying to improve usability with conversion to epub3, it's much simpler to wrap the entire half-chapter in a <section> with the language specifiers included in the section tag.
#12 |
Addict
Posts: 304
Karma: 516
Join Date: Nov 2015
Location: Europe EEC
Device: Kindle Fire HD6 & HD8
|
Although I said previously that my treatment of spellcheck had been poor over the years, I did not realise how deficient I was. I found an interesting and informative thread about spellcheck which talked about SCOWL word lists etc.
I persisted in trying to get the plugin to complete the <span> merges but ended up thinking that it was somehow being prevented by the spellcheck system. Whenever I invoke spellcheck in Sigil, it produces a list of words which it doubts but which it lists as EN (probably because the base language of the epub is listed as EN). I go through this list and add the words which are in fact FR to my FR user dictionary and those which are EN to my EN user dictionary. I save everything, but the next invocation of spellcheck produces the same list of words. I checked the contents of the user dictionaries in Sigil's preferences, and the words were in fact present there. The 'add to dictionary' had worked. When I invoke spellcheck and click on a French word in its list, it takes me to the instance in Code View, and that is invariably in a break between the plugin's <span> merges. It looks like spellcheck ignores the user dictionaries and that this is one of the causes of the plugin's failure to merge all spans.
#13 |
Sigil Developer
Posts: 9,069
Karma: 6361556
Join Date: Nov 2009
Device: many
|
No, spellcheck does not ignore user dictionaries that are properly specified, but it does follow the lang and xml:lang attributes religiously to determine which dictionary to check a word against.
But in this plugin's case, there are no xml:lang or lang attributes on span tags around foreign words yet, so spellcheck will default to the xml:lang or lang attributes on the html tag or, if those are not present, to the first dc:language metadata tag in the opf.
So it is a bit of a catch-22 here. You need the lang attributes to know which dictionary to look a word up in, but you need spellcheck to determine whether a word is potentially a foreign word or just an incorrectly spelled one. Detecting single possible foreign words is possible with only one dictionary, and you can generate a word list of probable foreign words. But until spans are added and merged and the xml:lang/lang attributes for the proper foreign language are added, there is no way to tell which dictionary to look up a word in.
I think the best approach is multi-pass. The first pass uses a single dictionary to create a list of single foreign words, wraps those in span tags and adds the proper lang and xml:lang attributes to each. The next pass uses the find foreign words variant of this plugin, or just a search for xml:lang, to visually/manually fix any intervening adjacent words that were not properly detected by adding the proper span and lang info. The final pass merges adjacent spans with matching xml:lang attributes.
Plugins can in fact use hunspell spellchecking directly, and inside a plugin you could force a lookup of a word to see if it exists in both dictionaries and, if it is adjacent to an existing span, add it. Then the entire process would be done inside the plugin.
Last edited by KevinH; 03-23-2025 at 09:12 AM.
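The final pass described above, merging adjacent spans with matching attributes, could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the plugin's code: the function name is hypothetical, attributes must appear in the same order in both spans (which alphabetical sorting would ensure), and only flat spans separated by whitespace are handled.

```python
import re

# Two flat <span> elements with byte-identical attribute text, separated
# only by optional whitespace. (?P=attrs) forces the second opening tag to
# repeat the attributes of the first.
_ADJACENT_SPANS = re.compile(
    r'<span(?P<attrs>(?:\s+[\w:.-]+="[^"]*")*)>(?P<a>[^<]*)</span>'
    r'(?P<gap>\s*)'
    r'<span(?P=attrs)>(?P<b>[^<]*)</span>'
)


def merge_adjacent_spans(xhtml):
    """Repeatedly merge neighbouring spans that carry identical attributes."""
    previous = None
    while previous != xhtml:
        previous = xhtml
        # One pass merges pairs; the loop repeats until nothing changes,
        # so runs of three or more spans collapse into one.
        xhtml = _ADJACENT_SPANS.sub(
            r'<span\g<attrs>>\g<a>\g<gap>\g<b></span>', xhtml
        )
    return xhtml
```

Because the pattern only accepts spans without child elements, a word wrapped in <i> or <b> between two spans stops the merge, which matches the interruptions reported earlier in the thread. A real implementation would more likely walk the parsed document tree than use regular expressions.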
#14 |
Member
Posts: 24
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
|
Thank you for your comments, most of which I agree with.
All the cases described are familiar to me, and the plugin is meant to work that way. Of course, I still have some ideas that may (or may not) improve the merging function. As for the lack of feedback to the user about what has been done, I have already changed that, and it should be better in beta4 (0.0.94). The ideas submitted are very interesting, and I will certainly consider some of them.
Last edited by Haudek; 03-23-2025 at 05:42 PM.
Tags |
lang, language code, plugin, sigil, span |