Old 03-15-2025, 11:01 AM   #1
Haudek
Member
Posts: 23
Karma: 111614
Join Date: Mar 2025
Location: Poland
Device: Kindle Voyage
[Plugin] ForeignWords – Marks words/phrases in a foreign language with the span tag

Updated: April 3, 2025
Current Version: "0.0.107"
Status: BETA6
Plugin type: edit

Plugin icon:


License/Copying: GNU LGPL Version 2 or Version 3 your choice. Any other license terms are only available directly from the author in writing.

Change Log:
  • v0.0.XX - Private alpha versions
  • v0.0.57 - First public beta version
  • v0.0.82 - Public beta 2 version
  • v0.0.83 - Public beta 3 version
  • v0.0.94 - Public beta 4 version
  • v0.0.106 - Public beta 5 version
  • v0.0.107 - Public beta 6 version

Warning:
Since the plugin is in beta and could potentially work in unpredictable ways, be sure to back up EPUB files and/or use checkpoints.

Introduction:
The first versions of the plugin were created in 2020, but the work was abandoned. I returned to the idea in 2024 and used the working version privately. In 2025, I finally decided to register on MR and present the plugin to the world.

How it works:
The plugin is Sigil-centric, so it uses user dictionaries (file names without extensions). The dictionaries are cleverly created by Sigil, so I don't have to worry about duplicates, for example.
Based on the words in the user dictionary, the plugin searches the selected (or all) xhtml files and looks for words in the text.
Once a word is found, the plugin doesn't rest on its laurels, but checks for earlier and later words in the dictionary. This way, the plugin finds whole phrases. Once all the words and phrases are determined, the main action begins.
The found words/phrases are surrounded by the SPAN tag along with the selected class, lang and xml:lang attributes. Each attribute is treated individually, so you can select all or none. In the latter case, the "naked" SPAN will be applied.
If the word/phrase has already been surrounded by a SPAN tag then [NOTE! THIS IS IMPORTANT!] the existing attributes will be updated or removed – according to the current plugin configuration.
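
To make the steps above concrete, here is a minimal sketch of the dictionary lookup and phrase building. It is not the plugin's actual code: it works on plain text without markup, and the names `find_phrases` and `mark_foreign`, the word regex, and the rule that only whitespace may separate the words of a phrase are all assumptions made for illustration.

```python
import re

# Hypothetical tokenizer: runs of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def find_phrases(text, dictionary, combine=True):
    """Return (start, end) offsets of dictionary words, optionally joined into phrases."""
    hits = [m.span() for m in WORD_RE.finditer(text) if m.group(0) in dictionary]
    if not combine:
        return hits
    phrases = []
    for start, end in hits:
        # Extend the previous hit when only whitespace separates the two words.
        if phrases and text[phrases[-1][1]:start].isspace():
            phrases[-1] = (phrases[-1][0], end)
        else:
            phrases.append((start, end))
    return phrases


def mark_foreign(text, dictionary, open_tag='<span lang="en" xml:lang="en">'):
    """Wrap every found word/phrase in the given opening tag and </span>."""
    out, pos = [], 0
    for start, end in find_phrases(text, dictionary):
        out.append(text[pos:start])
        out.append(f"{open_tag}{text[start:end]}</span>")
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

For example, `mark_foreign("He said guten Morgen and left.", {"guten", "Morgen"}, '<span lang="de">')` wraps "guten Morgen" in a single span rather than two. Real XHTML needs a parser so that tags and attribute values are never matched as words.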

The plugin can also work without a dictionary and then it searches in the text for all existing span tags with the declared language in the "lang" or "xml:lang" attribute. We can substitute language attributes in such spans and add or remove a class.
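
The dictionary-free mode can be pictured roughly as follows. This is only a sketch under stated assumptions: the function name `retarget_spans` and its parameters are hypothetical, attributes are assumed to be double-quoted, and a regex stands in for whatever parser the plugin really uses.

```python
import re

SPAN_RE = re.compile(r"<span\b([^>]*)>", re.IGNORECASE)
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def retarget_spans(xhtml, look_for, new_lang, css_class=None):
    """Rewrite spans whose lang or xml:lang equals look_for.

    css_class=None removes an existing class; a string sets or replaces it.
    """
    def fix(match):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        if look_for not in (attrs.get("lang"), attrs.get("xml:lang")):
            return match.group(0)  # some other span, leave it untouched
        attrs["lang"] = attrs["xml:lang"] = new_lang
        if css_class is None:
            attrs.pop("class", None)
        else:
            attrs["class"] = css_class
        rendered = " ".join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
        return f"<span {rendered}>"

    return SPAN_RE.sub(fix, xhtml)
```

Spans without a matching language attribute pass through unchanged, which is the property worth checking with a checkpoint diff after a real run.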

Work on one, several or all xhtml files.
If you select only one file (or a few) in the Book Browser window
[Image: sigil-foreignwords-selected-files-in-book-browser.png]
then the plugin will operate only on that file:
[Image: sigil-foreignwords-selected-files-beta5.png]

If you select the virtual folder "Text" in the Book Browser window
[Image: sigil-foreignwords-all-files-in-book-browser.png]
the plugin will operate on all xhtml files:
[Image: sigil-foreignwords-all-files-beta5.png]
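
A plugin could pick its working set along these lines. This is a hedged sketch, not the plugin's code: `selected_iter()`, `id_to_mime()` and `text_iter()` are, to my knowledge, methods of the book container that Sigil passes to edit plugins, while the helper name `files_to_process` and the rule "no selected xhtml file means all files" are assumptions.

```python
XHTML_MIME = "application/xhtml+xml"


def files_to_process(bk):
    """Return manifest ids of the xhtml files to work on.

    bk is the book container object Sigil hands to a plugin's run(bk).
    """
    selected = [
        ident
        for id_type, ident in bk.selected_iter()
        if id_type == "manifest" and bk.id_to_mime(ident) == XHTML_MIME
    ]
    if selected:
        return selected
    # Assumption: nothing usable selected means "operate on every text file".
    return [ident for ident, _href in bk.text_iter()]
```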


The configuration window icon also indicates the status of the plugin.

Beta version:

Stable version:

Languages:
The user can freely edit the list of languages that will be available in the drop-down list.
If the main text is in English (or any other language written in the Latin script), but contains inserts in Russian, Georgian, Hebrew, Arabic, Korean, Chinese, or Japanese, it is relatively easy to detect them in the SpellCheck window and add them to the corresponding user dictionary.
Languages that share identical words are sometimes problematic, because such a word can be attributed to a foreign language even though it also occurs in the main language of the book.

UI translation:
The plugin is prepared for interface translation, but I think willing translators can take that on once the stable version is published.

Credits/Thanks:
Installation:
1. Select Manage Plugins from the Plugins menu. In the Manage Plugins dialog box, select Use Bundled Python, if it isn't already selected. (If your Sigil version doesn't have a Use Bundled Python option, click one of the Auto buttons to detect the path or Set to manually select the Python interpreter path.)
2. Click Add Plugin and select ForeignWords_v0.X.Y.zip. This will install the ForeignWords plugin, which you can select via Plugins > Edit > ForeignWords.

Issues:
Please note that the plugin MAY have a destructive effect on existing SPAN tags that contain class, lang and xml:lang attributes. Whenever a matching word is found inside such a tag, the existing attributes are replaced (or removed!) according to the configuration.
Existing attributes in SPAN tags are also sorted alphabetically, which makes it easier to merge adjacent spans with identical attributes into longer phrases.
If you want to keep the existing "class" attribute, enable the "Ignore CSS Class" option.
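
Why sorting the attributes helps merging can be shown with a small sketch. It is an illustration only, assuming flat (non-nested) spans with plain-text content and double-quoted attributes; the names `normalize` and `merge_adjacent` are hypothetical and not taken from the plugin.

```python
import re

ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')
OPEN_RE = re.compile(r"<span\b([^>]*)>")
# Two text-only spans with the very same opening tag, separated by whitespace.
PAIR_RE = re.compile(r"(<span\b[^>]*>)([^<]*)</span>(\s*)\1([^<]*)</span>")


def normalize(xhtml):
    """Rewrite every opening <span> with its attributes in alphabetical order."""
    def fix(match):
        attrs = sorted(ATTR_RE.findall(match.group(1)))
        return "<span" + "".join(f' {k}="{v}"' for k, v in attrs) + ">"
    return OPEN_RE.sub(fix, xhtml)


def merge_adjacent(xhtml):
    """Merge neighbouring spans whose normalized opening tags are identical."""
    xhtml = normalize(xhtml)
    while True:
        merged = PAIR_RE.sub(r"\1\2\3\4</span>", xhtml)
        if merged == xhtml:  # nothing left to merge
            return merged
        xhtml = merged
```

Once `lang` and `xml:lang` always appear in the same order, two spans with the same attributes have byte-identical opening tags, so a plain comparison is enough to decide whether they can be joined.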

OS Requirements:
Windows/Linux/OS X.
I tested this plugin on Windows 10 and Windows 11 with Sigil 2.3.0, 2.4.0, 2.4.2, pre-2.5.0, but it should work the same on Linux and macOS.
*** Linux users will have to make sure that the PyQt5 graphical Python module (or PySide6 module starting with Sigil 2.0) is present if it's not already. ***
Please feel free to give me feedback.

Sigil Requirements:
I have set the minimum version of Sigil at 1.0.0, but I still need to check it more carefully.
Here I ask for feedback on whether the plugin works properly in Sigil version 1.X.

Links to related posts/threads:
Attached Files
File Type: zip ForeignWords_v0.0.107.zip (37.2 KB, 17 views)

Last edited by Haudek; 04-03-2025 at 05:40 AM. Reason: Jump version or edit description
Old 03-15-2025, 04:52 PM   #2
Haudek
Foreign Words Plugin for Sigil – Short Description

Overview
The Foreign Words plugin for Sigil automatically identifies and marks foreign words or phrases within EPUB files. It uses user-defined dictionaries to detect foreign expressions and applies customizable HTML markup (<span> tags) to these expressions, facilitating consistent styling and language identification.

Main Features
  • Automatic Detection: Scans EPUB content for words or phrases defined in user dictionaries
    OR
    Search for existing: Searches the document for all <span> tags with the specified language in the lang or xml:lang attribute.
  • Customizable Markup: Wraps detected expressions in <span> tags with optional CSS classes and language attributes.
  • Phrase Combination: Optionally combines adjacent foreign words into phrases.
  • Span Merging: Optionally merges adjacent <span> tags with identical attributes to optimize markup structure.

Configuration Options
The plugin provides a graphical interface allowing users to configure plugin options easily.
The interface dynamically updates a preview of the resulting <span> markup based on current settings.

Users can customize the following settings:


Use Dictionary File
If this option is enabled, words/phrases are searched for based on the user dictionary. If it is disabled, the plugin looks for existing <span> tags in the document that carry the language selected in the "Look For This Language Code" drop-down list.

User Dictionary File
Select the dictionary file containing foreign words (only filenames without extension)

Ignore CSS Class
You can ignore existing classes in <span> tags

Use CSS Class
Enable/disable adding a CSS class to marked expressions

CSS Class Name
Name of the CSS class applied to marked expressions

Use Lang Attribute
Enable/disable adding lang attribute

Use xml:lang Attribute
Enable/disable adding xml:lang attribute

Language Code
Language code used in attributes (lang and xml:lang)

Combine Words
Combine adjacent foreign words into phrases

Merge Spans
Merge adjacent <span> tags with identical attributes

Run Without Confirmation
Execute plugin without confirmation dialogs

Debug Mode
Enable detailed logging for troubleshooting. Useful information, but more for developers.

What exactly does the plugin do?
  • All detected foreign words/phrases are wrapped in <span> tags.
  • Tags include optional CSS class (class="english" by default).
  • Tags include language attributes (lang="en" and/or xml:lang="en").
  • Adjacent marked words can be combined into single phrases.
  • Adjacent spans can be merged if enabled.
Example output:
Code:
<span class="english" lang="en" xml:lang="en">foreign phrase</span>
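
How the options above could combine into that markup is sketched below. The function `build_span` and its parameter names are hypothetical; only the defaults (`class="english"`, `lang="en"`, `xml:lang="en"`) come from the description.

```python
from html import escape


def build_span(text, css_class="english", lang="en",
               use_class=True, use_lang=True, use_xml_lang=True):
    """Build the span markup; each attribute is switched on or off independently."""
    attrs = []
    if use_class and css_class:
        attrs.append(("class", css_class))
    if use_lang and lang:
        attrs.append(("lang", lang))
    if use_xml_lang and lang:
        attrs.append(("xml:lang", lang))
    rendered = "".join(f' {name}="{escape(value, quote=True)}"' for name, value in attrs)
    return f"<span{rendered}>{escape(text, quote=False)}</span>"
```

With every option switched off, the result is the "naked" span mentioned in the first post: `<span>foreign phrase</span>`.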
Description updated: April 3, 2025
Applies to plugin version: 0.0.107 (beta6)

Last edited by Haudek; 04-03-2025 at 05:40 AM. Reason: Editing the description
Old 03-16-2025, 05:44 AM   #3
KevinH
Sigil Developer
Posts: 8,404
Karma: 5702578
Join Date: Nov 2009
Device: many
Thank you for contributing your plugin to help others! I am travelling with only a phone now but upon my return, I will happily test and report back.
Old 03-16-2025, 12:15 PM   #4
Haudek
Sure.

I got a sample file and I already know that there are situations in which the plugin will not be very useful. It is about the situation where we want to mark languages with the same words ("la" in Italian and French).

It is most useful when the whole book is in one language, but has phrases in another, for example, one of the characters keeps inserting some German or French sentences.

Nevertheless, checking that the attributes are added correctly is necessary.
Checkpoints will be perfect for checking the differences.
Old 03-16-2025, 04:16 PM   #5
Haudek
Sorry for the double post, but I can't edit my posts yet (I'm new on MR).

I did quite a bit of experimentation today and it turns out that languages very different from English are supported by the plugin.
That is, if the main text is in English, but contains inserts in Russian, Georgian, Hebrew, Arabic, Korean, Chinese, or Japanese it is relatively easy to detect them in the SpellCheck window and add them to the corresponding user dictionary.
The plugin will take care of the rest.

The problem, then, is more when the two languages share a certain word base, because the plugin is unable to recognize whether the word in question is in the publication's default language or in a language where an identical word exists.
Old 03-19-2025, 05:48 PM   #6
Haudek
Version 0.0.82 (Beta 2).
Users are invited to test the Beta2 version.

Last edited by Haudek; 03-19-2025 at 08:15 PM.
Old 03-20-2025, 04:11 PM   #7
Haudek
Version 0.0.83 (Beta 3).
See the second post of this thread for a description of the plugin.
Old 03-21-2025, 06:34 AM   #8
philja
Addict
Posts: 206
Karma: 10
Join Date: Nov 2015
Location: Europe EEC
Device: Kindle Fire HD6 & HD8
I've just given your plugin a test drive on one xhtml file which has some parts in French. Plugin runner did not give a clear indication that the job was done. At the top it showed Status: success but the last entry in the main part of the window said "INFO: Saving file ...."

I waited a bit expecting the runner to close but it didn't.

The plugin appeared to work as designed but the result was a solid block of HTML in code view - very difficult to read with no white space.

Before running the plugin, Sigil had prettified Code View but the plugin undid that.

Not all the <span>s were merged, but some were, so the plugin did work as intended. Some words were not recognised as French. The problem undoubtedly lies with my dictionaries.

I confess that my treatment of spellcheck over the years has not been ideal and my dictionaries are probably in a deplorable state. I write non-fiction so I have a load of jargon words and abbreviations particular to the subject matter. That has led me to ignore or mistreat the spellcheck over the years.

Before trying again, I think I'll have to 'fix' my dictionaries.
Old 03-21-2025, 08:37 AM   #9
DiapDealer
Grand Sorcerer
Posts: 28,308
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by philja View Post
I waited a bit expecting the runner to close but it didn't.
They're probably just leaving the Runner visible during the beta to make it easier to see/report debug info. Unless the autoclose value is set to true in a plugin's plugin.xml file, they will all behave that way. We didn't even have an autostarting/autoclosing Plugin Runner feature in the early plugin days. Had to start them with a button and close them with a button.
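
For anyone who wants to check what a given plugin declares, here is a small sketch that reads the flag. The sample plugin.xml content is hypothetical, and the element names `autostart` and `autoclose` reflect my understanding of Sigil's plugin.xml format rather than anything confirmed in this thread beyond the "autoclose value" mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down plugin.xml used only for this illustration.
SAMPLE_PLUGIN_XML = """<?xml version="1.0" encoding="UTF-8"?>
<plugin>
  <name>ExamplePlugin</name>
  <type>edit</type>
  <autostart>true</autostart>
  <autoclose>true</autoclose>
</plugin>
"""


def runner_autocloses(plugin_xml_text):
    """Return True when plugin.xml asks the Plugin Runner to close itself."""
    root = ET.fromstring(plugin_xml_text.encode("utf-8"))
    value = root.findtext("autoclose") or ""
    return value.strip().lower() == "true"
```

A plugin.xml without the element is treated here as "stay open", which matches the behaviour described above.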
Old 03-21-2025, 10:51 AM   #10
KevinH
And if anyone is interested, we have added an xhtml pretty routine to our sigil_bs4 fork.

Code:
   def prettyprint_xhtml(self, indent_level=0, eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                          formatter="minimal", indent_chars=" ")
So it would be relatively easy to "pretty print" serialize an xhtml file once parsed in sigil_bs4, if desired in the plugin. Otherwise, running prettify inside Sigil globally is easy too after the plugin completes (or use automate).
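
A plugin could call the routine through a small wrapper like the one below. This is a cautious sketch: the wrapper name `serialize_xhtml` is hypothetical, the keyword arguments are taken from the signature quoted above, the two-space indent is an arbitrary choice, and the fallback to `str()` for objects without the method is an assumption.

```python
def serialize_xhtml(soup, pretty=True):
    """Serialize a parsed document, pretty-printing when the method is available.

    soup is expected to be a document parsed with sigil_bs4; anything else
    falls back to plain str().
    """
    if pretty and hasattr(soup, "prettyprint_xhtml"):
        return soup.prettyprint_xhtml(indent_level=0, formatter="minimal",
                                      indent_chars="  ")
    return str(soup)
```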
Old 03-21-2025, 12:39 PM   #11
philja
Quote:
Originally Posted by KevinH View Post
... Otherwise running prettify inside Sigil globally is easy too after the plugin completes (or use automate).
True.

I tried to clean up my default and French dictionaries. I ran spellcheck and, in the French half of the xhtml file, added every word underlined by the spellcheck to the French user dictionary. When spellcheck no longer underlined any word, I looked at the word list for the French user dictionary (Preferences) and the words appeared there ok.

Then I ran the plugin but it failed to merge all the <span>s. Its merge action gets interrupted by four types of occurrence (so far):
1. the occasional word which is not identified as French and not identified as misspelt by spellchecker
2. words which have the same spelling in en and fr (although the meaning is different). This includes number digits.
3. words within <i> tags
4. words within <b> tags.

There's not much that can be done for words which might be in either language, but it takes too much cleaning up for variants such as plurals, words starting with an upper case letter, and occasions where a letter is replaced with ' such as d'exercices.
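
Some of those variants could in principle be handled before the dictionary lookup. The sketch below is a suggestion, not something the plugin is known to do: the names `candidate_forms` and `in_dictionary` and the list of French elision prefixes are assumptions, and plurals are deliberately left out.

```python
import re

# Hypothetical list of French elided prefixes such as d', l', qu'.
ELISION_RE = re.compile(r"^(?:[cdjlmnst]|qu|jusqu|lorsqu|puisqu)['’]", re.IGNORECASE)


def candidate_forms(token):
    """Return lookup keys for a token, from most to least specific."""
    forms = [token, token.lower()]
    stripped = ELISION_RE.sub("", token)
    if stripped != token:
        forms += [stripped, stripped.lower()]
    # Drop duplicates while keeping the order.
    return list(dict.fromkeys(forms))


def in_dictionary(token, dictionary):
    """True if any normalized form of the token is a dictionary word."""
    return any(form in dictionary for form in candidate_forms(token))
```

With this, a user dictionary containing only "exercices" and "bonjour" would also cover "d'exercices" and "Bonjour". Words inside <i> or <b> tags are a markup problem and need handling at the parser level instead.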

For my use case, where I'm trying to improve usability with conversion to epub3, it's much simpler to wrap the entire half-chapter in a <section> with language specifiers included in the section tag.
Old 03-23-2025, 05:20 AM   #12
philja
Although I said previously that my treatment of spellcheck had been poor over the years, I did not realise how deficient I was. I found an interesting and informative thread about spellcheck which talked about scowls etc.

I persisted in trying to get the plugin to complete the <span> merges but ended in thinking that somehow it was prevented by the spellcheck system.

Whenever I invoke spellcheck in Sigil, it produces a list of words which it doubts but which it lists as EN (probably because the base language of the epub is listed as EN). I go through this list and add the words which are in fact FR to my FR user dictionary and those which are EN to my EN user dictionary.

I save everything, but the next invocation of spellcheck produces the same list of words. I check in Sigil's preferences for the contents of the user dictionaries and the words were in fact present there. The 'add to dictionary' had worked.

When I invoke spellcheck and click on a French word in its list, it takes me to the instance in code view and that is inevitably in a break between the plugin's <span> merges.

It looks like spellcheck ignores the user dictionaries and this is one of the causes of the plugin's failures to merge all spans.
Old 03-23-2025, 09:06 AM   #13
KevinH
No, spellcheck does not ignore user dictionaries that are properly specified, but it does follow the lang and xml:lang attributes religiously to determine which dictionary to check a given word against.

But in this plugin's case, there are no xml:lang or lang attributes on span tags on foreign words so it will default to either the xml:lang or lang attributes on the html tag or if those are not present, the first dc:language metadata tag in the opf.
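
The fallback order just described can be written down as a tiny function. This is a simplified sketch, not Sigil's code: the name `effective_language` is hypothetical, attributes are passed in as plain dicts, checking `xml:lang` before `lang` is an assumption, and elements between the span and the html tag are ignored.

```python
def effective_language(span_attrs, html_attrs, dc_languages):
    """Pick the language used to choose a dictionary for a word.

    span_attrs / html_attrs: attribute dicts of the enclosing span and the html tag.
    dc_languages: dc:language values from the OPF, in document order.
    """
    for attrs in (span_attrs, html_attrs):
        for name in ("xml:lang", "lang"):
            value = attrs.get(name)
            if value:
                return value
    return dc_languages[0] if dc_languages else None
```

So a French word without its own span in a book whose html tag says `lang="en"` is looked up in the English dictionaries, which is why it keeps showing up as misspelt.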

So it is a bit of a catch-22 here. You need the lang attributes to know what dictionary to look a word up in but you need spellcheck to determine if a word is potentially a foreign word or just an incorrectly spelled one.

Detecting single possible foreign words is possible with only one dictionary and you can generate a word list of probable foreign words. But until spans are added and merged and xml:lang/lang attributes for the proper foreign language added, there is no way to tell which dictionary to look up a word in.

I think the best approach is multi-pass. First to use a single dictionary to create a list of single foreign words, wrap those in span tags and add proper lang and xml:lang attributes to each.

Next pass is to use the find foreign words variant of this plugin, or just search for xml:lang and visually/manually fix any intervening adjacent words that were not properly detected to add the proper span and lang info.

Final pass, merge adjacent spans with matching xml:lang attributes.

Plugins can in fact use hunspell spellchecking directly, and you could inside a plugin force lookup a word to see if it exists in both dictionaries and if adjacent to an existing span, add it.

Then the entire process would be done inside the plugin.
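
The "look up the neighbours" idea might look like this in miniature. It is a sketch under loud assumptions: plain Python sets stand in for hunspell dictionaries, the text is already split into `(word, lang)` tokens, and the function name `absorb_neighbours` is made up for the example.

```python
def absorb_neighbours(tokens, dictionaries):
    """Tag untagged words that sit next to a tagged word of the same language.

    tokens: list of (word, lang_or_None) in reading order.
    dictionaries: {lang: set of lower-case words}, a stand-in for real lookups.
    """
    tagged = [list(token) for token in tokens]
    changed = True
    while changed:  # repeat so a tag can spread across several words
        changed = False
        for i, (word, lang) in enumerate(tagged):
            if lang is not None:
                continue
            for j in (i - 1, i + 1):
                if not 0 <= j < len(tagged):
                    continue
                neighbour_lang = tagged[j][1]
                if neighbour_lang and word.lower() in dictionaries.get(neighbour_lang, ()):
                    tagged[i][1] = neighbour_lang
                    changed = True
                    break
    return [tuple(token) for token in tagged]
```

After this pass, the span merge only has to join runs of tokens that share the same language, so ambiguous words like "la" get resolved by their context rather than by the dictionary alone.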

Last edited by KevinH; 03-23-2025 at 09:12 AM.
Old 03-23-2025, 05:39 PM   #14
Haudek
Thank you for your comments, most of which I obviously agree with.
All cases described are familiar to me and it is supposed to work just like that.
Of course, I still have some ideas in my head that may (or may not) improve the performance of the merging function.

As for the lack of feedback to the user "what has been done" I have already changed that and in beta4 (0.0.94) it should be better.

The ideas submitted are very interesting and I will certainly consider some of them.

Last edited by Haudek; 03-23-2025 at 05:42 PM.