08-02-2020, 01:26 PM | #31 |
Groupie
Posts: 171
Karma: 40000
Join Date: Oct 2013
Device: kindle
|
Someone told me to plead my case here
DISCLAIMER: Haven't read the whole thread. I often edit ebooks that feature many foreign words whose language is not marked in the code (mostly because said ebooks come from OCR, and also because those words are scattered throughout the book in a very disorderly fashion). I would really love to be able to spellcheck against the sum of varying numbers of dictionaries (i.e. having the spellcheck list only the words that are not present in any of those dictionaries). As things stand now, when correcting a long book (typically university textbooks on humanities subjects), I find myself scrolling through a list of thousands of words, many of which are in some language other than the one the text is actually written in, and 99% of which are false positives. |
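A minimal sketch of the request above, assuming each dictionary can be treated as a plain set of known words (real Hunspell dictionaries also apply affix rules, which this ignores). The function name `unknown_in_all` and the toy word sets are hypothetical and only there for illustration:

```python
def unknown_in_all(words, dictionaries):
    """Return the words that no dictionary recognizes, in first-seen order.

    `dictionaries` is a list of sets of lowercase words. Treating a
    dictionary as a set is an assumption of this sketch, not how Sigil
    or Hunspell store them.
    """
    seen = set()
    result = []
    for word in words:
        key = word.lower()
        if key in seen:
            continue  # report each distinct word only once
        seen.add(key)
        if not any(key in d for d in dictionaries):
            result.append(word)
    return result


# Toy dictionaries (hypothetical content)
english = {"the", "author", "calls", "this", "a"}
latin = {"ad", "hoc"}
french = {"raison"}

text = "The author calls this a raison ad hoc soluton".split()
suspects = unknown_in_all(text, [english, latin, french])
```

Only the word that is missing from every dictionary ends up in `suspects`, which is the kind of short list the post is asking for.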
08-02-2020, 07:53 PM | #32 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Post #11 in "Export list of words in spellcheck", which also points to how I use (Calibre's) Spellcheck Lists + Regex: Post #29 in "Is there a way to use the selection in a Saved Search?" I've used that method successfully on journal articles + text from game files (millions of words). For one game, I even hackishly assigned each character a different lang, then used Calibre to give me a breakdown of all words spoken per character. This allowed me to normalize the translation. (For example, one character always said "dinnae" instead of "didn't". The word list method made sure to catch any strays.) For games, it also allowed me to easily catch any made-up fantasy words, since they didn't appear in either the US or UK dictionaries. Last edited by Tex2002ans; 08-02-2020 at 07:57 PM. |
|
08-05-2020, 03:13 PM | #33 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
Okay, just as a proof of concept, I have taken Doitsu's foreign word plugin (Thank you Doitsu!), written an HTMLLangTextParser class (htmllangtextparser.py, based on my quickparser - thank you varlog!), and created a text word parser (textwordparser.py) that tokenizes text into words using the current language dictionary's WORDCHARS, just like Sigil does now, to create a SpellMLDemo validation plugin.
NOTE: THIS PLUGIN ONLY WORKS FOR BUILDS FROM SIGIL MASTER AS OF TODAY
If this seems to work, then I will use it as a model for some of the C++ code inside Sigil itself, but replace the HTMLLangTextParser with something based on GumboParser's Node tree (i.e. a DOM-based, real HTML parser approach), which will be more robust to parsing errors and well-formedness errors. I will extract the text parsing code that uses dictionary wordchars into its own class, and change the current SpellCheck.cpp class to have multiple dictionaries open at the same time.
I have attached it in case anyone is interested in testing it or simply looking at the code.
KevinH
ps: I added a new plugin hunspell interface that handles multiple dictionaries in a much smarter way. See the attached pluginhunspellml.py if you are interested in plugin-based spellchecking. Note, pluginhunspellml.py will be an additional plugin interface for hunspell. It will augment, not replace, the current pluginhunspell.py so that full backwards compatibility is maintained for plugins that use the now older plugin interface.
Last edited by KevinH; 08-05-2020 at 03:26 PM. Reason: added a new plugin hunspell interface that handles multiple languages in a much better fashion |
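For readers who want a feel for the idea without opening the attachment, here is a rough sketch of a lang-aware word collector. It is NOT the attached htmllangtextparser.py or textwordparser.py: it uses Python's standard `html.parser` as a stand-in, and the class name `LangWordCollector`, the `wordchars` parameter and the void-tag list are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Elements that never get an end tag (illustrative subset, an assumption)
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class LangWordCollector(HTMLParser):
    """Collect (lang, word) pairs, inheriting lang from enclosing elements."""

    def __init__(self, default_lang="en", wordchars="'-"):
        super().__init__(convert_charrefs=True)
        self.lang_stack = [default_lang]
        self.wordchars = wordchars
        # \w plus the extra characters a dictionary treats as part of a word
        # (note that \w also matches digits and underscores)
        self.word_re = re.compile(r"[\w" + re.escape(wordchars) + r"]+")
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # xml:lang wins over lang; otherwise inherit from the parent element
        lang = attrs.get("xml:lang") or attrs.get("lang") or self.lang_stack[-1]
        self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        for match in self.word_re.finditer(data):
            # drop tokens that consist only of the extra word characters
            token = match.group().strip(self.wordchars)
            if token:
                self.words.append((self.lang_stack[-1], token))


parser = LangWordCollector(default_lang="en")
parser.feed('<p>I said <span lang="de">Guten Morgen</span> to her.</p>')
pairs = parser.words
```

Each word comes back paired with the language in effect where it appeared, so a spellchecker could pick the matching dictionary per word. A simple stack like this breaks on badly nested markup, which is presumably why a real DOM-based parser is the more robust choice.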
08-05-2020, 03:39 PM | #34 |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
I quickly tested the plugin with the latest Windows AppVeyor build and it appears to be working fine. The only issue that I noticed is that the offset is off and increases with each occurrence.
Most likely this is because, in the Windows editor, lines are terminated by \r\n vs. \n in the macOS/Linux versions. |
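A small illustration of why the offset would drift by one more character per line, assuming the plugin computes offsets on text with \n line ends while the editor holds the same text with \r\n. The helper name `lf_to_crlf_offset` is made up for this sketch:

```python
def lf_to_crlf_offset(lf_text, lf_offset):
    """Map an offset in LF-terminated text to the same spot in its CRLF form.

    Every line break before the offset costs one extra character (the CR).
    """
    return lf_offset + lf_text.count("\n", 0, lf_offset)


lf_text = "eins\nzwei\ndrei\n"
crlf_text = lf_text.replace("\n", "\r\n")

# Offsets as a parser would report them on the LF version
offsets = [lf_text.index(w) for w in ("eins", "zwei", "drei")]
# The same positions in the CRLF version
mapped = [lf_to_crlf_offset(lf_text, o) for o in offsets]
# The error grows by one for each preceding line
drift = [m - o for m, o in zip(mapped, offsets)]
```

The drift is zero on the first line and grows by one per line break, which would match an offset that "increases with each occurrence".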
08-05-2020, 04:11 PM | #35 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
Yes, I will need to go back to line and col info and add the offset corrections for Windows, as only on Windows is the file stored with \r\n. I was getting the offset directly from the data file. So we must be converting \r\n to \n someplace in the plugin readfile routine.
Nice catch! Addendum: Actually, Sigil always strips out the \r\n line endings before saving the file, so by the time the plugin gets hold of it the file has had the line endings converted to just \n. They get added back when reading in the file in Sigil on Windows. Last edited by KevinH; 08-05-2020 at 04:32 PM. |
08-05-2020, 04:23 PM | #36 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Would it work to put back the proper windows line endings before parsing the file? Does this change to plugin.py work? Code:
# process file list
for (man_id, href) in file_list:
    bookpath = bk.id_to_bookpath(man_id)
    print('Processing ', bookpath, '...')
    data = bk.readfile(man_id)
    # add back in proper windows line ends
    if sys.platform.startswith('win'):
        data = data.replace("\n", "\r\n")
|
08-05-2020, 05:27 PM | #37 | |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
IMHO, it doesn't really matter, because the actual implementation will be in C++ and not as a validation plugin and the most important part of the code--multi-language spell checking--works as designed. |
|
08-05-2020, 06:57 PM | #38 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
Okay, I went the wrong way. A QIODevice opened in Text mode will always write the platform-appropriate line endings. We use a QTextStream with a QIODevice set to Text in our Utility::WriteUnicodeTextFile.
So on Windows, in order to make offsets that work internally to Qt, I must replace all \r\n with just \n before calling parse. I still do not know if automatic line conversions are done when unpacking the zip (.epub) depending on platform. I will look to see how minizip and zlib handle that. This is important for Checkpointing and diffs as well. |
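The "normalize first, then parse" direction could look roughly like this in plugin-side Python. This is only a sketch: `normalize_newlines`, `line_starts` and `offset_to_line_col` are hypothetical helper names, not functions from Sigil or its plugin interface.

```python
import bisect


def normalize_newlines(text):
    """Collapse CRLF line endings to LF so offsets agree across platforms."""
    return text.replace("\r\n", "\n")


def line_starts(text):
    """Return the offsets at which each line of LF-terminated text begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def offset_to_line_col(starts, offset):
    """Return a 1-based (line, column) pair for a 0-based offset."""
    line = bisect.bisect_right(starts, offset) - 1
    return line + 1, offset - starts[line] + 1


raw = "eins\r\nzwei\r\ndrei\r\n"      # as it might sit on disk on Windows
text = normalize_newlines(raw)          # what gets handed to the parser
starts = line_starts(text)
position = offset_to_line_col(starts, text.index("drei"))
```

Reporting line and column alongside the offset has a side benefit: they stay the same whichever line-ending convention the file uses, so a mismatch is easier to spot.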
08-13-2020, 12:04 PM | #39 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
What do people think of the following as a multi-language interface to SpellCheck?
Pluses
--------
- it can be filtered very easily by language.
- every "word" that gets passed around has its language code prepended by the parser
- a user can easily create a language-specific word to update or change or find by using the prefix
- in other words ... en: baden (a spelling mistake) is considered to be different from de: baden (also a spelling mistake in German as it should be capitalized) and you can find and update each one separately.
- language code prefixes could be suppressed when all words come from a single primary language (so that it would look exactly as it does now for single-language ebooks being spellchecked).
Please check out the screenshots attached (ignore the inline squiggly bits for now) and let me know what you think about this type of approach to support multi-language spell checking in a GUI. |
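To make the prefix idea concrete, here is a sketch of how prefixed keys keep same-spelled words apart and how the prefix could be dropped for single-language books. The names `make_key` and `display_keys` are invented for this example and say nothing about how Sigil implements it.

```python
from collections import Counter


def make_key(lang, word):
    """Build the 'lang: word' form used to tell same-spelled words apart."""
    return f"{lang}: {word}"


def display_keys(pairs):
    """Count (lang, word) pairs; drop the prefix if only one language occurs."""
    counts = Counter(pairs)
    langs = {lang for lang, _ in counts}
    if len(langs) == 1:
        # single-language book: looks exactly like an unprefixed word list
        return {word: n for (_, word), n in counts.items()}
    return {make_key(lang, word): n for (lang, word), n in counts.items()}


mixed = display_keys([("en", "baden"), ("de", "baden"), ("de", "baden")])
single = display_keys([("en", "word"), ("en", "word")])
```

In the mixed case "en: baden" and "de: baden" are counted separately, while the single-language case falls back to plain words.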
08-13-2020, 12:14 PM | #40 | |
Bibliophagist
Posts: 34,236
Karma: 144198474
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
I wouldn't want to take the bet that someone is going to complain that they should not have to use language tags. |
|
08-13-2020, 12:31 PM | #41 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
Luckily if a word does not have a language code prefix, I can assume it is in the book's primary language and automagically prepend a language code.
Another plus of this approach is that the user's dictionary and Ignore lists will now work across multiple languages and do the right thing. |
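A sketch of the "no prefix means primary language" fallback, assuming user dictionary entries are stored as plain strings in the "lang: word" form shown earlier in the thread. The regex and the name `with_lang_prefix` are assumptions made for this illustration.

```python
import re

# Two or three letters, optional subtags, then a colon (simplified pattern;
# an assumption of this sketch, not a full language-tag parser)
PREFIX_RE = re.compile(r"^([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*):\s*(.+)$")


def with_lang_prefix(entry, primary_lang):
    """Return (lang, word), defaulting to primary_lang when no prefix is given."""
    entry = entry.strip()
    match = PREFIX_RE.match(entry)
    if match:
        return match.group(1), match.group(2)
    return primary_lang, entry


user_words = ["de: Baden", "en-US: colour", "dinnae"]
resolved = [with_lang_prefix(w, "en") for w in user_words]
```

Entries that already carry a prefix keep it, and the bare entry is filed under the book's primary language.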
08-13-2020, 02:03 PM | #42 | ||
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
Quote:
Ich (= I) is only a spelling mistake if it's not the first word of a sentence. im (= in the) and ist (= is) are not spelling mistakes, unless they're the first word of a sentence. However, AFAIK, Hunspell doesn't take punctuation into account and will ignore capitalization, unless a word only has a capitalized entry. You can use the following simple German sentences without spelling mistakes as test cases: Code:
<span xml:lang="de" lang="de">Morgen gehe ich im See baden.</span>
<span xml:lang="de" lang="de">Ich gehe morgen im See baden.</span>
<span xml:lang="de" lang="de">Das Baden ist hier verboten.</span>
|
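The capitalization behaviour described above can be modelled roughly as follows. This is a simplified stand-in written for this thread, not Hunspell's actual algorithm, and the toy German word set and the function names `accepts` and `misspelled` are assumptions.

```python
import re


def accepts(word, dictionary):
    """Simplified case rule: lowercase entries match any capitalization,
    capitalized entries only match capitalized (or all-caps) input."""
    if word in dictionary:
        return True
    lower = word.lower()
    if word != lower and lower in dictionary:
        return True   # "Ich" is fine when "ich" is an entry
    if word.isupper() and word.capitalize() in dictionary:
        return True   # "SEE" is fine when "See" is an entry
    return False


def misspelled(sentence, dictionary):
    """Return the words of a sentence that the toy rule rejects."""
    return [w for w in re.findall(r"\w+", sentence) if not accepts(w, dictionary)]


# Toy dictionary (hypothetical content); "See" has only a capitalized entry
de = {"morgen", "gehe", "ich", "im", "See", "baden", "Baden",
      "das", "ist", "hier", "verboten"}

sentences = [
    "Morgen gehe ich im See baden.",
    "Ich gehe morgen im See baden.",
    "Das Baden ist hier verboten.",
]
results = [misspelled(s, de) for s in sentences]
lowercase_noun = accepts("see", de)
```

Under this rule none of the three test sentences produces a hit, while a lowercase "see" is rejected because the toy dictionary only has the capitalized noun.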
||
08-13-2020, 02:13 PM | #43 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
There is no new version. This is just to see if the language parser I wrote in C++ works and to get feedback on the potential GUI.
In other words: are the language code prefixes correct given the lang attributes? That was my main concern that I was trying to test. The only spelling dictionary being used here is the English one, as I have not modified SpellCheck to handle multiple dictionaries yet. That is still coming. Last edited by KevinH; 08-13-2020 at 02:15 PM. |
08-13-2020, 03:12 PM | #44 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
It would also make the most important column too busy and hard to read. One of the key advantages of Spellcheck Lists is to rapidly skim words at a glance. And why not a Language column, similar to Calibre? Along with human-readable language names: Proposed: Code:
Word          | Count  | Misspelled
______________|________|_____________
en-us: word   | 1      | No
de: word      | 1      | Yes
Code:
Word          | Count  | Language     | Misspelled
______________|________|______________|___________
word          | 1      | English (US) | No
word          | 1      | German       | Yes
Code:
Word          | Count  | Misspelled
______________|________|_____________
en-us: word   | 1      | No
de: word      | 1      | Yes
Quote:
Again, looking at how Calibre does it: If you search, you can already sort by the Language column. Search: the Method 1. Sort by Word: Spoiler:
Helpful for catching bad lang markup, OCR errors, etc. Method 2. Sort by Language: Spoiler:
Easily split all English or German words. Method 2B. Sort Language again: Spoiler:
(Maybe a search for "de: the" will only display the German one. But when you can sort by Language... you're one/two clicks away from achieving similar results.) Method 3. Uncheck the "Show All Words" box: Spoiler:
See all misspelled words. Can Ignore or Add to Dictionary. Quote:
If you pick: en: baden "Add to Dictionary" = Default or English. If you pick: de: baden "Add to Dictionary" = Default or German. * * * What about Edit > Preferences > Spellcheck Dictionaries? Will this get a "Language" column as well? Or some way to add words on a per language basis? (Selecting a language in "Dictionary" dropdown will update the User Dictionary Word List?) (I personally don't add words to my user dictionaries, so maybe someone else might have more insights here.) |
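A sketch of the separate Language column idea, assuming the word list arrives as "lang: word" strings and that a lookup table for human-readable names is available. `LANGUAGE_NAMES`, `Row` and `to_rows` are made-up names, and the two-entry name table is only illustrative.

```python
from typing import NamedTuple

# Illustrative subset; a real table would cover every installed dictionary
LANGUAGE_NAMES = {"en-us": "English (US)", "de": "German"}


class Row(NamedTuple):
    word: str
    lang: str
    count: int
    misspelled: bool


def to_rows(entries):
    """Turn ('lang: word', count, misspelled) tuples into table rows."""
    rows = []
    for key, count, misspelled in entries:
        code, _, word = key.partition(": ")
        # fall back to the raw code when no readable name is known
        rows.append(Row(word, LANGUAGE_NAMES.get(code, code), count, misspelled))
    return rows


rows = to_rows([("en-us: word", 1, False), ("de: word", 1, True)])
by_language = sorted(rows, key=lambda r: (r.lang, r.word))   # "Sort by Language"
only_misspelled = [r for r in rows if r.misspelled]          # "Show All Words" off
```

With the language held in its own field, sorting by language and filtering to misspelled words are both one-liners, and the Word column stays free of prefixes.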
|||
08-13-2020, 03:26 PM | #45 |
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
|
If Users think language codes are too technical, then how on earth would they have any idea about adding lang or xml:lang attributes?
Ditto for dc:language? At some point, you have to assume some technical knowledge by the user, don't you? As for alignment, as the image shows, it is sorted by lang code first and then by word, because although the sequence of chars may be identical, they are truly different words if in different languages. So no whitespace alignment is needed. And I really have no interest in copying GUI elements from Calibre. I am trying to find something simpler and unique, not just duplicate what anyone else has done. Last edited by KevinH; 08-13-2020 at 04:14 PM. |