01-24-2020, 08:16 AM   #21
geek1011
Wizard
Posts: 2,808
Karma: 7423683
Join Date: May 2016
Location: Ontario, Canada
Device: Kobo Mini, Aura Edition 2 v1, Clara HD
Quote:
Originally Posted by rtiangha
OK, my Python is weak, but it would be nice to fix this if possible, especially if more is known about the file format now than when penelope was originally developed. So while I wait for your notes, I'm going to try to reason this out without trying to code trace everything (which is something I don't really want to do, lol) and maybe even someone with the skills might be able to fix things if I don't get around to it or can't figure it out in time.



Definitely prefix_kobo.py.



I saw your reply to skybook in the other thread. So is it just enough to change this portion:

Code:
        if is_ok:
            prefix = headword[0:length]
to this:

Code:
        if is_ok:
            prefix = headword[0:2]
Or is there more to it?



So clearly, this part:

Code:
    def is_allowed(character):
        # all non-ascii (x > 127) are ok
        # all ASCII lowercase letters (97 <= x <= 122) are ok
        # everything else is not ok
        try:
            code = ord(character)
            return (code > 127) or ((code >= 97) and (code <= 122))
        except:
            pass
        return True
needs to be rewritten. I think x > 127 was intended to take care of things like Japanese, Chinese, Arabic, etc. characters, but I'm guessing that there's stuff in there that needs to be ignored too (like emojis and other symbols)? Or are there other things to be concerned about?



That part I'm not sure about, but is it related to the



issue in the other thread? As in (if I'm reading the code right; remember, weak in Python) the entry gets treated as a special case because of the space (because SPACE has an ASCII/Unicode value of 32, which is outside of the range that the is_allowed() function is looking at) and so replacing all spaces with 'a' (or really, any character in the allowable range?) is needed so the code as currently written treats it as a headword with no spaces (which it is able to process correctly) instead? Or did I get that completely wrong?
Yes, [:2] is correct.

And to check if it is a Unicode letter, use isalpha.

No, that last part is incorrect, as only the first two characters are considered. But, something like "a" would go into "aa.html".

You might also need to change the order of some of the checks to match. I'll give more details once I finish dictutil.
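
For illustration, here is a rough sketch of how those changes could fit together. It is not the actual penelope or dictutil code: FALLBACK_PREFIX is a hypothetical placeholder for the catch-all file, and lowercasing the headword, padding with "a", and the order of the checks are all assumptions based only on what is said above.

```python
# Hypothetical placeholder for the catch-all file used for non-letter prefixes;
# the real name is not given in this thread.
FALLBACK_PREFIX = "SPECIAL"


def is_allowed(character):
    # str.isalpha() is True for Unicode letters (Latin, kana, kanji, ...)
    # and False for spaces, digits, punctuation, and emoji.
    return character.isalpha()


def get_prefix(headword):
    # Only the first two characters are considered.
    # Lowercasing here is an assumption, not something confirmed above.
    prefix = headword.lower()[:2]
    if not prefix or not all(is_allowed(c) for c in prefix):
        return FALLBACK_PREFIX
    # A one-letter headword such as "a" is padded so it lands in "aa.html"
    # (assumption: the padding character is always "a").
    return prefix.ljust(2, "a")


print(get_prefix("a"))
print(get_prefix("Hello"))
```

The order of the checks here (validate first, then pad) is a guess and may need to change once the details are confirmed.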

Quote:
Also, does dictutil handle Kobo dictionary synonyms properly? Penelope doesn't even touch it, and stuff like the Japanese dictionaries rely heavily on synonyms because the language uses three alphabets (Kanji, hiragana and katakana) and a complex word can be spelt with any combination (ex. All kanji, all hiragana, all katakana, a mix of kanji and hiragana, or even in Latin letters via romaji) and so it leans heavily on synonyms so that there aren't separate/duplicate entries for each different way to spell a word.
Yep, that's one of the best parts! So, there are a few things I figured out. Firstly, if more than one entry matches, Kobo will merge them in the order they appear in the *same* file (i.e. not files for other prefixes) (I need to check this further). Secondly, variants MUST be trimmed and lowercased in the HTML files, regardless of the casing in the index or in the actual word. This is because unlike usual headword matches (original, lowercased, uppercased, and lowercased with first letter uppercased; all by prefix), variants are only matched against the lowercased version. Thirdly, the entire entry for variants must be duplicated into all matching prefixes (this is a bug in the official dictionaries too).
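
As a small sketch of the casing rules just described (the function names are made up for illustration, and this only models the description above, not Kobo's or dictutil's actual code):

```python
def normalize_variant(variant):
    # Variants are only matched against the lowercased word,
    # so they are stored trimmed and lowercased.
    return variant.strip().lower()


def headword_candidates(word):
    # The four forms a headword lookup is described as trying:
    # original, lowercased, uppercased, lowercased with first letter uppercased.
    lower = word.lower()
    forms = [word, lower, word.upper(), lower[:1].upper() + lower[1:]]
    # Drop duplicates while keeping the order.
    return list(dict.fromkeys(forms))


print(normalize_variant("  Tokyo "))
print(headword_candidates("McGill"))
```

A variant written as "Tokyo" in the HTML would therefore never match, while a headword written as "Tokyo" still could.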

Quote:
While many of the stock Kobo dictionaries are encrypted, the Japanese ones currently aren't, if you'd like to take a peek at how they use synonyms (I actually managed to convert the Progressive EN-JA kobo dictionary to Stardict XML format, which considering I don't know regular expressions or XSLT at all (I did it with a lot of find/replace in Notepad++, lol), I was amazed that it even worked and immediately sought to merge it with entries from the open-source JMDict project, albeit an older version (using --flatten-synonyms, of course); in fact, part of my struggle was in creating an updated JMDict version with 2019 data because my Kobo wouldn't recognize what I made no matter what I did, which in hindsight, might probably be due to " " appearing in some headwords. So yeah, it'd be nice to fix penelope if possible).
I've checked against all the Kobo dictionaries, but don't ask me how. It's not really feasible to add synonyms to Penelope without either hacky stuff (putting them into the definition field), or a major restructure.