Quote:
Originally Posted by geek1011
It's a problem in the code itself (I figured out how it works the same way I figure out patches (i.e. disassembly), unlike how the dictionary stuff was originally done many years ago).
|
OK, my Python is weak, but it would be nice to fix this if possible, especially if more is known about the file format now than when penelope was originally developed. So while I wait for your notes, I'm going to try to reason this out without code-tracing everything (something I don't really want to do, lol), and maybe someone with the skills can fix things if I don't get around to it or can't figure it out in time.
Quote:
To fix it, you need to change kobo_prefix.py (or prefix_kobo.py, I don't remember which)
|
Definitely prefix_kobo.py.
Quote:
to consider only the first 2 chars when determining whether to use 11.html,
|
I saw your reply to skybook in the other thread. So is it enough just to change this portion:
Code:
if is_ok:
    prefix = headword[0:length]
to this:
Code:
if is_ok:
    prefix = headword[0:2]
Or is there more to it?
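If I'm reading geek1011's hint right, a minimal sketch of the capped-prefix logic might look like this (my own hypothetical helpers, not penelope's actual code, and I'm not handling short/one-character headwords here):

```python
# Hypothetical sketch: always cap the prefix at the first two
# characters, falling back to the "11" bucket Kobo uses when those
# characters aren't allowed.
def is_allowed(character):
    # Stand-in check: any Unicode letter counts as allowed.
    return character.isalpha()

def get_prefix(headword):
    prefix = headword.lower()[0:2]
    if prefix and all(is_allowed(c) for c in prefix):
        return prefix
    return "11"
```

With this, get_prefix("copper noble") would give "co" (since only the first two characters are checked), while get_prefix("1984") would fall back to "11".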
Quote:
you need to consider if a character is in the Unicode letters class rather than just ASCII,
|
So clearly, this part:
Code:
def is_allowed(character):
    # all non-ASCII (x > 127) are ok
    # all ASCII lowercase letters (97 <= x <= 122) are ok
    # everything else is not ok
    try:
        code = ord(character)
        return (code > 127) or ((code >= 97) and (code <= 122))
    except:
        pass
    return True
needs to be rewritten. I think x > 127 was intended to take care of things like Japanese, Chinese, and Arabic characters, but I'm guessing there's stuff in there that needs to be ignored too (like emojis and other symbols)? Or are there other things to be concerned about?
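For the "Unicode letters class" part, Python's unicodedata module can test the general category directly; a sketch of what the rewrite might look like (my naming, not a drop-in for penelope):

```python
import unicodedata

def is_allowed(character):
    # Allowed iff the character is in one of the Unicode letter
    # categories (Lu, Ll, Lt, Lm, Lo).  Digits, punctuation, spaces,
    # emoji, and other symbols all fall outside "L*", so they'd be
    # excluded here.
    return unicodedata.category(character).startswith("L")
```

So is_allowed("語") would be true, while is_allowed("😀") (category So) and is_allowed(" ") (category Zs) would not.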
Quote:
and you might need to add a .replace(" ", "a").
|
That part I'm not sure about, but is it related to the
Quote:
For entries like "copper noble", those should have gone into files like "co.html" instead of "11.html", but that's a bug in Penelope.
|
issue in the other thread? As in (if I'm reading the code right; remember, weak in Python): the entry gets treated as a special case because of the space (SPACE has an ASCII/Unicode value of 32, which is outside the range the is_allowed() function accepts), so replacing all spaces with 'a' (or really, any character in the allowed range?) makes the code as currently written treat it as a headword with no spaces (which it is able to process correctly)? Or did I get that completely wrong?
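If that reading is right, the .replace would slot in before the prefix is computed; something like this hypothetical normalization step:

```python
def normalize_headword(headword):
    # Map spaces to "a" (any allowed letter would presumably do) so a
    # multi-word headword passes the per-character check and lands in
    # its proper two-letter prefix file instead of the "11.html"
    # fallback.  Hypothetical helper, not penelope's actual code.
    return headword.lower().replace(" ", "a")
```

For example, normalize_headword("copper noble") gives "copperanoble", whose first two characters still yield the "co" prefix.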
Also, does dictutil handle Kobo dictionary synonyms properly? Penelope doesn't touch them at all, and things like the Japanese dictionaries rely heavily on synonyms: Japanese uses three writing systems (kanji, hiragana, and katakana), and a complex word can be spelt with any combination (e.g. all kanji, all hiragana, all katakana, a mix of kanji and hiragana, or even in Latin letters via romaji), so the dictionaries lean on synonyms to avoid separate/duplicate entries for each way of spelling a word.
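For what it's worth, dictutil's dictgen input format does seem to have a variant line type (an "&" line, as I recall from its docs; treat the exact syntax as my assumption), which looks like how the kana spellings could share one entry:

```text
@ 勉強
& べんきょう
& ベンキョウ
study; work hard
```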
While many of the stock Kobo dictionaries are encrypted, the Japanese ones currently aren't, if you'd like to take a peek at how they use synonyms. (I actually managed to convert the Progressive EN-JA Kobo dictionary to Stardict XML format, which, considering I don't know regular expressions or XSLT at all (I did it with a lot of find/replace in Notepad++, lol), I was amazed even worked. I immediately sought to merge it with entries from the open-source JMDict project, albeit an older version (using --flatten-synonyms, of course). In fact, part of my struggle was in creating an updated JMDict version with 2019 data, because my Kobo wouldn't recognize what I made no matter what I did, which in hindsight might well be due to " " appearing in some headwords. So yeah, it'd be nice to fix penelope if possible.)