10-27-2019, 04:51 PM   #9
rtiangha
Evangelist
Posts: 495
Karma: 356531
Join Date: Jul 2016
Location: 'burta, Canada
Device: Kobo Glo HD
I hate to raise a thread from the dead, but I think it should be possible to do this in Penelope. However, I don’t know Python, so I’m hoping someone out there who does can help; I don’t think it would take much modification to add this functionality (basically altering two lines of code).

Currently, Penelope does have a function to read in a zipped Kobo file (so you could pass 'kobo' as an option for the -j switch right now even though it isn't in the documentation), but it only reads in the index because “The read function only acquires the index, as the definition files of the original Kobo dictionaries are obfuscated/encrypted.”

Which is why the read loop explicitly passes an empty string rather than a definition:

Code:
            for pair in trie.items():
                dictionary.add_entry(headword=pair[0], definition=u"")
However, we now know that the entries aren’t encrypted; they’re just gzipped (or at least, that's the case for some of the dictionaries; I haven't tried every single one...yet). You can verify this for yourself by taking any of the .html files, renaming it to .html.gz, and running gunzip on it; the resulting .html file is completely readable!
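For anyone who would rather check this from Python than rename files by hand, here is a minimal sketch of the same gunzip test. The function name read_kobo_html is made up for illustration, and it assumes the .html file has already been extracted from the dictionary zip:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def read_kobo_html(path):
    """Return the decompressed bytes of one extracted .html file,
    or None if the file does not look like gzip data."""
    with open(path, "rb") as f:
        raw = f.read()
    if not raw.startswith(GZIP_MAGIC):
        return None  # not gzip, so gunzip would fail on this one too
    return gzip.decompress(raw)
```

A None result only says the file isn't gzip; it doesn't say what it is instead.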

EDIT: Maybe I spoke too soon. It looks like SOME dictionaries may be encrypted and some may not (my OCD may drive me to make a list when I have time). I might take a look at extending Penelope to be able to process unencrypted dictionaries anyways, because why not? Still need to figure out how Penelope works and learn enough Python/Marisa to figure out where and what to change though.
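Making that list could probably be automated. The sketch below sorts the .html members of a dictionary zip by whether they start with the gzip magic bytes. The name survey_kobo_zip is hypothetical, and a member landing in the "other" list only means it isn't gzip, not that it is definitely encrypted:

```python
import zipfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def survey_kobo_zip(zip_path):
    """Split the .html members of a dictionary zip into two lists:
    names that start with the gzip magic bytes, and names that do not."""
    gzipped, other = [], []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".html"):
                continue  # skip the index and any other non-html members
            with zf.open(name) as member:
                head = member.read(2)
            (gzipped if head == GZIP_MAGIC else other).append(name)
    return gzipped, other
```

Running it over each dicthtml zip and noting which ones come back with a non-empty second list would give a first cut of which dictionaries need more than gunzip.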

Anyway, I think that if we can extract the definition and pass it in instead of that empty string, then Penelope should handle unencrypted Kobo dictionaries the same way it handles the other formats. (I can't tell whether Penelope or Marisa gunzips, or even opens, any of those html files in the first place; if not, that functionality would need to be coded in too.)
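As a rough sketch of what that change to the read loop might look like: the helper below is not Penelope code, and lookup_definition is a hypothetical callable that would still have to be written (the post doesn't establish how a headword maps to its definition inside the html files). Only the add_entry(headword=..., definition=...) call is taken from the existing loop.

```python
def add_entries_with_definitions(items, dictionary, lookup_definition):
    """Mirror the existing read loop, but pass a real definition.

    items             -- (headword, number) pairs, e.g. from trie.items()
    dictionary        -- object with add_entry(headword=..., definition=...)
    lookup_definition -- HYPOTHETICAL callable: headword -> definition or None
    """
    for headword, _key_id in items:
        # If the index is a plain marisa_trie.Trie, the second element of
        # each pair is just the integer ID the trie assigned to the key,
        # so it is ignored here and the headword drives the lookup.
        definition = lookup_definition(headword)
        # Fall back to the current behaviour when nothing is found.
        dictionary.add_entry(headword=headword, definition=definition or u"")
```

Keeping the lookup as a separate function means the loop itself stays a two-line change, and the gunzip/parsing work can be figured out independently.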

Assuming the gzip thing isn't an issue, that's where I'm stuck, though. I would have assumed that pair[1] would hold the definition, but it holds a number instead (then again, I have no idea how tries work, and I find the Marisa tutorial somewhat lacking for my level of understanding). I don't know what to do with that number to extract the definition (use it to look something up in another array, maybe?). I did confirm that whatever string you put in place of the empty string gets written to the html file underneath the headword, so clearly that's where the extracted definition should go. The validation test would be to run something like this:

Code:
penelope -i dicthtml-en-ja.zip -j kobo -f ja -t en -p kobo -o dicthtml-ja-en
and the resultant dicthtml-ja-en.zip file would be exactly the same as the original dicthtml-en-ja.zip file.
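One way to check "exactly the same" without comparing the zip files byte for byte (which can differ on timestamps alone) is to compare member names and member contents. This is only a sketch, and zips_have_same_contents is a made-up name:

```python
import zipfile


def zips_have_same_contents(path_a, path_b):
    """True if both archives hold the same member names and each pair of
    same-named members contains identical bytes."""
    with zipfile.ZipFile(path_a) as a, zipfile.ZipFile(path_b) as b:
        names_a, names_b = sorted(a.namelist()), sorted(b.namelist())
        if names_a != names_b:
            return False
        return all(a.read(name) == b.read(name) for name in names_a)
```

Note that this compares the members as stored. If the html members are gzipped, their gzip headers can carry a timestamp, so two files with identical HTML might still compare unequal; gunzipping both sides before comparing would be the stricter test.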

Last edited by rtiangha; 10-27-2019 at 11:42 PM.