#1 |
Giant Hobbit
Posts: 49
Karma: 487552
Join Date: Aug 2009
Location: Turkey
Device: Kobo: Clara, Mini, Aura HD, Aura 2, Kindle: Paperwhite 1, DX 1
|
Merge .zip Dictionaries
I have followed a guide I found in these forums, and I have successfully created a custom en->tr translation dictionary. It works fine. However, I want to merge this dictionary (let's say its name is dicthtml-eng-tur.zip) with Kobo's official en->en dictionary (copied from the device, named dicthtml.zip), and possibly with 1 or 2 additions like WordNet or something.
I have seen how you can do it before you create the dictionaries, using penelope again. But now I have the .zip files and I have seen no example usage of how you merge .zip dictionaries! Can someone help? |
#2 |
Zealot
Posts: 105
Karma: 5885446
Join Date: Feb 2014
Device: Kobo Glo
|
Correct me if I'm wrong, but isn't the Kobo's official dictionary encrypted?
|
#3 |
Giant Hobbit
Posts: 49
Karma: 487552
Join Date: Aug 2009
Location: Turkey
Device: Kobo: Clara, Mini, Aura HD, Aura 2, Kindle: Paperwhite 1, DX 1
|
Is it? I don't know much about that, sorry. In that case, is it possible to merge the en->tr dictionary with WordNet or another (unencrypted) dictionary and replace Kobo's built-in one?
Last edited by Majorix; 04-02-2014 at 02:36 PM. |
#4 |
Giant Hobbit
Posts: 49
Karma: 487552
Join Date: Aug 2009
Location: Turkey
Device: Kobo: Clara, Mini, Aura HD, Aura 2, Kindle: Paperwhite 1, DX 1
|
Guys, nobody has tried this yet?
|
#5 |
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
|
Someone did this for me with Japanese and Japanese-English dictionaries, one being a Kobo-provided dictionary, the other created by a user on this forum. Memory says it was tshering. I have no idea how he did it, though. So it should be possible.
Edit: I just checked my PMs, and it was indeed tshering who did this for me. Try PMing him and asking how he did it! |
#6 | ||
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
Quote:
Majorix, since you use penelope, the best way for you would be as follows (this is how I understand it from the penelope homepage): convert all the dictionaries you want to combine into a single format that penelope understands, and then merge them with penelope.
Quote:
Last edited by tshering; 04-04-2014 at 01:54 PM. |
||
#7 |
Giant Hobbit
Posts: 49
Karma: 487552
Join Date: Aug 2009
Location: Turkey
Device: Kobo: Clara, Mini, Aura HD, Aura 2, Kindle: Paperwhite 1, DX 1
|
@tshering:
I have successfully done the merge now. But there is a problem: the two dictionaries weigh 6.8 MB and 2.1 MB respectively, yet the merged dictionary is only about 1.1 MB. How come? Have I done something wrong? |
#8 | |
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
Congratulations!
Quote:
Edit: I have now tried penelope for the first time. When converting two Kobo dictionaries, the definitions get lost. For instance,
Code:
<w><a name="Alpha"/><div><b>Alpha</b><br/>This is the definition for Alpha</div></w>
becomes
Code:
<html><w><a name="Alpha"/><div><b>Alpha</b><br/></div></w>
Last edited by tshering; 04-04-2014 at 05:25 PM. |
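A quick way to check whether a converted file has lost its definitions, as in the before/after sample above, is to scan for entries that have nothing after the `<br/>`. This is only a sketch: the regular expression is modelled on the two sample lines quoted in this post, `headwords_without_definition` is a made-up helper name, and real Kobo files may be laid out differently.

```python
import re

# Pattern modelled on the two sample entries quoted in the post above.
# Assumption: every entry looks like
#   <w><a name="HEAD"/><div><b>HEAD</b><br/>DEFINITION</div></w>
ENTRY_RE = re.compile(
    r'<w><a name="(?P<head>[^"]*)"\s*/><div><b>.*?</b><br\s*/>(?P<body>.*?)</div></w>',
    re.DOTALL,
)


def headwords_without_definition(html):
    """Return the headwords whose entry has no text after the <br/>."""
    return [
        match.group("head")
        for match in ENTRY_RE.finditer(html)
        if not match.group("body").strip()
    ]


before = '<w><a name="Alpha"/><div><b>Alpha</b><br/>This is the definition for Alpha</div></w>'
after = '<html><w><a name="Alpha"/><div><b>Alpha</b><br/></div></w>'

print(headwords_without_definition(before))
print(headwords_without_definition(after))
```

An empty list for the original file and a long list for the converted one would be consistent with the merged dictionary being much smaller than its inputs.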
|
#9 |
Evangelist
Posts: 495
Karma: 356531
Join Date: Jul 2016
Location: 'burta, Canada
Device: Kobo Glo HD
|
I hate to raise a thread from the dead, but I think it should be possible to do this in Penelope. However, I don't know Python, so I'm hoping someone out there does, as I don't think it would take much modification to add this functionality (basically altering two lines of code).
Currently, Penelope does have a function to read in a zipped Kobo file (so you can pass 'kobo' as an option for the -j switch right now even though it isn't in the documentation), but it only reads in the index because "The read function only acquires the index, as the definition files of the original Kobo dictionaries are obfuscated/encrypted." That is why the read loop explicitly passes an empty string rather than a definition:
Code:
for pair in trie.items():
    dictionary.add_entry(headword=pair[0], definition=u"")
EDIT: Maybe I spoke too soon. It looks like SOME dictionaries may be encrypted and some may not (my OCD may drive me to make a list when I have time). I might take a look at extending Penelope to process unencrypted dictionaries anyway, because why not? I still need to figure out how Penelope works and learn enough Python/Marisa to figure out where and what to change, though.
Anyway, I think if we can extract the definition and pass it in instead of that empty string, then Penelope should work with unencrypted Kobo dictionaries the way it does with the other formats. I can't tell if Penelope or Marisa gunzips or even opens any of those html files in the first place; if not, that functionality would need to be coded in too.
Assuming the gzip thing isn't an issue, that's where I'm stuck. I would have assumed that pair[1] would hold the definition, but it holds a number instead (then again, I have no idea how tries work and I find the Marisa tutorial somewhat lacking for my level of understanding). I don't know what to do with that number to extract the definition (use it to look something up in another array, maybe?). I did confirm that whatever string you place there gets written into the html file underneath the headword, so clearly that's where one would put the extracted definition.
The validation test would be to run something like this:
Code:
penelope -i dicthtml-en-ja.zip -j kobo -f ja -t en -p kobo -o dicthtml-ja-en
Last edited by rtiangha; 10-27-2019 at 11:42 PM. |
#10 |
Evangelist
Posts: 495
Karma: 356531
Join Date: Jul 2016
Location: 'burta, Canada
Device: Kobo Glo HD
|
OK, maybe this isn't as trivial as I thought. If I'm understanding this correctly, all Marisa keeps track of is a key (in this case, the headword) and an id (which is a number). It's up to you to use that id for your own purposes, but unless it's tied to a record somehow, I think that id is useless, and Marisa in this case is only really useful for super fast fuzzy searches on headwords.
The database of definitions is the html files themselves. While I might be overthinking this, I think you still have to write the logic to extract the first two letters of the headword to find the correct html file and then parse it to find the right definition, unless there's a library call that does just that (maybe Kobo wrote one for themselves).
So now I'm wondering if what needs to be done instead is to use Marisa to build up the original list of headwords from its index file, and then use that as a guide to go through all the html files and ingest the definitions so that Penelope can then manipulate them. In fact, since the headwords themselves are in the html files, I don't think reading the Marisa index file is even needed, because you can regenerate the original words file from the html entries themselves.
In which case, that's a lot of string manipulation, which I've always been weak on, and in a programming language that I'm not familiar with in the first place. On the plus side, the XML seems consistent (i.e. each definition is enclosed in <w> tags and the headwords are under <a name= > tags), so I assume there's an XML library that makes it easy to parse and manipulate that stuff (although I'm not well versed in XML either, so I don't know). Also, the rest of the code looks consistent in its behaviour, so once the Kobo dictionary data is ingested properly, the rest of Penelope should work the same.
But yeah, at the moment, I think this might be beyond my skill, at least until I can teach myself the various languages and libraries to figure out how to program this. I might have better luck writing a utility in a language I'm familiar with to merge just Kobo dictionaries, since all you'd need to do is merge (and maybe sort?) entries in html files with the same name (XSLT looks like it might do the job), then create a combined words list indexed with Marisa, and then zip everything up together. At least the Marisa stuff doesn't look complicated.
Last edited by rtiangha; 10-28-2019 at 03:47 AM. |
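The merge step sketched in this post (combine entries from html files with the same name, then sort them) might look something like the following. Everything here is illustrative: `page_name`, `merge_pages` and `render_page` are hypothetical helpers, the "first two letters" rule is taken from the post rather than from any Kobo documentation, and rebuilding the Marisa words index and zipping the result are left out.

```python
def page_name(headword):
    """Assumed naming rule from the post: first two letters, lowercased."""
    return headword[:2].lower() + ".html"


def merge_pages(first, second):
    """Merge two {page name: {headword: markup}} mappings.

    When both dictionaries define a headword, the two pieces of markup are
    concatenated, with the entry from `first` shown before `second`.
    """
    merged = {}
    for source in (first, second):
        for page, entries in source.items():
            target = merged.setdefault(page, {})
            for head, body in entries.items():
                target[head] = target.get(head, "") + body
    return merged


def render_page(entries):
    """Write the entries back out as one html page, sorted by headword.

    Note: headwords are not escaped here, so a quote in a headword would
    produce broken markup.
    """
    body = "".join(
        '<w><a name="{0}"/>{1}</w>'.format(head, markup)
        for head, markup in sorted(entries.items())
    )
    return "<html>" + body + "</html>"


en_en = {"al.html": {"Alpha": "<div><b>Alpha</b><br/>First Greek letter</div>"}}
en_tr = {
    "al.html": {"Alpha": "<div><b>Alpha</b><br/>alfa</div>"},
    "be.html": {"Beta": "<div><b>Beta</b><br/>beta</div>"},
}

merged = merge_pages(en_en, en_tr)
print(sorted(merged))
print(render_page(merged["al.html"]))
```

Concatenating on a clash is just one possible choice; keeping only the first definition, or wrapping each source in its own labelled block, would be equally reasonable.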