Merge .zip Dictionaries

Majorix · 04-01-2014, 11:16 AM

I have followed a guide I found in these forums, and I have successfully created a custom en->tr translation dictionary. It works fine. However, I want to merge this dictionary (let's say its name is dicthtml-eng-tur.zip) with Kobo's official en->en dictionary (copied from the device, named dicthtml.zip), and possibly with 1 or 2 additions like WordNet or something.

I have seen how you can do it before you create the dictionaries, using penelope again. But now I have the .zip files and I have seen no example usage of how you merge .zip dictionaries! Can someone help?

arasyi · 04-01-2014, 09:29 PM

Correct me if I'm wrong, but isn't the Kobo's official dictionary encrypted?

Majorix · 04-02-2014, 09:32 AM

Is it? I don't know much, sorry. Then is it possible to merge the en->tr dictionary with WordNet or other (unencrypted) dictionary and replace kobo's built-in one?

Majorix · 04-04-2014, 07:34 AM

Guys, nobody has tried this yet?

Uschiekid · 04-04-2014, 12:47 PM

Quote:

Originally Posted by Majorix

Guys, nobody has tried this yet?

Someone did this for me with Japanese and Japanese-English dictionaries... One being a Kobo-provided dictionary, the other created by a user on this forum. Memory says it was tshering... No idea how he did it though. So it should be possible...

edit: just checked my PMs, and it was indeed tshering who did this for me, try PMing him and asking how he did it!

tshering · 04-04-2014, 01:51 PM

Quote:

Originally Posted by Uschiekid

edit: just checked my PMs, and it was indeed tshering who did this for me, try PMing him and asking how he did it!

No need to PM. I can say it here. I extracted the definitions from the Kobo dictionary, merged them with the definitions of the other dictionaries, put them into html files, made an index file with marisa, and compressed the whole thing into a new dictionary file. I did this ad hoc, therefore I cannot share a tool chain or give detailed explanations.

Majorix, since you use penelope, the best way for you would be as follows (this is how I understand it from the penelope homepage): Bring all dictionaries you want to combine into one and the same format that penelope understands, and then merge them with penelope.

Quote:

I have seen how you can do it before you create the dictionaries, using penelope again. But now I have the .zip files and I have seen no example usage of how you merge .zip dictionaries! Can someone help?

On the penelope homepage I see that one feature is "merge more dictionaries (of the same type) into a single dictionary," and among the supported formats "Kobo" is listed. From this it seems that penelope can merge zipped (unencrypted Kobo) dictionaries. How exactly one does this, you have to find out. I guess you can find instructions or samples in the penelope package.

Majorix · 04-04-2014, 03:41 PM

@tshering:
I have successfully done the merge now. But there is a problem: The two dictionaries weigh 6.8MB and 2.1MB each. However, the merged dictionary is about 1.1MB. How come? Have I done something wrong?

tshering · 04-04-2014, 03:52 PM

Quote:

Originally Posted by Majorix

I have successfully done the merge now.

Congratulations!

Quote:

Originally Posted by Majorix

But there is a problem: The two dictionaries weigh 6.8MB and 2.1MB each. However, the merged dictionary is about 1.1MB. How come? Have I done something wrong?

Were the two dictionaries already zipped dictionaries (dicthtml-something.zip)? If yes, then this difference in size looks rather suspicious. You could unzip the merged dictionary, decompress one or the other .html file (it is gzipped) and look at what is really inside. Maybe this gives you a hint as to what has happened. Did you try the new dictionary already on the Kobo?
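The inspection suggested above can also be scripted. This is a minimal Python sketch, assuming each .html member of the zip is a plain gzip stream; the function name `summarize_dictionary` is made up for illustration. It reports, per .html file, the stored size, the decompressed size and the number of `<w>` entries, which should make an unexpectedly small merged dictionary easier to diagnose.

```python
import gzip
import zipfile
import zlib


def summarize_dictionary(zip_source):
    """Return {member name: (stored size, decompressed size, entry count)}.

    Assumes each .html member is a gzip stream. Members that cannot be
    decompressed are reported with None for the last two fields.
    """
    report = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if not info.filename.endswith(".html"):
                continue  # skip the index and any other non-html members
            raw = archive.read(info)
            try:
                html = gzip.decompress(raw)
            except (OSError, EOFError, zlib.error):
                report[info.filename] = (len(raw), None, None)
                continue
            # Every entry is wrapped in <w>...</w>, so count the opening tags.
            report[info.filename] = (len(raw), len(html), html.count(b"<w>"))
    return report
```

Running it on both input dictionaries and on the merged one (for example `summarize_dictionary("dicthtml-eng-tur.zip")`) shows whether entries are missing entirely or only their definitions.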

Edit: I have now tried penelope for the first time. When converting two Kobo dictionaries, the definitions get lost. For instance

Code:

<w><a name="Alpha"/><div><b>Alpha</b><br/>This is the definition for Alpha</div></w>

of one input dictionary becomes

Code:

<html><w><a name="Alpha"/><div><b>Alpha</b><br/></div></w>

in the merged dictionary. Is it the same with your dictionary?

rtiangha · 10-27-2019, 04:51 PM

I hate to raise a thread from the dead, but I think it should be possible to do this in Penelope. However, I don’t know Python so I’m hoping someone out there can as I don’t think it would take much modification to add this functionality in (basically alter two lines of code).

Currently, Penelope does have a function to read in a zipped Kobo file (so you could pass 'kobo' as an option for the -i switch right now even though it isn't in the documentation), but it only reads in the index because “The read function only acquires the index, as the definition files of the original Kobo dictionaries are obfuscated/encrypted.”

Which is why the read loop explicitly passes an empty string rather than a definition:

Code:

            for pair in trie.items():
                dictionary.add_entry(headword=pair[0], definition=u"")

However, we now know that the entries aren’t encrypted; they’re just gzipped (or at least, that's the case for some of the dictionaries; I haven't tried every single one...yet). You can verify this for yourself by taking any of the .html files, renaming it to .html.gz and running gunzip on it; the resultant .html file is completely readable!

EDIT: Maybe I spoke too soon. It looks like SOME dictionaries may be encrypted and some may not (my OCD may drive me to make a list when I have time). I might take a look at extending Penelope to be able to process unencrypted dictionaries anyways, because why not? Still need to figure out how Penelope works and learn enough Python/Marisa to figure out where and what to change though.
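One way to sort dictionaries into "plain gzip" and "something else" without renaming files by hand is to look at the first two bytes of each .html member, since every gzip stream starts with the bytes 1f 8b. The sketch below is only that check; `split_by_gzip_header` is a hypothetical helper name, and a missing gzip header only shows that the file is not plain gzip, not what was done to it instead.

```python
import zipfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def split_by_gzip_header(zip_source):
    """Return (gzipped, other): .html member names with / without a gzip header."""
    gzipped, other = [], []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".html"):
                continue  # the index file and other members are not checked
            with archive.open(name) as member:
                header = member.read(2)
            if header == GZIP_MAGIC:
                gzipped.append(name)
            else:
                other.append(name)
    return gzipped, other
```

A dictionary whose `other` list is empty is a candidate for the gunzip approach; one with a non-empty list needs a closer look.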

Anyway, I think if we can extract the definition and input that instead of that empty string, then Penelope should work like it does for the other formats with unencrypted Kobo dictionaries (and I can't tell if Penelope or Marisa gunzips or even opens any of those html files in the first place; if not, that functionality would need to be coded in too).

Assuming the gzip thing isn't an issue, that's where I'm stuck, though. I would have assumed that pair[1] would hold the definition, but it instead holds a number (then again, I have no idea how tries work and I find the Marisa tutorial somewhat lacking for my level of understanding). I don't know what to do with that number to extract the definition (use it to look something up in another array maybe?). I did confirm that it'll spit out whatever string you place there into the html file underneath the headword, so clearly, that's where one would put the extracted definition. The validation test would be to run something like this:

Code:

penelope -i dicthtml-en-ja.zip -j kobo -f ja -t en -p kobo -o dicthtml-ja-en

and the resultant dicthtml-ja-en.zip file would be exactly the same as the original dicthtml-en-ja.zip file.

rtiangha · 10-28-2019, 03:09 AM

OK, maybe this isn't as trivial as I thought. If I'm understanding this correctly, all Marisa keeps track of is a key (in this case, the headword) and an id (which is a number). It's up to you to use that id for your own purposes, but unless it's tied to a record somehow, I think that id is useless and Marisa in this case is only really useful for super fast fuzzy searches on headwords. The database of definitions is the html files themselves and while I might be overthinking this, I think you still have to write the logic to extract the first two letters of the headword to find the correct html file and then parse it to find the right definition, unless there's a library call that exists that does just that (maybe Kobo wrote one for themselves).
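The lookup logic described above could look roughly like this in Python. It is a sketch under loud assumptions: the file name is taken to be simply the first two letters of the headword in lower case (real dictionaries may apply extra rules for short words, digits or non-Latin scripts), the .html members are assumed to be plain gzip, and the text is assumed to be UTF-8. The names `guess_html_name` and `find_entry` are invented for the example.

```python
import gzip
import re
import zipfile


def guess_html_name(headword):
    # Assumption from the post above: the file is named after the first two
    # letters of the headword. Any extra naming rules are ignored here.
    return headword[:2].lower() + ".html"


def find_entry(zip_source, headword):
    """Return the <w>...</w> markup for headword, or None if it is not found."""
    with zipfile.ZipFile(zip_source) as archive:
        try:
            raw = archive.read(guess_html_name(headword))
        except KeyError:  # the archive has no member with that name
            return None
    html = gzip.decompress(raw).decode("utf-8")
    # Match the entry whose anchor carries exactly this headword.
    pattern = r'<w><a name="%s"\s*/>.*?</w>' % re.escape(headword)
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(0) if match else None
```

Note that the Marisa index is not consulted at all here, which fits the observation that the id it stores is not needed to locate a definition.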

So now I'm wondering if what needs to be done instead is to use Marisa to build up the original list of headwords from its index file, and then use that as a guide to go through all the html files and ingest the definitions so that Penelope can then manipulate them (in fact, since the headwords themselves are in the html files, I don't think reading the Marisa index file is even needed in the first place, because you can regenerate the original words file from the html entries themselves). In which case, that's a lot of string manipulation, which I've always been weak at, and in a programming language that I'm not familiar with in the first place. On the plus side, the XML seems consistent (i.e. each definition is enclosed in <w> tags and the headwords are under <a name= > tags), so I assume there's an XML library that makes it easy to parse and manipulate that stuff (although I'm not well versed in XML either, so I don't know).
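For what it's worth, regenerating the headword list from the html itself does not need much code. The sketch below uses a regular expression rather than an XML library, on the assumption that the files are not guaranteed to be well-formed XML (the merged sample earlier in the thread has an unclosed `<html>` tag). `parse_entries` is a hypothetical name, and the pattern only covers the `<w><a name="..."/>...</w>` layout shown in this thread.

```python
import html
import re

# One entry: <w>, an anchor carrying the headword, then the definition markup.
ENTRY_RE = re.compile(r'<w>\s*<a name="([^"]*)"\s*/>(.*?)</w>', re.DOTALL)


def parse_entries(html_text):
    """Return [(headword, definition markup), ...] in document order."""
    return [
        (html.unescape(match.group(1)), match.group(2))
        for match in ENTRY_RE.finditer(html_text)
    ]
```

The first element of each pair gives the words list, and the second is the string that would replace the empty `definition=u""` in Penelope's read loop.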

On the plus side, the rest of the code looks consistent in its behaviour, so once the Kobo dictionary data is ingested properly, the rest of Penelope should work the same. But yeah, at the moment, I think this might be beyond my skill, at least until I can teach myself the various languages and libraries to figure out how to program this. I might have better luck writing a utility in a language I'm familiar with to merge just Kobo dictionaries since all you'd need to do is merge (and maybe sort?) entries in html files with the same name (XSLT looks like it might do the job) and then create a combined words list indexed with Marisa and then zip everything up together. At least the Marisa stuff doesn't look complicated.
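The Kobo-only merge outlined above can be sketched in Python as well. Everything here is an assumption rather than a tested recipe: the .html members are taken to be plain gzip and UTF-8, entries are taken to follow the `<w><a name="..."/>...</w>` layout, the `<html>...</html>` wrapper is a guess, and the helper names are invented. The Marisa index is deliberately left out; the function only returns the sorted headword list that an indexer would need.

```python
import gzip
import re
import zipfile

ENTRY_RE = re.compile(r'<w>\s*<a name="([^"]*)"\s*/>.*?</w>', re.DOTALL)


def read_entries(zip_source):
    """Return {html member name: [(headword, full <w> markup), ...]}."""
    entries = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".html"):
                text = gzip.decompress(archive.read(name)).decode("utf-8")
                entries[name] = [
                    (m.group(1), m.group(0)) for m in ENTRY_RE.finditer(text)
                ]
    return entries


def merge_dictionaries(zip_a, zip_b, zip_out):
    """Write merged html files to zip_out and return the sorted headwords."""
    merged = read_entries(zip_a)
    for name, items in read_entries(zip_b).items():
        # Files with the same name are combined; others are carried over.
        merged.setdefault(name, []).extend(items)
    headwords = set()
    with zipfile.ZipFile(zip_out, "w") as archive:
        for name in sorted(merged):
            # Stable sort: entries for the same headword keep a-then-b order.
            items = sorted(merged[name], key=lambda item: item[0])
            headwords.update(word for word, _ in items)
            body = "<html>" + "".join(markup for _, markup in items) + "</html>"
            archive.writestr(name, gzip.compress(body.encode("utf-8")))
    return sorted(headwords)
```

Duplicate headwords are kept as separate entries rather than combined into one; whether the device displays both is something that would have to be tried on a Kobo.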
