Go Back   MobileRead Forums > E-Book Readers > Kobo Reader

Old 04-01-2014, 11:16 AM   #1
Majorix
Giant Hobbit
Posts: 49
Karma: 487552
Join Date: Aug 2009
Location: Turkey
Device: Kobo: Clara, Mini, Aura HD, Aura 2, Kindle: Paperwhite 1, DX 1
Merge .zip Dictionaries

I have followed a guide I found in these forums, and I have successfully created a custom en->tr translation dictionary. It works fine. However, I want to merge this dictionary (let's say its name is dicthtml-eng-tur.zip) with Kobo's official en->en dictionary (copied from the device, named dicthtml.zip), and possibly with 1 or 2 additions like WordNet or something.

I have seen how you can do it before you create the dictionaries, using penelope again. But now I have the .zip files and I have seen no example usage of how you merge .zip dictionaries! Can someone help?
Old 04-01-2014, 09:29 PM   #2
arasyi
Zealot
Posts: 105
Karma: 5885446
Join Date: Feb 2014
Device: Kobo Glo
Correct me if I'm wrong, but isn't the Kobo's official dictionary encrypted?
Old 04-02-2014, 09:32 AM   #3
Majorix
Is it? I don't know much, sorry. Then is it possible to merge the en->tr dictionary with WordNet or another (unencrypted) dictionary and replace Kobo's built-in one?

Last edited by Majorix; 04-02-2014 at 02:36 PM.
Old 04-04-2014, 07:34 AM   #4
Majorix
Guys, nobody has tried this yet?
Old 04-04-2014, 12:47 PM   #5
Uschiekid
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
Quote:
Originally Posted by Majorix View Post
Guys, nobody has tried this yet?
Someone did this for me with a Japanese and a Japanese-English dictionary, one being a Kobo-provided dictionary, the other created by a user on this forum. Memory says it was tshering... No idea how he did it though. So it should be possible...


edit: just checked my PMs, and it was indeed tshering who did this for me, try PMing him and asking how he did it!
Old 04-04-2014, 01:51 PM   #6
tshering
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
Quote:
Originally Posted by Uschiekid View Post
edit: just checked my PMs, and it was indeed tshering who did this for me, try PMing him and asking how he did it!
No need to PM. I can say it here. I extracted the definitions from the Kobo dictionary, merged them with the definitions of the other dictionaries, put them into html files, made an index file with marisa, and compressed the whole thing into a new dictionary file. I did this ad hoc, so I cannot share a tool chain or give detailed explanations.

Majorix, since you use penelope, the best way for you would be as follows (this is how I understand it from the penelope homepage): Bring all dictionaries you want to combine into one and the same format that penelope understands, and then merge them with penelope.

Quote:
I have seen how you can do it before you create the dictionaries, using penelope again. But now I have the .zip files and I have seen no example usage of how you merge .zip dictionaries! Can someone help?
On the penelope homepage I see that one feature is "merge more dictionaries (of the same type) into a single dictionary," and among the supported formats "Kobo" is listed. From this it seems that penelope can merge zipped (unencrypted Kobo) dictionaries. How exactly one does this, you have to find out. I guess you can find instructions or samples in the penelope package.

Last edited by tshering; 04-04-2014 at 01:54 PM.
Old 04-04-2014, 03:41 PM   #7
Majorix
@tshering:
I have successfully done the merge now. But there is a problem: The two dictionaries weigh 6.8MB and 2.1MB each. However, the merged dictionary is about 1.1MB. How come? Have I done something wrong?
Old 04-04-2014, 03:52 PM   #8
tshering
Quote:
Originally Posted by Majorix View Post
I have successfully done the merge now.
Congratulations!

Quote:
Originally Posted by Majorix View Post
But there is a problem: The two dictionaries weigh 6.8MB and 2.1MB each. However, the merged dictionary is about 1.1MB. How come? Have I done something wrong?
Were the two dictionaries already zipped dictionaries (dicthtml-something.zip)? If so, this difference in size looks rather suspicious. You could unzip the merged dictionary, decompress one of the .html files (they are gzipped), and look at what is really inside. Maybe this gives you a hint as to what happened. Have you already tried the new dictionary on the Kobo?
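
If you would rather script that check than do it by hand, here is a minimal Python sketch using only the standard library. It assumes the .html members inside the zip are gzip-compressed as described; the function names and the file names in the usage note are made up for illustration.

```python
import gzip
import zipfile


def list_html_members(zip_path):
    """List the .html members so you can pick one to inspect."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist() if name.endswith(".html")]


def peek_entry_file(zip_path, member, limit=500):
    """Return the first `limit` characters of one gzipped .html member."""
    with zipfile.ZipFile(zip_path) as archive:
        raw = archive.read(member)
    # Assumption: the member is gzip-compressed HTML in UTF-8.
    text = gzip.decompress(raw).decode("utf-8", errors="replace")
    return text[:limit]
```

For example, print(peek_entry_file("dicthtml-merged.zip", "al.html")) would show the start of one entry file, where both names are placeholders for whatever list_html_members reports for your dictionary.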

Edit: I have now tried penelope for the first time. When converting two Kobo dictionaries, the definitions get lost. For instance

Code:
<w><a name="Alpha"/><div><b>Alpha</b><br/>This is the definition for Alpha</div></w>
of one input dictionary becomes
Code:
<html><w><a name="Alpha"/><div><b>Alpha</b><br/></div></w>
in the merged dictionary. Is it the same with your dictionary?
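
To check a whole decompressed .html file for this symptom, something like the following rough sketch could help. It assumes entries look like the two samples above (one <w> element per entry, headword in <a name="...">, definition text after the first <br/>); real dictionaries may be laid out differently, and the function name is only illustrative.

```python
import re

# One <w>...</w> element per entry, as in the samples above.
ENTRY_RE = re.compile(r"<w>(.*?)</w>", re.DOTALL)
# The headword is taken from the <a name="..."/> anchor.
NAME_RE = re.compile(r'<a name="([^"]*)"')
# Used to strip markup so that only visible text is left.
TAG_RE = re.compile(r"<[^>]+>")


def headwords_without_definition(html):
    """Return headwords whose entry has no text after the first <br/>."""
    missing = []
    for match in ENTRY_RE.finditer(html):
        entry = match.group(1)
        name = NAME_RE.search(entry)
        if name is None:
            continue
        # Assumption: everything after the first <br/> is the definition.
        # Entries without any <br/> are also reported as missing.
        _, _, body = entry.partition("<br/>")
        if not TAG_RE.sub("", body).strip():
            missing.append(name.group(1))
    return missing
```

If this returns most of the headwords of a file, the size drop after merging is probably explained by the lost definitions.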

Last edited by tshering; 04-04-2014 at 05:25 PM.
Old 10-27-2019, 04:51 PM   #9
rtiangha
Evangelist
Posts: 495
Karma: 356531
Join Date: Jul 2016
Location: 'burta, Canada
Device: Kobo Glo HD
I hate to raise a thread from the dead, but I think it should be possible to do this in Penelope. However, I don’t know Python, so I’m hoping someone out there does, as I don’t think it would take much modification to add this functionality (basically altering two lines of code).

Currently, Penelope does have a function to read in a zipped Kobo file (so you could pass 'kobo' as an option for the -i switch right now even though it isn't in the documentation), but it only reads in the index because “The read function only acquires the index, as the definition files of the original Kobo dictionaries are obfuscated/encrypted.”

Which is why the read loop explicitly passes an empty string rather than a definition:

Code:
            for pair in trie.items():
                dictionary.add_entry(headword=pair[0], definition=u"")
However, we now know that the entries aren’t encrypted; they’re just gzipped (or at least, that's the case for some of the dictionaries; I haven't tried every single one...yet). You can verify this for yourself by taking any of the .html files, renaming it to .html.gz, and running gunzip on it; the resulting .html file is completely readable!

EDIT: Maybe I spoke too soon. It looks like SOME dictionaries may be encrypted and some may not (my OCD may drive me to make a list when I have time). I might take a look at extending Penelope to be able to process unencrypted dictionaries anyways, because why not? Still need to figure out how Penelope works and learn enough Python/Marisa to figure out where and what to change though.
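
One way to sort dictionaries into "plain gzip" and "something else" without renaming files by hand is to look for the gzip magic bytes (0x1f 0x8b) and then attempt a full decompression. This is only a sketch: "unknown" below means "not readable as gzip", which may or may not mean encrypted, and the function name is hypothetical.

```python
import gzip
import zipfile
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def classify_members(zip_path):
    """Map each .html member of the zip to 'gzip' or 'unknown'."""
    result = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".html"):
                continue
            raw = archive.read(name)
            if raw[:2] != GZIP_MAGIC:
                result[name] = "unknown"
                continue
            try:
                # The magic bytes alone prove little, so decompress fully.
                gzip.decompress(raw)
            except (OSError, EOFError, zlib.error):
                result[name] = "unknown"
            else:
                result[name] = "gzip"
    return result
```

Running this over each dicthtml zip and collecting the dictionaries where any member comes back "unknown" would give a first cut of such a list.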

Anyway, I think if we can extract the definition and input that instead of that empty string, then Penelope should work like it does for the other formats with unencrypted Kobo dictionaries (and I can't tell if Penelope or Marisa gunzips or even opens any of those html files in the first place; if not, that functionality would need to be coded in too).

Assuming the gzip thing isn't an issue, that's where I'm stuck, though. I would have assumed that pair[1] would hold the definition, but it instead holds a number (then again, I have no idea how tries work and I find the Marisa tutorial somewhat lacking for my level of understanding). I don't know what to do with that number to extract the definition (use it to look something up in another array maybe?). I did confirm that it'll spit out whatever string you place there into the html file underneath the headword, so clearly, that's where one would put the extracted definition. The validation test would be to run something like this:

Code:
penelope -i dicthtml-en-ja.zip -j kobo -f ja -t en -p kobo -o dicthtml-ja-en
and the resultant dicthtml-ja-en.zip file would be exactly the same as the original dicthtml-en-ja.zip file.

Last edited by rtiangha; 10-27-2019 at 11:42 PM.
Old 10-28-2019, 03:09 AM   #10
rtiangha
OK, maybe this isn't as trivial as I thought. If I'm understanding this correctly, all Marisa keeps track of is a key (in this case, the headword) and an id (which is a number). It's up to you to use that id for your own purposes, but unless it's tied to a record somehow, I think that id is useless and Marisa in this case is only really useful for super fast fuzzy searches on headwords. The database of definitions is the html files themselves and while I might be overthinking this, I think you still have to write the logic to extract the first two letters of the headword to find the correct html file and then parse it to find the right definition, unless there's a library call that exists that does just that (maybe Kobo wrote one for themselves).

So now I'm wondering if what needs to be done instead is to use Marisa to build up the original list of headwords from its index file, and then use that as a guide to go through all the html files and ingest the definitions so that Penelope can then manipulate them. In fact, since the headwords themselves are in the html files, I don't think reading the Marisa index file is even needed in the first place, because you can regenerate the original words file from the html entries themselves. In which case, that's a lot of string manipulation, which I've always been weak at, and in a programming language that I'm not familiar with in the first place. On the plus side, the XML seems consistent (i.e. each definition is enclosed in <w> tags and the headwords are under <a name= > tags), so I assume there's an XML library that makes it easy to parse and manipulate that stuff (although I'm not well versed in XML either, so I don't know).
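
As a rough illustration of pulling headwords and entries straight out of a decompressed html file, here is a small Python sketch. It uses regular expressions rather than an XML library, on the assumption that the files are not guaranteed to be well-formed XML, and it assumes every entry is a <w>...</w> element with the headword in <a name="...">. The function names are invented for this example.

```python
import re

# Assumption: each entry is one non-nested <w>...</w> element.
ENTRY_RE = re.compile(r"<w>.*?</w>", re.DOTALL)
# Assumption: the headword is the value of the first <a name="..."> inside it.
NAME_RE = re.compile(r'<a name="([^"]*)"')


def split_entries(html):
    """Return (headword, entry_html) pairs in document order."""
    pairs = []
    for match in ENTRY_RE.finditer(html):
        entry = match.group(0)
        name = NAME_RE.search(entry)
        if name:
            pairs.append((name.group(1), entry))
    return pairs


def headwords(html):
    """Return just the headwords, e.g. to rebuild the words list."""
    return [word for word, _ in split_entries(html)]
```

The entry_html part of each pair still contains the full markup, so whatever consumes it can decide how much of the entry counts as the definition.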

On the plus side, the rest of the code looks consistent in its behaviour, so once the Kobo dictionary data is ingested properly, the rest of Penelope should work the same. But yeah, at the moment, I think this might be beyond my skill, at least until I can teach myself the various languages and libraries to figure out how to program this. I might have better luck writing a utility in a language I'm familiar with to merge just Kobo dictionaries since all you'd need to do is merge (and maybe sort?) entries in html files with the same name (XSLT looks like it might do the job) and then create a combined words list indexed with Marisa and then zip everything up together. At least the Marisa stuff doesn't look complicated.
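
A standalone merge along those lines might look roughly like the Python sketch below. Several things are assumptions rather than known facts about the Kobo format: that all .html members are gzipped UTF-8, that entries are <w>...</w> elements with the headword in <a name="...">, that wrapping the entries in a bare <html> element is acceptable, and that sorting by lowercased headword is good enough. The Marisa index is deliberately not built here; all_headwords only produces the list you would feed into that step.

```python
import gzip
import re
import zipfile

ENTRY_RE = re.compile(r"<w>.*?</w>", re.DOTALL)
NAME_RE = re.compile(r'<a name="([^"]*)"')


def read_entries(zip_path):
    """Return {member name: [entry html, ...]} for every .html member."""
    files = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".html"):
                text = gzip.decompress(archive.read(name)).decode("utf-8")
                files[name] = ENTRY_RE.findall(text)
    return files


def entry_key(entry):
    """Sort key: the headword from <a name="...">, lowercased."""
    match = NAME_RE.search(entry)
    return match.group(1).lower() if match else ""


def merge_files(first, second):
    """Combine entries of same-named files and sort them by headword."""
    merged = {}
    for name in sorted(set(first) | set(second)):
        entries = first.get(name, []) + second.get(name, [])
        # sorted() is stable, so for equal headwords `first` stays ahead.
        merged[name] = sorted(entries, key=entry_key)
    return merged


def all_headwords(files):
    """Sorted, de-duplicated headwords: the input for the index step."""
    words = set()
    for entries in files.values():
        for entry in entries:
            match = NAME_RE.search(entry)
            if match:
                words.add(match.group(1))
    return sorted(words)


def write_html_members(zip_path, files):
    """Write gzipped .html members only; no Marisa index is created."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for name, entries in files.items():
            html = "<html>" + "".join(entries) + "</html>"
            archive.writestr(name, gzip.compress(html.encode("utf-8")))
```

Duplicate headwords are kept as separate entries one after the other, which sidesteps the question of how to combine two definitions of the same word.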

Last edited by rtiangha; 10-28-2019 at 03:47 AM.
All times are GMT -4. The time now is 10:14 AM.

