02-02-2014, 06:09 PM   #1
User_Name
Join Date: Feb 2014
Device: (Bookeen(Gen3|FrontLightHD)|KoboAuraHD)
DE-PL Kobo Dictionary

As the German-German Kobo dictionary sucks compared to what you can get for free (and even compared to the German-English Kobo dictionary... meh), I decided to make my own.

I decided not to crawl pons.eu for lack of time (though it's feasible) and instead used the word database that is free to download from http://www.depl.pl/ .
The resulting dictionary is in the attachment.

What I had to do beyond what the tutorials describe:
1) This particular word database has a strange format: it's like TAB-separated, but uses a double space instead of a tab (sed 's/  /\t/' a.utf8 > depl.tab).
stardict-editor created a StarDict dictionary from the TAB file as soon as I removed the repetitions (the program complained verbosely about what I was missing).
2) 'python2.7 penelope.py -f de -t pl -p depl --output-kobo' prepared the *.html files and applied marisa to them successfully, but then failed to create 'words' and zipped the result with the wrong filename encoding.
3) 'cut -f1 depl.tab > words' cured the first problem.
4) I had to pack the files with the Windows version of 7zip (under wine), because Linux 7z, p7zip and zip failed to produce the shitty encoding (i.e. they encoded filenames in UTF-8 instead of whatever ancient code page zip uses). This turned out to be wrong; please read the edit below.
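
For anyone who would rather do steps 1 and 3 in one go, here is a rough Python sketch of the same idea. It is not what I actually ran (that was sed and cut as above). The function names are made up for illustration, and it assumes that "repetitions" means repeated headwords and that keeping the first definition is acceptable.

```python
def convert_depl(lines):
    """Split 'headword<two spaces>definition' lines into (headword, definition) rows.

    Assumptions (not confirmed by the database docs): the FIRST double space
    separates headword from definition, and a repeated headword is dropped,
    keeping only its first occurrence.
    """
    seen = set()
    rows = []
    for line in lines:
        head, sep, definition = line.rstrip("\n").partition("  ")
        if not sep or not head:
            continue  # no double-space separator found, skip the line
        if head in seen:
            continue  # repeated headword, keep the first one only
        seen.add(head)
        rows.append((head, definition.strip()))
    return rows


def to_tab(rows):
    """Render rows as TAB-separated text (what stardict-editor is fed)."""
    return "".join(f"{head}\t{definition}\n" for head, definition in rows)


def to_words(rows):
    """Render only the headwords, one per line (same as 'cut -f1')."""
    return "".join(f"{head}\n" for head, _ in rows)


if __name__ == "__main__":
    sample = ["Haus  dom\n", "Haus  budynek\n", "möglich  możliwy\n"]
    rows = convert_depl(sample)
    print(to_tab(rows), end="")
    print(to_words(rows), end="")
```

To use it on the real data you would read a.utf8 with encoding="utf-8" and write the two results to depl.tab and words, again as UTF-8.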

The dictionary is installed over some Italian-English one that I doubt I'll ever use.
I tried inserting the de-pl one into the sqlite database. 'Manage dictionaries' then shows the dictionary correctly, but the reader must have the list built in somewhere in libnickel1.0.0 (I don't have a toolchain to play with it, but with a binary editor you can see a hard-coded list of dictionaries...)

(As for the license: the web page states that you can copy at will unless you alter the software or want any cash: http://www.depl.pl/licencja.html . I assume this does not cover changing the file format, especially as the website offers the dictionary in Kindle format for free.)
_____________________________________
Edit:
I got confused by the encoding...
The Windows version of 7zip created an archive that I could list and extract with all ö and ß intact. Linux zip and (p)7z(ip) created an archive that, when listed or extracted, replaced all national characters with garbage (e.g. the ö in zö.html came out mangled).
But the Kobo did not recognize the correctly encoded zip (i.e. the one that extracted fine) and accepted the shitty one. The result with the Windows 7z archive: unmöglich was in un.html and its definition was displayed, but möglich was in mö.html and was not displayed.
What I find strange is that the dictionaries shipped with the Kobo work correctly both on the device and on Linux/Windows. To check whether the encoding is fine on the Kobo, you can probably telnet in and run unzip -l (at least this worked in my case). I've replaced the dictionary in the attachment with a corrected one...
tl;dr: "zip dicthtml-de-pl.zip *" works in Kobo even if äöüß are garbage on PC
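
If you want to look at the filename encoding on a PC without eyeballing garbage, here is a small Python sketch. It only shows two things: whether each zip entry has the "name is UTF-8" flag set (general-purpose bit 11), and what a UTF-8 name looks like when a tool reads its bytes as CP437, which is one plausible source of the garbage I saw. I have not verified which variant Kobo actually expects, so treat this as a diagnostic aid only. The helper names are invented for the example, and the sample archive is built with Python's own zipfile, which sets the flag for non-ASCII names.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8


def name_report(zip_bytes):
    """Return (name, utf8_flag_set) for every entry of a zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(info.filename, bool(info.flag_bits & UTF8_FLAG))
                for info in zf.infolist()]


def as_cp437(name):
    """Show how a name's UTF-8 bytes look when decoded as CP437."""
    return name.encode("utf-8").decode("cp437")


def make_sample():
    """Build a tiny in-memory archive with one ASCII and one non-ASCII name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("un.html", "<html></html>")
        zf.writestr("zö.html", "<html></html>")
    return buf.getvalue()


if __name__ == "__main__":
    for name, flagged in name_report(make_sample()):
        print(name, "utf8-flag" if flagged else "no-flag", "->", as_cp437(name))
```

For a real file, pass open("dicthtml-de-pl.zip", "rb").read() to name_report. Names without the flag are decoded by Python as CP437, so a garbled name in the report means the raw bytes are probably UTF-8 stored without the flag.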
Attached Files
File Type: zip dicthtml-de-pl.zip (1.62 MB, 1203 views)

Last edited by User_Name; 02-05-2014 at 07:06 AM. Reason: dict update - I mixed up encodings prev.