MobileRead Forums - View Single Post

Loceka · 01-10-2013, 02:41 PM

Hello,
I already posted this in another thread, but I see it belongs here instead:

By combining the method described in this thread for generating Kobo dictionaries with the one from that thread for extracting .mobi dictionaries, I've written a Perl script that creates a Kobo dictionary from an extracted mobi:

Run it on Linux using:

Code:

perl mobi2kobo.pl -i <input file> -o <output dir>

For the moment it expects a cp1252 (WinLatin1) encoded HTML Mobi file as input. If the input file is UTF-8 encoded, the encoding has to be changed in the source code.
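For anyone who would rather not touch the script, a possible workaround is to re-encode a UTF-8 input file as cp1252 before running it. This is only a minimal Python sketch and not part of mobi2kobo.pl; the function name `utf8_to_cp1252` and the file paths are hypothetical. It does not touch any charset declaration inside the HTML itself.

```python
from pathlib import Path


def utf8_to_cp1252(src: str, dst: str) -> None:
    """Re-encode a UTF-8 HTML file as cp1252 (hypothetical helper)."""
    text = Path(src).read_bytes().decode("utf-8")
    # Characters with no cp1252 equivalent are written as numeric
    # HTML entities (&#NNNN;) instead of raising an error.
    data = text.encode("cp1252", errors="xmlcharrefreplace")
    Path(dst).write_bytes(data)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    utf8_to_cp1252("dictionary_utf8.html", "dictionary_cp1252.html")
```

The converted file can then be passed to the script with `-i` as usual.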

Also, I noticed that some keywords were badly encoded while their definitions were correct, so the script tries to detect when this happens and recover the correct keyword. As the check is based only on word length, it may fail (and should then be deactivated).

When I wrote the script, I thought the xx.html files (all but 11.html) were reserved for letters only, but after reading your posts I see I was mistaken (and the way I did it should be corrected by comparing the uppercase and lowercase prefixes).

Still, I noticed that if a space character is among the first two characters, the word has to be put in xa.html instead of 11.html.
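To make the file-selection rule above concrete, here is a rough Python sketch of how a headword could be mapped to its xx.html file. It is not the logic of mobi2kobo.pl and not a confirmed description of what Kobo does. The function name `kobo_prefix` is hypothetical, and three points are assumptions: that only the first two characters matter, that short words are padded with "a", and that a space is treated as "a" whichever of the two positions it is in.

```python
def kobo_prefix(word: str) -> str:
    """Guess the two-character file prefix for a headword (hypothetical helper).

    Returns e.g. "ap" for ap.html, or "11" for the catch-all 11.html.
    """
    # Assumption: only the first two characters matter; pad short words with "a".
    head = word.lower()[:2].ljust(2, "a")
    # Assumption: a space counts as "a", so "x ray" lands in xa.html.
    head = head.replace(" ", "a")
    # Treat a character as a letter when its uppercase and lowercase forms differ.
    if all(ch.upper() != ch.lower() for ch in head):
        return head
    return "11"


if __name__ == "__main__":
    for w in ["Apple", "x ray", "1st", "a"]:
        print(w, "->", kobo_prefix(w) + ".html")
```

The uppercase/lowercase comparison is what lets accented or non-Latin letters count as letters, while digits and punctuation fall through to 11.html.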

I also have a question:
The official Kobo dictionaries seem to have changed a bit, and their *.html files can no longer be opened with gzip or 7zip (you can try with the latest English dictionary release).
Have you found a way to open the official .html files?

Thanks,
Loceka.
