Thanks go to the forum members at Mobileread, particularly in the "New dictionary format of firmware 2.14" thread, for working out the format of the Kobo dictionaries and how to create new ones.
Version | Date | Description | Author |
---|---|---|---|
1.0 | 22 November 2012 | First version. | ShellShock |
1.1 | 23 November 2012 | Improvements suggested by tshering. | ShellShock |
The following instructions are for Windows and Unix. Be prepared - for all but the smallest dictionaries, you will probably need some coding or scripting skills to convert your source dictionary into a format from which the Kobo dictionary can be built.

    <?xml version="1.0" encoding="utf-8"?>
    <html>
    <w><p><a name="word"/>Definition of word. Most HTML tags are allowed.</p></w>
    </html>

Although you can use most HTML tags in the definitions, links to resources outside the html file do not work. So anchor tags (a) with hrefs pointing to other html files do not work. This is a real pity, because it means you cannot link from a word definition to another word in the dictionary. Here is an example aa.html file:

    <?xml version="1.0" encoding="utf-8"?>
    <html>
    <w><p><a name="aardvark"/><b>aardvark</b>. A mammal native to Africa.</p></w>
    <w><p><a name="atom"/><b>atom</b>. A fundamental particle.</p></w>
    <w><p><a name="atom"/><b>atom</b>. An extremely small amount.</p></w>
    </html>

The example also shows that if a word has multiple definitions, you should create a w tag for each definition.

Single-letter words must go into an html file named xa.html, where x is the word. For example, the word I should be defined in the file ia.html. Use lower-case letters for your html file names; the case of the word inside the definitions is not so important. So, in the ia.html file we might have:

    <?xml version="1.0" encoding="utf-8"?>
    <html>
    <w><p><a name="I"/>First person pronoun.</p></w>
    <w><p><a name="i"/>A mathematical symbol.</p></w>
    </html>

Words where either the first or second character is not a letter should be defined in a file called 11.html. For example, the words 1a and o'clock should both be defined in 11.html.

I recommend that you use UTF-8 file encoding for the html files. This gives you a very wide range of characters to choose from, including a lot of symbols (which can often be used to replace images in the source dictionary). The Kobo has good support for UTF-8 in dictionaries (I have not yet found a character it will not display correctly).

Although it would be possible to create the html files manually using a text editor, this would require a lot of work, especially if your source dictionary has a lot of entries. So if you have any sort of coding skills, now is the time to use them! Also bear in mind that source dictionaries come in many different file formats, so there is no single solution for converting them into the html format required for the Kobo.
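Once a source dictionary has been parsed into (word, definition) pairs, the file naming rules above can be sketched in Python roughly as follows. This is only an illustration, not a tested converter: the function names are made up for this example, it assumes that ordinary words go into the file named after their first two letters (which the aa.html naming suggests, but the text above does not state outright), and it treats anything Python's `str.isalpha()` accepts as a "letter", which may not match what the Kobo does for accented characters.

```python
from collections import defaultdict
from html import escape

XML_HEADER = '<?xml version="1.0" encoding="utf-8"?>'


def html_prefix(word):
    """Return the html file name stem (without ".html") for a word."""
    w = word.lower()  # file names use lower-case letters
    # Single-letter word x goes into xa.html.
    if len(w) == 1 and w.isalpha():
        return w + "a"
    # First or second character not a letter: goes into 11.html.
    # Assumption: a single non-letter character (e.g. "1") also goes here.
    if len(w) < 2 or not w[0].isalpha() or not w[1].isalpha():
        return "11"
    # Assumption: all other words are filed under their first two letters.
    return w[:2]


def build_pages(entries):
    """Group (word, definition_html) pairs into {file name: file content}.

    One <w> element is emitted per pair, so a word with several
    definitions simply appears in several pairs.
    """
    grouped = defaultdict(list)
    for word, definition_html in entries:
        # Escape only what would break the attribute value.
        name = escape(word, quote=False).replace('"', "&quot;")
        grouped[html_prefix(word)].append(
            '<w><p><a name="%s"/>%s</p></w>' % (name, definition_html)
        )
    pages = {}
    for prefix, items in grouped.items():
        lines = [XML_HEADER, "<html>"] + items + ["</html>"]
        pages[prefix + ".html"] = "\n".join(lines) + "\n"
    return pages
```

Each value in the returned dictionary can then be written to disk with `open(name, "w", encoding="utf-8")`. The definition text is passed through unchanged, so it must already be valid HTML.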

    a
    aardvark
    aargh
    aback

Very important: line endings in the index.txt file must be in Unix format. That is, each line must end with just a line-feed character (10), and not the carriage-return (13) + line-feed (10) pair normally used by Windows. This will keep the marisa-build tool happy (see later).
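If the word list is produced by a script, the line-ending requirement can be met by writing the file explicitly with line-feed endings. A minimal Python sketch, assuming one word per line; the function name is hypothetical, and the sorting and de-duplication are conveniences added here rather than something the format is known to require:

```python
def write_index(words, path="index.txt"):
    """Write one word per line with bare line-feed (LF) line endings."""
    # newline="\n" stops Python translating "\n" into "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for word in sorted(set(words)):
            f.write(word + "\n")
```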

    7z a -tgzip "compressed\aa.html" "aa.html"

This puts the compressed aa.html file into a compressed sub-directory. The equivalent Unix command is:

    gzip -c aa.html > compressed/aa.html
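With many html files, running the command by hand for each one gets tedious. The following Python sketch does the same job on both Windows and Unix using the standard `gzip` module. The function name and the default directory names are only illustrative, and it assumes the Kobo accepts any standard gzip stream, which has not been verified here.

```python
import gzip
import shutil
from pathlib import Path


def compress_html_files(source_dir=".", target_dir="compressed"):
    """Gzip every *.html file in source_dir into target_dir, keeping its name."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    # glob() is not recursive, so files already in a sub-directory are skipped.
    for html_file in sorted(source.glob("*.html")):
        out_path = target / html_file.name  # e.g. compressed/aa.html
        with open(html_file, "rb") as src, gzip.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

As with the commands above, the compressed file keeps the .html name rather than gaining a .gz extension.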

    marisa-build -owords index.txt

This creates a file called words, which contains your indexed words. To test the index, run marisa-lookup:

    marisa-lookup words

At the marisa-lookup prompt (a blank line), type in one of your indexed words and hit Enter. A number greater than -1 should be displayed, which is the key for the word in the index. If you get back -1, the word is not indexed. Check that you have used Unix line endings in your index.txt file!
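If a lookup returns -1 and Windows line endings are the suspect, the index.txt file can be repaired in place and the index rebuilt. A small Python sketch of the repair (the function name is hypothetical); it works on raw bytes, so the text encoding of the file does not matter:

```python
def to_unix_line_endings(path="index.txt"):
    """Rewrite a file in place with LF line endings; return True if it changed."""
    with open(path, "rb") as f:
        data = f.read()
    # Convert CR+LF pairs first, then any stray lone CR characters.
    fixed = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if fixed != data:
        with open(path, "wb") as f:
            f.write(fixed)
    return fixed != data
```

After a change, run marisa-build again so that the words file is regenerated from the corrected index.txt.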

    7z a -tzip dicthtml.zip *.html words

On Unix this is:

    zip dicthtml.zip *.html words

This will create a dictionary file that will replace the English dictionary on the Kobo. If you want to replace a different language dictionary, use the appropriate suffix, e.g. dicthtml-de.zip for German or dicthtml-nl.zip for Dutch.
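The packaging step can also be scripted. The Python sketch below builds the archive with the standard `zipfile` module. It rests on a few assumptions: that the files to package are the gzip-compressed html files from the compressed sub-directory, that every file sits at the top level of the archive (as it would when the commands above are run from a single directory), and that ordinary Deflate compression is acceptable. The function and parameter names are invented for this example.

```python
import zipfile
from pathlib import Path


def build_dictionary_zip(compressed_dir, words_file, zip_path="dicthtml.zip"):
    """Package the compressed html files and the words index into one zip."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for html_file in sorted(Path(compressed_dir).glob("*.html")):
            # arcname drops the directory part so files land at the top level.
            archive.write(html_file, arcname=html_file.name)
        archive.write(words_file, arcname="words")
    return zip_path
```

For another language, pass a different `zip_path`, for example `"dicthtml-de.zip"`.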
Good luck!