06-30-2016, 04:08 PM | #1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2016
Device: Kindle PW3
|
MobiPocket dictionary index extraction/format/method
Hi all,
I'm trying to do something a bit unusual: I like looking up words on the Kindle so much that I'd like to do something similar on the PC, rather than going to my favourite online dictionary. What I have in mind, to start with, is a program that accepts a word and a dictionary file and renders the matching entry.
I have unpacked the .mobi of my dictionary with KindleUnpack and had a look through the HTML. The entries are there, marked up with tags like <idx:orth value="word"> and <idx:iform value="inflectedword"/>, with the entry text itself in ordinary HTML. Using any old XML parser I can scan through the .html file, search for the given word and return the relevant HTML; it would then be easy to render it with WebKit or another renderer.
The problem is that the unpacked HTML is 50-odd megabytes, so if I enter a word beginning with Z I have to wait 10 seconds or so while everything is scanned. My little Kindle handles it faster than that, so I'd like to do better. First off, I'm looping through in Python, which is a slow language. I could do better in C++, but that's a pain.
Presumably, though, when a .mobi file is created, all those index entries are assembled in some form that allows quick lookup of the relevant locations. I don't see any evidence of this in the unpacked file, but it's quite possible that KindleUnpack just discards this information. I'd like to know whether it's possible to extract this information or, failing that, what format it takes so that I can create something similar. The idea would be a method for generating my own lookup table for any KindleUnpack-extracted file, for rapid indexing into the XML.
The trouble is that the most obvious way I can think of - just noting the byte position in the file where the relevant data starts - doesn't work well with my current method of using a standard (and therefore fast) XML parser to extract the info. I could skip the parser when locating the entry (in my example dictionary, the next <div> after the <idx:...> tag contains the entry and there is no nesting, so a simple regex would find it), but this would discard all the ancestor elements, which might be useful, and it might break with unusual dictionaries. If I take this approach I may as well do something even more straightforward: just search the file for value="..." instances and not bother with XML parsing at all. Any thoughts? |
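For what it's worth, the "just search for value=..." idea can be sketched in a few lines of Python. This is only a rough illustration, not how Mobipocket itself does it: the function name build_offset_index is made up, and the regex assumes every headword and inflection carries a value="..." attribute as in the tags quoted above, which may not hold for other dictionaries.

```python
import re

# Assumption: headwords and inflections look like <idx:orth value="..."> and
# <idx:iform value="..."/>; other dictionaries may lay their tags out differently.
TAG_RE = re.compile(rb'<idx:(orth|iform)\b[^>]*?\bvalue="([^"]*)"')

def build_offset_index(data: bytes) -> dict[str, int]:
    """Map each headword or inflected form to the byte offset of its <idx:orth> tag."""
    index: dict[str, int] = {}
    entry_offset = None
    for m in TAG_RE.finditer(data):
        if m.group(1) == b"orth":
            entry_offset = m.start()  # a new entry starts here
        if entry_offset is None:
            continue  # inflection seen before any headword; nothing to point at
        word = m.group(2).decode("utf-8", errors="replace")
        index.setdefault(word, entry_offset)  # keep the first occurrence only
    return index
```

The file would be read once in binary mode (open(path, "rb").read()), and the resulting dictionary could be saved so that later lookups only need a seek to the stored offset.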
07-01-2016, 03:18 AM | #2 |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Since you appear to have the necessary editing skills, I'd recommend that you convert the decompiled dictionary to a Babylon source file (GLS) using regular expressions and/or sed.
For example, you'd need to convert the following English-German sample dictionary:
Code:
<html>
<body>
<idx:entry>
<b><idx:orth>book
<idx:infl>
<idx:iform value="books"/>
</idx:infl>
</idx:orth>
</b>
<i>Subst.</i>
<br/>
Buch (n)
</idx:entry>
<hr/>
<idx:entry>
<b><idx:orth>go
<idx:infl>
<idx:iform value="goes"/>
<idx:iform value="going"/>
<idx:iform value="went"/>
<idx:iform value="gone"/>
</idx:infl>
</idx:orth>
</b>
<i>Verb</i>
<br/>
gehen
</idx:entry>
</body>
</html>
Code:
#stripmethod=keep
#sametypesequence=h
#bookname=English-German Dictionary

book|books
<i>Subst.</i><br/>Buch (n)

go|goes|going|went|gone
<i>Verb</i><br/>gehen

Both dictionary types can be used with GoldenDict and many other dictionary apps. |
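If you'd rather script the conversion than do it by hand, here is a minimal Python sketch of the regex approach. It assumes the markup looks exactly like the sample above (headword as text inside <idx:orth>, inflections as <idx:iform value="..."/>, headword wrapped in <b>); the function name entries_to_gls is invented for this example, and a real dictionary will probably need different patterns.

```python
import re

# All three patterns assume the sample layout shown above.
ENTRY_RE = re.compile(r"<idx:entry>(.*?)</idx:entry>", re.S)
ORTH_RE = re.compile(
    r"<idx:orth>\s*([^<]*?)\s*(<idx:infl>.*?</idx:infl>)?\s*</idx:orth>", re.S
)
IFORM_RE = re.compile(r'<idx:iform value="([^"]*)"\s*/>')

def entries_to_gls(html: str, bookname: str) -> str:
    """Turn <idx:entry> blocks into GLS text: header, then headword line + definition line."""
    out = ["#stripmethod=keep", "#sametypesequence=h", f"#bookname={bookname}", ""]
    for entry in ENTRY_RE.findall(html):
        orth = ORTH_RE.search(entry)
        if orth is None:
            continue  # entry without a recognisable headword; skip it
        forms = [orth.group(1)] + IFORM_RE.findall(orth.group(2) or "")
        body = entry[:orth.start()] + entry[orth.end():]  # definition without the headword
        body = re.sub(r"<b>\s*</b>", "", body)            # drop the now-empty <b> wrapper
        body = " ".join(body.split())                      # definition must fit on one line
        out += ["|".join(forms), body, ""]
    return "\n".join(out)
```

Note that this keeps single spaces between tags (<i>Subst.</i> <br/> Buch (n)), which renders the same as the hand-written version above.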
07-01-2016, 05:51 AM | #3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2016
Device: Kindle PW3
|
That's a good alternative I hadn't considered; thanks! I'll look into those apps first of all.
|
07-01-2016, 06:38 AM | #4 | |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
(Select Tools > Settings > Lookup > Enable Lookup in any Windows application.) Obviously, the dictionary needs to be DRM-free.
Last edited by Doitsu; 07-01-2016 at 06:42 AM. |
|
07-02-2016, 12:14 PM | #5 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
A linear search of the dictionary is of course the slowest possible way to do it. Given that you have programming skills, you could extract all the index entries to a separate, sorted file, storing the file offset of each one. You could then do a binary search of this index file, which would be enormously faster (e.g. it'll find any word in a 60,000-word dictionary with at most 16 comparisons, rather than the average of 30,000 comparisons that a linear search requires).
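As a rough sketch of that idea in Python, using the standard bisect module: the helper names make_sorted_index and lookup are invented for this example, and the (headword, offset) pairs are assumed to have been extracted from the unpacked HTML beforehand.

```python
from bisect import bisect_left

def make_sorted_index(pairs):
    """Sort (headword, offset) pairs and split them into two parallel lists."""
    pairs = sorted(pairs)
    words = [w for w, _ in pairs]
    offsets = [o for _, o in pairs]
    return words, offsets

def lookup(words, offsets, word):
    """Binary-search the sorted headwords; return the file offset or None."""
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return offsets[i]
    return None
```

With the offset in hand, the entry can be read with f.seek(offset) on the file opened in binary mode. If the whole index fits in memory, a plain Python dict would do the same job; binary search matters mainly when the index stays on disk.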
|
07-02-2016, 09:10 PM | #6 | |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2016
Device: Kindle PW3
|
Quote:
In fact I think this is how mobipocket does it natively - the anchors in the file have hrefs/ids "fileposxyz" which seem to be byte indices or something. |
|
07-03-2016, 09:52 AM | #7 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2016
Device: Kindle PW3
|
Another irritation has presented itself: the dictionary I currently want to convert doesn't appear to follow the format rules properly. The entry text is not inside the <idx:entry> tag but simply follows it in the pseudo-HTML. For this file it seems sufficient to output every <div> that appears after an <idx:entry> until the start of the next <idx:entry>.
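A minimal Python sketch of that workaround, assuming (as in this particular file) that the <div> elements are never nested; the function name divs_after_entry is just for illustration, and entry_pos is taken to be the string position of the <idx:entry tag found earlier.

```python
import re

ENTRY_START_RE = re.compile(r"<idx:entry\b")
# Assumption: divs are not nested, so the first </div> closes the <div>.
DIV_RE = re.compile(r"<div\b.*?</div>", re.S)

def divs_after_entry(html: str, entry_pos: int) -> list[str]:
    """Return each <div>...</div> between the <idx:entry at entry_pos and the next one."""
    nxt = ENTRY_START_RE.search(html, entry_pos + 1)
    end = nxt.start() if nxt else len(html)  # last entry runs to end of file
    return DIV_RE.findall(html, entry_pos, end)
```

A dictionary with nested divs would need a real parser, or at least a depth counter, in place of the regex.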
|