MobiPocket dictionary index extraction/format/method

Fish-Face · 06-30-2016, 04:08 PM

Hi all,

I'm trying to do something a bit weird: I like the experience of looking up words on Kindle so much that I'd like to be able to do something a bit similar on the PC, rather than going to my favourite online dictionary. Therefore I envisage first of all a program which accepts a word and a dictionary file, and which should render the appropriate entry in the dictionary.

Now, I have unpacked the .mobi of my dictionary with KindleUnpack, had a look through the HTML and see that the entries are there and use tags like <idx:orth value="word"> and <idx:iform value="inflectedword"/> to describe the entries which are then given in ordinary HTML. I can, using any old XML parser, scan through the .html file and search for the given word, and return the relevant HTML: it would then be easy to use webkit or another renderer to render it.

The problem is that the unpacked HTML is 50-odd megabytes, so if I enter a word beginning with Z I have to wait 10 seconds or so to scan through everything. My little Kindle handles it faster than this, so I'd like to do better. Now first off, I'm looping through in Python, which is a slow language. I could do better if I wrote it in C++, but that's a pain. But presumably, when a .mobi file is created, all those indices are assembled together in some manner for quick lookup of the relevant locations.

I don't see any evidence of this in the unpacked file, but it's quite possible that KindleUnpack just discards this information. I'd like to know if it's possible to extract this information or, failing that, what format it takes so that I can create something similar. The idea would then be a method which I can use to generate my own lookup table for any KindleUnpack-extracted file for rapid indexing into the XML. The trouble is that the most obvious way I can think of - just noting the byte position in the file where the relevant data starts - doesn't work well with my current method of using a standard (and therefore fast) XML parser to extract the info. I could instead skip the parser when extracting the correct location (in my example dictionary, the next <div> after the <idx:...> tag contains the entry, and there is no nesting etc, so I could get it with a simple regex) but this would then discard all the ancestor elements which might be useful, and might break with unusual dictionaries. Taking this approach I may as well do something even more straightforward and just search the file for value="..." instances and not bother with XML parsing at all.

Any thoughts?
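
The byte-position lookup table floated in the post above could look roughly like the sketch below. It assumes the unpacked file marks headwords with <idx:orth value="..."> and inflections with <idx:iform value="..."/>, as described; `TAG_RE` and `build_offset_index` are hypothetical names, and a real dictionary may omit the `value` attribute or lay out its tags differently.

```python
import re

# Assumption: headwords and inflected forms carry a value="..." attribute.
TAG_RE = re.compile(rb'<idx:(orth|iform)[^>]*?\bvalue="([^"]*)"')


def build_offset_index(data: bytes) -> dict[str, list[int]]:
    """Map each headword or inflected form to the byte offsets of its <idx:orth> tag."""
    index: dict[str, list[int]] = {}
    current_orth = None  # offset of the most recent <idx:orth> tag
    for match in TAG_RE.finditer(data):
        if match.group(1) == b"orth":
            current_orth = match.start()
        if current_orth is None:
            continue  # an inflection before any headword: nothing to point at
        word = match.group(2).decode("utf-8", errors="replace").lower()
        offsets = index.setdefault(word, [])
        if current_orth not in offsets:
            offsets.append(current_orth)
    return index
```

The file would be read once in binary mode (`open(path, "rb").read()`) so that the offsets are true byte positions; a lookup then becomes a dictionary access followed by a slice of `data` starting at the stored offset.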

Doitsu · 07-01-2016, 03:18 AM

Since you appear to have the necessary editing skills, I'd recommend that you convert the decompiled dictionary to a Babylon source file (GLS) using regular expressions and/or sed.

For example, you'd need to convert the following English-German sample dictionary:

Code:

<html>
<body>

<idx:entry>
	<b><idx:orth>book
	<idx:infl>
		<idx:iform value="books"/>
	</idx:infl>
	</idx:orth> </b> 
	<i>Subst.</i> <br/>
	Buch (n)
</idx:entry>
<hr/>
<idx:entry>
	<b><idx:orth>go
	<idx:infl>
		<idx:iform value="goes"/>
		<idx:iform value="going"/>
		<idx:iform value="went"/>
		<idx:iform value="gone"/>
	</idx:infl>
	</idx:orth> </b> 
	<i>Verb</i> <br/>
	gehen
</idx:entry>


</body>
</html>

as follows:

Code:

#stripmethod=keep
#sametypesequence=h
#bookname=English-German Dictionary

book|books
<i>Subst.</i><br/>Buch (n)

go|goes|going|went|gone
<i>Verb</i><br/>gehen

You could then generate a StarDict dictionary (DICT & IFO) with StarDict Editor or a Babylon glossary file (BGL) with Babylon Glossary Builder.
Both dictionary types can be used with GoldenDict and many other dictionary apps.
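
As a rough illustration of that conversion in Python rather than sed, the sketch below turns entries shaped like the sample above into GLS records. All names (`entry_to_gls`, `convert_to_gls` and the regexes) are hypothetical, and it assumes every entry has a non-self-closing <idx:orth> element with the definition following it inside <idx:entry>. Whitespace is collapsed to single spaces rather than removed, so the output differs cosmetically from the hand-written example.

```python
import re

ENTRY_RE = re.compile(r"<idx:entry\b[^>]*>(.*?)</idx:entry>", re.S)
ORTH_RE = re.compile(r"<idx:orth\b([^>]*)>(.*?)</idx:orth>", re.S)
VALUE_RE = re.compile(r'\bvalue="([^"]*)"')
IFORM_RE = re.compile(r'<idx:iform\b[^>]*\bvalue="([^"]*)"')
INFL_RE = re.compile(r"<idx:infl\b.*?</idx:infl>", re.S)
TAG_RE = re.compile(r"<[^>]+>")
# Closing tags left dangling once the headword wrapper (e.g. <b>...</b>) is cut.
LEADING_CLOSE_RE = re.compile(r"^\s*(?:</\w+>\s*)+")


def entry_to_gls(entry_html: str) -> str:
    """Turn the inside of one <idx:entry> into a 'headword|forms' + body record."""
    orth = ORTH_RE.search(entry_html)
    if orth is None:
        return ""
    attrs, inner = orth.group(1), orth.group(2)
    value = VALUE_RE.search(attrs)
    if value:
        headword = value.group(1)
    else:
        # Headword is the text content of <idx:orth>, minus the inflection block.
        headword = TAG_RE.sub("", INFL_RE.sub("", inner)).strip()
    forms = IFORM_RE.findall(inner)
    body = LEADING_CLOSE_RE.sub("", entry_html[orth.end():])
    body = " ".join(body.split())  # collapse newlines and tabs
    return "|".join([headword] + forms) + "\n" + body + "\n"


def convert_to_gls(html: str, bookname: str) -> str:
    """Build a GLS source text with the three header lines shown above."""
    header = f"#stripmethod=keep\n#sametypesequence=h\n#bookname={bookname}\n\n"
    records = [entry_to_gls(m.group(1)) for m in ENTRY_RE.finditer(html)]
    return header + "\n".join(r for r in records if r)
```

Calling `convert_to_gls(html, "English-German Dictionary")` on the sample and writing the result to a .gls file would give the input for the tools mentioned above.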

Fish-Face · 07-01-2016, 05:51 AM

That's a good alternative I hadn't considered; thanks! I'll look into those apps first of all.

Doitsu · 07-01-2016, 06:38 AM

Quote:

Originally Posted by Fish-Face

That's a good alternative I hadn't considered; thanks! I'll look into those apps first of all.

BTW, you could also use the old Mobipocket Reader as a popup dictionary app.
(Select Tools > Settings > Lookup > Enable Lookup in any Windows application.)

Obviously, the dictionary needs to be DRM-free.

HarryT · 07-02-2016, 12:14 PM

Doing a linear search of the dictionary is of course the slowest possible way to do it. Given that you have programming skills, you could extract all the index entries to a separate file, storing the file offset of each one. You could then do a binary search of this index file, which would be enormously faster (eg it'll find any word in a 60,000-word dictionary with at most 16 comparisons, rather than the average of 30,000 comparisons that the linear search will require).
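
A minimal in-memory sketch of that idea, using Python's standard `bisect` module: the (word, offset) pairs are assumed to have been extracted already, and `make_index`, `lookup` and `worst_case_steps` are hypothetical helper names. The same sorted data could just as well be written to a separate index file.

```python
import bisect


def make_index(pairs):
    """Sort (word, offset) pairs and split them into two parallel lists."""
    ordered = sorted(pairs)
    return [word for word, _ in ordered], [offset for _, offset in ordered]


def lookup(words, offsets, word):
    """Binary-search the sorted word list; return the file offset or None."""
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return offsets[i]
    return None


def worst_case_steps(n: int) -> int:
    """Upper bound on halving steps for n sorted entries: floor(log2 n) + 1."""
    return n.bit_length()
```

For a 60,000-entry index `worst_case_steps(60_000)` gives 16, matching the figure quoted above. Once the offset is known, the entry itself can be fetched with `f.seek(offset)` on the dictionary file opened in binary mode.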

Fish-Face · 07-02-2016, 09:10 PM

Quote:

Originally Posted by HarryT

Doing a linear search of the dictionary is of course the slowest possible way to do it. Given that you have programming skills, you could extract all the index entries to a separate file, storing the file offset of each one. You could then do a binary search of this index file, which would be enormously faster (eg it'll find any word in a 60,000-word dictionary with at most 16 comparisons, rather than the average of 30,000 comparisons that the linear search will require).

I'd even considered doing a binary search within the dictionary file, since it's laid out in alphabetical order. I'm now resigned to being unable to parse the file "properly" if I keep it as the Mobipocket file. By that I mean, I will have to have an index into the file (as a byte position) - so the HTML hierarchy will be lost or rather, have to be assumed, since we don't know how far into the file relevant structure may be.

In fact I think this is how Mobipocket does it natively - the anchors in the file have hrefs/ids "fileposxyz" which seem to be byte indices or something.

Fish-Face · 07-03-2016, 09:52 AM

Another irritation presented itself, which is that the dictionary I currently want to convert appears to not follow the format rules properly, namely the dictionary entries are not within the <idx:entry> tag, but simply appear afterwards in the pseudo-HTML. For this file it seems sufficient to output every <div> that appears after the <idx:entry> until the start of the next <idx:entry>.
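
The workaround described in the last post could be sketched as follows. `entry_divs` is a hypothetical helper, and it relies on the observation made earlier in the thread that the <div> elements in this dictionary are not nested; nested divs would need a real parser.

```python
import re

ENTRY_OPEN_RE = re.compile(r"<idx:entry\b")
# Assumption: divs are flat (no <div> inside another <div>).
DIV_RE = re.compile(r"<div\b[^>]*>.*?</div>", re.S)


def entry_divs(html: str, start: int) -> list[str]:
    """Return every <div> between the <idx:entry> at `start` and the next one."""
    next_entry = ENTRY_OPEN_RE.search(html, start + 1)
    end = next_entry.start() if next_entry else len(html)
    return DIV_RE.findall(html, start, end)
```

Here `start` is the character position of an <idx:entry> opening tag in the decoded text, for example one found by a prior scan or index lookup.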



