Old 06-30-2016, 04:08 PM   #1
Fish-Face
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2016
Device: Kindle PW3
MobiPocket dictionary index extraction/format/method

Hi all,

I'm trying to do something a bit weird: I like the experience of looking up words on Kindle so much that I'd like to be able to do something a bit similar on the PC, rather than going to my favourite online dictionary. Therefore I envisage first of all a program which accepts a word and a dictionary file, and which should render the appropriate entry in the dictionary.

Now, I have unpacked the .mobi of my dictionary with KindleUnpack and looked through the HTML. The entries are there, marked up with tags like <idx:orth value="word"> and <idx:iform value="inflectedword"/>, and the entry text itself is given in ordinary HTML. I can, using any old XML parser, scan through the .html file, search for the given word, and return the relevant HTML; it would then be easy to use WebKit or another renderer to render it.

The problem is that the unpacked HTML is 50-odd megabytes, so if I enter a word beginning with Z I have to wait 10 seconds or so to scan through everything. My little Kindle handles it faster than this, so I'd like to do better. Now first off, I'm looping through in Python, which is a slow language. I could do better if I wrote it in C++, but that's a pain. But presumably, when a .mobi file is created, all those indices are assembled together in some manner for quick lookup of the relevant locations.

I don't see any evidence of this in the unpacked file, but it's quite possible that KindleUnpack just discards this information. I'd like to know if it's possible to extract it or, failing that, what format it takes so that I can create something similar. The idea would then be a method I can use to generate my own lookup table for any KindleUnpack-extracted file, for rapid indexing into the XML. The trouble is that the most obvious way I can think of - just noting the byte position in the file where the relevant data starts - doesn't work well with my current method of using a standard (and therefore fast) XML parser to extract the info. I could skip the parser when extracting the correct location (in my example dictionary, the next <div> after the <idx:...> tag contains the entry, and there is no nesting etc., so I could get it with a simple regex), but this would discard all the ancestor elements, which might be useful, and might break with unusual dictionaries. If I take this approach I may as well do something even more straightforward and just search the file for value="..." instances and not bother with XML parsing at all.
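
For illustration, here is a minimal sketch of the "just search for value="..." and note the byte position" idea. It assumes the headwords appear as <idx:orth value="..."> attributes and that an entry runs up to the next such tag; the function names and the sample data are made up for the example, and a real dictionary may be laid out differently.

```python
import re

# Assumed markup: headwords stored as <idx:orth value="...">.
ORTH_RE = re.compile(rb'<idx:orth\s+value="([^"]*)"')


def build_offset_table(data):
    """Map each headword to the byte offset of its first <idx:orth> tag."""
    table = {}
    for match in ORTH_RE.finditer(data):
        word = match.group(1).decode("utf-8", errors="replace")
        table.setdefault(word, match.start())  # keep the first occurrence
    return table


def entry_slice(data, table, word):
    """Return the raw bytes from the word's tag up to the next <idx:orth> tag."""
    start = table.get(word)
    if start is None:
        return None
    nxt = ORTH_RE.search(data, start + 1)
    end = nxt.start() if nxt else len(data)
    return data[start:end]


# Hypothetical two-entry dictionary, standing in for the 50 MB file.
sample = (b'<idx:entry><idx:orth value="apple"/></idx:entry><div>A fruit.</div>'
          b'<idx:entry><idx:orth value="zebra"/></idx:entry><div>An animal.</div>')
table = build_offset_table(sample)
print(entry_slice(sample, table, "zebra"))
```

The table is built with one pass over the file; after that each lookup is a dictionary access plus a short forward scan. As noted above, the slice carries no information about ancestor elements.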

Any thoughts?
Old 07-01-2016, 03:18 AM   #2
Doitsu
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Since you appear to have the necessary editing skills, I'd recommend that you convert the decompiled dictionary to a Babylon source file (GLS) using regular expressions and/or sed.

For example, you'd need to convert the following English-German sample dictionary:

Code:
<html>
<body>

<idx:entry>
	<b><idx:orth>book
	<idx:infl>
		<idx:iform value="books"/>
	</idx:infl>
	</idx:orth> </b> 
	<i>Subst.</i> <br/>
	Buch (n)
</idx:entry>
<hr/>
<idx:entry>
	<b><idx:orth>go
	<idx:infl>
		<idx:iform value="goes"/>
		<idx:iform value="going"/>
		<idx:iform value="went"/>
		<idx:iform value="gone"/>
	</idx:infl>
	</idx:orth> </b> 
	<i>Verb</i> <br/>
	gehen
</idx:entry>


</body>
</html>
as follows:

Code:
#stripmethod=keep
#sametypesequence=h
#bookname=English-German Dictionary

book|books
<i>Subst.</i><br/>Buch (n)

go|goes|going|went|gone
<i>Verb</i><br/>gehen
You could then generate a StarDict dictionary (DICT & IFO) with StarDict Editor or a Babylon glossary file (BGL) with Babylon Glossary Builder.
Both dictionary types can be used with GoldenDict and many other dictionary apps.
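
As a rough sketch of that conversion in Python rather than sed: the regular expressions below are written against the sample markup above only, and the header lines are copied from the example output. The function names are invented for illustration, and other dictionaries may need different patterns.

```python
import re

# Patterns for the sample markup shown above; real dictionaries may differ.
ENTRY_RE = re.compile(r"<idx:entry>(.*?)</idx:entry>", re.S)
ORTH_RE = re.compile(r"<idx:orth>(.*?)</idx:orth>", re.S)
INFL_RE = re.compile(r"<idx:infl>.*?</idx:infl>", re.S)
IFORM_RE = re.compile(r'<idx:iform\s+value="([^"]*)"\s*/>')


def entry_to_gls(entry_html):
    """Turn the inside of one <idx:entry> into 'headword|forms' plus a definition line."""
    orth = ORTH_RE.search(entry_html)
    if orth is None:
        raise ValueError("entry has no <idx:orth> element")
    forms = IFORM_RE.findall(orth.group(1))
    headword = INFL_RE.sub("", orth.group(1)).strip()
    body = entry_html[orth.end():]
    body = re.sub(r"^\s*</b>", "", body)             # drop the </b> left after the headword
    body = re.sub(r"\s*<br\s*/>\s*", "<br/>", body)  # tighten whitespace around <br/>
    body = re.sub(r"\s+", " ", body).strip()         # collapse remaining whitespace
    return "|".join([headword] + forms) + "\n" + body


def html_to_gls(html, bookname):
    """Build the GLS text: header lines, then entries separated by blank lines."""
    header = "#stripmethod=keep\n#sametypesequence=h\n#bookname=" + bookname + "\n\n"
    entries = [entry_to_gls(m.group(1)) for m in ENTRY_RE.finditer(html)]
    return header + "\n\n".join(entries) + "\n"


SAMPLE = """<idx:entry>
    <b><idx:orth>book
    <idx:infl>
        <idx:iform value="books"/>
    </idx:infl>
    </idx:orth> </b>
    <i>Subst.</i> <br/>
    Buch (n)
</idx:entry>
<hr/>
<idx:entry>
    <b><idx:orth>go
    <idx:infl>
        <idx:iform value="goes"/>
        <idx:iform value="going"/>
    </idx:infl>
    </idx:orth> </b>
    <i>Verb</i> <br/>
    gehen
</idx:entry>"""

print(html_to_gls(SAMPLE, "English-German Dictionary"))
```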
Old 07-01-2016, 05:51 AM   #3
Fish-Face
That's a good alternative I hadn't considered; thanks! I'll look into those apps first of all.
Old 07-01-2016, 06:38 AM   #4
Doitsu
Quote:
Originally Posted by Fish-Face View Post
That's a good alternative I hadn't considered; thanks! I'll look into those apps first of all.
BTW, you could also use the old Mobipocket Reader as a popup dictionary app.
(Select Tools > Settings > Lookup > Enable Lookup in any Windows application.)

Obviously, the dictionary needs to be DRM-free.

Last edited by Doitsu; 07-01-2016 at 06:42 AM.
Old 07-02-2016, 12:14 PM   #5
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Doing a linear search of the dictionary is of course the slowest possible way to do it. Given that you have programming skills, you could extract all the index entries to a separate file, storing the file offset of each one. You could then do a binary search of this index file, which would be enormously faster (eg it'll find any word in a 60,000-word dictionary with at most 16 comparisons, rather than the average of 30,000 comparisons that the linear search will require).
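
A minimal sketch of such an index in Python, using the standard bisect module. The (headword, offset) pairs here are invented, the index is kept in memory rather than in a separate file, and the ordering is plain Python string order, which may not match the collation a real dictionary uses. The figure of 16 comes from 60,000 being less than 2^16 = 65,536.

```python
import bisect


def build_index(pairs):
    """Sort (headword, offset) pairs and split them into two parallel lists."""
    ordered = sorted(pairs)
    words = [word for word, _ in ordered]
    offsets = [offset for _, offset in ordered]
    return words, offsets


def lookup(words, offsets, word):
    """Binary-search the sorted headwords; return the file offset or None."""
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return offsets[i]
    return None


# Hypothetical index entries: headword and byte offset in the HTML file.
words, offsets = build_index([("zebra", 900), ("apple", 0), ("go", 412)])
print(lookup(words, offsets, "go"))
print(lookup(words, offsets, "missing"))
```

With the offset in hand, the entry can be read by seeking to that position in the HTML file instead of scanning from the start.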
Old 07-02-2016, 09:10 PM   #6
Fish-Face
Quote:
Originally Posted by HarryT View Post
Doing a linear search of the dictionary is of course the slowest possible way to do it. Given that you have programming skills, you could extract all the index entries to a separate file, storing the file offset of each one. You could then do a binary search of this index file, which would be enormously faster (eg it'll find any word in a 60,000-word dictionary with at most 16 comparisons, rather than the average of 30,000 comparisons that the linear search will require).
I'd even considered doing a binary search within the dictionary file, since it's laid out in alphabetical order. I'm now resigned to being unable to parse the file "properly" if I keep it as the mobipocket file. By that I mean, I will have to have an index into the file (as a byte position) - so the HTML hierarchy will be lost or rather, have to be assumed, since we don't know how far into the file relevant structure may be.
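
For what it's worth, a binary search directly within the file might look something like the sketch below. It bisects on byte positions and, from each midpoint, scans forward to the next headword tag. It assumes the headwords are stored as <idx:orth value="..."> and appear in ascending byte order, which a real dictionary (with case folding or locale-specific sorting) may well not satisfy; the function names are made up.

```python
import re

# Assumed markup: headwords stored as <idx:orth value="...">.
ORTH_RE = re.compile(rb'<idx:orth\s+value="([^"]*)"')


def next_headword(data, pos):
    """Return (headword, tag_offset) for the first <idx:orth> tag at or after pos."""
    match = ORTH_RE.search(data, pos)
    if match is None:
        return None
    return match.group(1), match.start()


def find_in_sorted(data, word):
    """Bisect on byte offsets; assumes headwords are in ascending byte order."""
    lo, hi = 0, len(data)
    while lo < hi:
        mid = (lo + hi) // 2
        found = next_headword(data, mid)
        if found is None or found[0] >= word:
            hi = mid             # target tag starts at or after lo, not after mid
        else:
            lo = found[1] + 1    # everything up to this tag sorts before the word
    found = next_headword(data, lo)
    if found is not None and found[0] == word:
        return found[1]
    return None


# Hypothetical sorted dictionary data.
data = (b'<idx:orth value="apple"/><div>A</div>'
        b'<idx:orth value="go"/><div>B</div>'
        b'<idx:orth value="zebra"/><div>C</div>')
print(find_in_sorted(data, b"go"))
```

The return value is just a byte position, so the caveat above still applies: whatever HTML structure surrounds the entry has to be assumed rather than parsed.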

In fact I think this is how mobipocket does it natively - the anchors in the file have hrefs/ids "fileposxyz" which seem to be byte indices or something.
Old 07-03-2016, 09:52 AM   #7
Fish-Face
Another irritation presented itself: the dictionary I currently want to convert doesn't seem to follow the format rules properly. The dictionary entries are not inside the <idx:entry> tag, but simply appear after it in the pseudo-HTML. For this file it seems sufficient to output every <div> that appears after the <idx:entry> until the start of the next <idx:entry>.
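
A small sketch of that workaround, assuming the <div> elements are not nested (as in this file) and that the start offset of the <idx:entry> tag is already known. The helper name and sample HTML are invented for the example.

```python
import re

ENTRY_OPEN_RE = re.compile(r"<idx:entry\b")
# Non-greedy match of one <div>...</div>; only valid when divs are not nested.
DIV_RE = re.compile(r"<div\b[^>]*>.*?</div>", re.S)


def divs_after_entry(html, entry_start):
    """Collect the <div> blocks between one <idx:entry> and the next."""
    nxt = ENTRY_OPEN_RE.search(html, entry_start + 1)
    end = nxt.start() if nxt else len(html)
    return DIV_RE.findall(html[entry_start:end])


# Hypothetical file layout: definitions follow the entry tag instead of sitting inside it.
html = ('<idx:entry><idx:orth value="go"/></idx:entry>'
        '<div>Verb</div><div>gehen</div>'
        '<idx:entry><idx:orth value="book"/></idx:entry>'
        '<div>Buch</div>')
print(divs_after_entry(html, 0))
```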