MobileRead Forums - View Single Post - MobiPocket dictionary index extraction/format/method

Fish-Face · 07-02-2016, 09:10 PM

Quote:

Originally Posted by HarryT

Doing a linear search of the dictionary is of course the slowest possible way to do it. Given that you have programming skills, you could extract all the index entries to a separate file, storing the file offset of each one. You could then do a binary search of this index file, which would be enormously faster (e.g. it'll find any word in a 60,000-word dictionary with at most 16 comparisons, rather than the average of 30,000 comparisons that the linear search will require).

I'd even considered doing a binary search within the dictionary file itself, since it's laid out in alphabetical order. I'm now resigned to being unable to parse the file "properly" if I keep it as the mobipocket file. By that I mean I will have to keep an index into the file as byte positions, so the HTML hierarchy will be lost, or rather will have to be assumed: when we jump straight to a byte position, we don't know how far away in the file the relevant structure may be.
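To make the byte-position index idea concrete, here is a minimal sketch in Python. It does not parse a real mobipocket file: it assumes a hypothetical flat layout of one entry per line ("headword<TAB>body") already sorted in plain code-point order, and the names build_index and lookup are made up for illustration. The point is only the shape of the approach: one linear scan to record offsets, then a binary search plus a seek for each lookup.

```python
import bisect
import io


def build_index(f):
    """Scan the file once and record each headword with its byte offset.

    Assumes a hypothetical layout: one entry per line, "headword<TAB>body",
    sorted by headword in code-point order. Returns two parallel lists.
    """
    keys, offsets = [], []
    while True:
        offset = f.tell()          # byte position where this entry starts
        line = f.readline()
        if not line:
            break
        keys.append(line.split(b"\t", 1)[0].decode("utf-8"))
        offsets.append(offset)
    return keys, offsets


def lookup(f, keys, offsets, word):
    """Binary-search the headwords, then seek straight to the entry."""
    i = bisect.bisect_left(keys, word)
    if i == len(keys) or keys[i] != word:
        return None                # word is not in the index
    f.seek(offsets[i])
    return f.readline().decode("utf-8").rstrip("\n")


if __name__ == "__main__":
    # Toy stand-in for the dictionary file (must be opened in binary mode).
    data = io.BytesIO(b"apple\tA fruit.\nbanana\tAnother fruit.\n")
    keys, offsets = build_index(data)
    print(lookup(data, keys, offsets, "banana"))
```

The index only tells you where an entry starts, which is exactly the trade-off described above: whatever markup encloses that position has to be assumed rather than parsed. For scale, a binary search over 60,000 headwords needs at most ceil(log2(60000)) = 16 comparisons.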

In fact I think this is how mobipocket does it natively - the anchors in the file have hrefs/ids of the form "fileposxyz", where the number seems to be a byte offset into the file or something similar.
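If those anchors really are byte offsets, they could be collected with something like the sketch below. This is a guess at the format rather than a description of it: it assumes the attributes look like id="filepos0001234" or href="#filepos0001234" with a decimal number, and the name filepos_targets is made up for illustration.

```python
import re

# Assumed attribute shape: id="filepos<digits>" or href="#filepos<digits>".
FILEPOS_RE = re.compile(rb'(?:id|href)="#?filepos(\d+)"')


def filepos_targets(html: bytes):
    """Return the distinct filepos numbers found in the markup, sorted."""
    return sorted({int(m.group(1)) for m in FILEPOS_RE.finditer(html)})


if __name__ == "__main__":
    sample = b'<a href="#filepos0000120">see</a><a id="filepos0000120"></a>'
    print(filepos_targets(sample))
```

Whether each number actually lands on the start of an entry would still need checking against the raw file, for example by seeking to a few of them and looking at the bytes that follow.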