Old 11-04-2012, 12:56 PM   #25
AlPe
Digital Amanuensis
Quote:
Originally Posted by tshering
This "words" file is used to determine whether a looked-up expression can be found in the respective language dictionary, and maybe to provide some other information.
To go from a query W to its definition, all you need from the file "words" is: 1) which chunk (file .html(.gz)) contains the word W and its definition; 2) the "position" of word W in that chunk.

For 1), one usually assigns an ID to each chunk, in lexicographical order of the file names: 11.html is 0, aa.html is 1, etc.
For 2), an easy way is to store, for each word W, the offset in bytes, counted from the beginning of the chunk, at which the definition of W starts.

(The dictionary is split into several chunks to allow faster fetch-decompress-find operations.)
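
To make points 1) and 2) concrete, here is a minimal Python sketch of a chunk-ID-plus-offset index. Everything specific in it is an assumption for illustration only: the value layout (two little-endian 32-bit integers), the `<p>WORD:` entry markup, the sample chunks, and all function names are hypothetical, and a plain dict stands in for the real index. The actual layout of the values in the "words" file is still unknown.

```python
import gzip
import struct

# Hypothetical value layout: chunk ID and byte offset, two little-endian uint32.
VALUE_FORMAT = "<II"


def assign_chunk_ids(chunk_names):
    """Give each chunk file an integer ID, in lexicographical order of its name."""
    return {name: i for i, name in enumerate(sorted(chunk_names))}


def build_index(chunks, chunk_ids, words_by_chunk):
    """Map each word to a packed (chunk ID, offset in the decompressed chunk)."""
    index = {}
    for name, words in words_by_chunk.items():
        data = gzip.decompress(chunks[name])
        for word in words:
            # Assumed entry markup: each definition starts with b"<p>WORD:".
            offset = data.index(b"<p>" + word.encode("utf-8") + b":")
            index[word] = struct.pack(VALUE_FORMAT, chunk_ids[name], offset)
    return index


def lookup(word, index, chunks, chunk_ids):
    """Fetch the chunk, decompress it, and return the entry found at the offset."""
    if word not in index:
        return None
    chunk_id, offset = struct.unpack(VALUE_FORMAT, index[word])
    id_to_name = {i: name for name, i in chunk_ids.items()}
    data = gzip.decompress(chunks[id_to_name[chunk_id]])
    # Assumed entry markup: each definition ends at the first b"</p>".
    end = data.index(b"</p>", offset) + len(b"</p>")
    return data[offset:end]


# Two tiny in-memory chunks standing in for the .html.gz files (made-up content).
chunks = {
    "aa.html": gzip.compress(
        b"<p>aardvark: an animal</p><p>abacus: a counting frame</p>"
    ),
    "11.html": gzip.compress(b"<p>1st: first</p>"),
}
chunk_ids = assign_chunk_ids(chunks)
index = build_index(
    chunks,
    chunk_ids,
    {"aa.html": ["aardvark", "abacus"], "11.html": ["1st"]},
)

print(chunk_ids)
print(lookup("abacus", index, chunks, chunk_ids))
```

With a layout like this, a lookup only has to decompress the one chunk that holds the word, which is the reason for splitting the dictionary in the first place.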

See my analysis of the Cybook Odyssey dictionaries at: http://www.albertopettarin.it/penelope.html

Quote:
Originally Posted by tshering
If we knew what the values of the key-value pairs consist of, we could build our own "words" file. This in turn would enable us to insert new entries into the language dictionaries, or to build a new dictionary from scratch. What the values look like should be easy to ascertain with the marisa tools. However, I failed in my attempts. Therefore, I can only speculate about it.
That's the point. I haven't had the chance to play with the marisa-lib yet. Understanding the content of the (decompressed) index (file "words") is the key.