Old 10-28-2019, 03:09 AM   #10
rtiangha
Evangelist
 
Posts: 495
Karma: 356531
Join Date: Jul 2016
Location: 'burta, Canada
Device: Kobo Glo HD
OK, maybe this isn't as trivial as I thought. If I'm understanding this correctly, all Marisa keeps track of is a key (in this case, the headword) and an id (a number). It's up to you to use that id for your own purposes, but unless it's tied to a record somehow, that id is useless, and Marisa in this case is only really useful for super-fast fuzzy searches on headwords. The database of definitions is the html files themselves, and while I might be overthinking this, I think you still have to write the logic to extract the first two letters of the headword to find the correct html file, and then parse that file to find the right definition, unless a library call already exists that does just that (maybe Kobo wrote one for themselves).
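That "first two letters" lookup logic is tiny, at least. Here's a rough sketch of what I mean; the function name is mine, and the exact rules Kobo uses for one-letter or non-alphabetic headwords are a guess on my part:

```python
def html_file_for(headword):
    """Guess which html file inside a Kobo dicthtml holds this headword.

    Assumes Kobo's convention of grouping definitions into files named
    after the first two letters of the headword (e.g. "ca.html" for
    "cat"). The real files are gzipped inside the dictionary zip, so
    you'd still need to decompress before parsing.
    """
    prefix = headword[:2].lower()
    return prefix + ".html"
```

So looking up "Cat" would point you at "ca.html", and from there it's all parsing.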

So now I'm wondering if what needs to be done instead is to use Marisa to rebuild the original list of headwords from its index file, and then use that as a guide to go through all the html files and ingest the definitions so that Penelope can manipulate them. In fact, since the headwords themselves are in the html files, I don't think reading the Marisa index file is needed in the first place, because you can regenerate the original words file from the html entries themselves. In which case, that's a lot of string manipulation, which I've always been weak at, and in a programming language I'm not familiar with in the first place. On the plus side, the XML seems consistent (i.e. each definition is enclosed in <w> tags and the headwords are in <a name= > tags), so I assume there's an XML library that makes it easy to parse and manipulate that stuff (although I'm not well versed in XML either, so I don't know).
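Since the structure is that regular, even a regex gets you pretty far. A minimal sketch of regenerating the words list straight from the html (the sample fragment and function name are made up; the &lt;w&gt;/&lt;a name=&gt; layout is just my reading of the format, and a proper HTML parser would be safer than regexes for real files):

```python
import re

# Hypothetical fragment mimicking the structure described above:
# each definition in its own <w> block, headword in an <a name=...> tag.
sample = ('<w><a name="cat"/><p>cat: a small feline</p></w>'
          '<w><a name="dog"/><p>dog: a loyal canine</p></w>')

def extract_entries(html):
    """Return (headword, definition_block) pairs from one html file."""
    entries = []
    for block in re.findall(r"<w>.*?</w>", html, flags=re.DOTALL):
        m = re.search(r'<a name="([^"]+)"', block)
        if m:
            entries.append((m.group(1), block))
    return entries

# Regenerating the words file is then just collecting the headwords.
words = sorted(headword for headword, _ in extract_entries(sample))
```

Run that over every html file in the dictionary and you've rebuilt the word list without ever touching the Marisa index.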

Also on the plus side, the rest of the code looks consistent in its behaviour, so once the Kobo dictionary data is ingested properly, the rest of Penelope should work the same. But yeah, at the moment I think this might be beyond my skill, at least until I can teach myself the various languages and libraries involved. I might have better luck writing a utility in a language I'm familiar with that merges just Kobo dictionaries, since all you'd need to do is merge (and maybe sort?) the entries in html files with the same name (XSLT looks like it might do the job), then create a combined words list indexed with Marisa, and then zip everything up together. At least the Marisa stuff doesn't look complicated.
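The merge step itself might look something like this: pool the &lt;w&gt; blocks from the two same-named files and sort them by headword. Everything here is a sketch under the same assumptions as before (the &lt;w&gt;/&lt;a name=&gt; layout, regex parsing, and function names are all mine, and I don't know how Kobo handles duplicate headwords):

```python
import re

def entries(html):
    """Split an html fragment into (headword, <w> block) pairs."""
    out = []
    for block in re.findall(r"<w>.*?</w>", html, flags=re.DOTALL):
        m = re.search(r'<a name="([^"]+)"', block)
        if m:
            out.append((m.group(1), block))
    return out

def merge_html(file_a, file_b):
    """Merge two same-named html files from different dictionaries.

    Pools their <w> blocks and sorts by headword. Duplicate headwords
    end up side by side; whether the reader tolerates that is an open
    question.
    """
    merged = sorted(entries(file_a) + entries(file_b),
                    key=lambda pair: pair[0])
    return "".join(block for _, block in merged)

# Hypothetical same-named files from two dictionaries:
file_a = '<w><a name="cow"/><p>cow</p></w>'
file_b = '<w><a name="cat"/><p>cat</p></w>'
combined = merge_html(file_a, file_b)
```

After that it'd just be a matter of feeding the combined headwords to Marisa and zipping the result back up.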

Last edited by rtiangha; 10-28-2019 at 03:47 AM.