03-13-2012, 02:17 PM   #35
KevinH
Sigil Developer
Hi,

Okay, I have some test code that uses the latest version of lxml and html5lib and it can take an epub and process things to recreate the divtbl, skeltbl, and othtbl. I am sure it is full of bugs and missing features but all of that can be fixed.

I know lxml is already part of calibre, but what about html5lib? Is it okay to use that python module as well?

I seem to be stuck on how to take the divtbl, skeltbl, and othtbl data and convert it into the correct set of index records to be stored in the mobi. The code in mobi/writer2/index.py seems to be hard-coded to just the tags used for the ncx and not for general tag maps.

From the mobi_unpack code, I know the tag info (tag number, number of values, tag mask, etc.) for each type of index (skelidx, dividx, othidx). Of course, what a given tag means (tag 2 or tag 3, say) is different for each type of index.

So it seems what I need to do is take mobi/writer2/index.py and change it so that the structured data, the list of ctoc strings, and the tag info (tag number, number of values, mask) are all passed in, and have it do the remainder. Basically, create the reverse of the MobiIndex code in mobi_index.py and then allow whatever needs it to call it.

The smarts for creating the input to this routine would come from the ncx code, skel code, div code, etc., but the code for writing any index (given the data and tag info) would be generic. Effectively we are separating the index writing and reading code from the index interpretation code, since the interpretation changes depending on what type of index is being used.
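
The split described above could be sketched roughly as follows. This is only an illustration, not calibre's or mobi_unpack's actual code: `TagSpec`, `encint`, `encode_entry` and the example tag table are hypothetical names and values. It assumes the conventions as understood from the mobi_unpack reader: forward variable-width integers with the high bit set on the last byte, and a single control byte whose masked bits hold the number of value groups for each tag. It does not handle the case where all bits of a multi-bit mask are set, and it leaves out the entry's identifier text and the ctoc strings.

```python
from collections import namedtuple

# Hypothetical description of one row of a tag table.
TagSpec = namedtuple("TagSpec", "tag values_per_entry mask")


def encint(value):
    """Forward variable-width integer: 7 bits per byte, most significant
    group first, high bit set on the final byte (assumed convention)."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while True:
        out.append(value & 0x7F)
        value >>= 7
        if value == 0:
            break
    out[0] |= 0x80   # least significant group ends up last after reverse()
    out.reverse()
    return bytes(out)


def encode_entry(tag_table, entry):
    """Return control byte + encoded values for one index entry.

    tag_table: sequence of TagSpec, in table order.
    entry: dict mapping tag number -> flat list of integer values.
    """
    control = 0
    payload = bytearray()
    for spec in tag_table:
        values = entry.get(spec.tag)
        if not values:
            continue  # tag absent: its mask bits stay zero
        groups, rem = divmod(len(values), spec.values_per_entry)
        if rem:
            raise ValueError("tag %d: wrong number of values" % spec.tag)
        shift = (spec.mask & -spec.mask).bit_length() - 1
        max_count = spec.mask >> shift
        if groups > max_count or (groups == max_count and max_count > 1):
            # All mask bits set is assumed to mean "length follows";
            # that case is deliberately left out of this sketch.
            raise ValueError("tag %d: count not representable" % spec.tag)
        control |= groups << shift
        for v in values:
            payload += encint(v)
    return bytes([control]) + bytes(payload)


# Illustrative tag table (an assumption, not copied from mobi_unpack).
example_table = [TagSpec(1, 1, 0x03), TagSpec(6, 2, 0x0C)]
record = encode_entry(example_table, {1: [5, 5], 6: [0, 100, 0, 100]})
print(record.hex())  # 0a858580e480e4
```

With this shape, the ncx, skel and div code would only need to build the `entry` dictionaries and supply their own tag table; the encoding step stays the same for every index type.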

Does this sound like the right approach? If so, any hints on how best to approach this would be welcome.

PS: If it helps, the skelidx, dividx, and othidx all use one control byte, as follows:

skelidx 0x0a
dividx 0x0f
othidx 0x03

So we have the tag table and the control byte value for each (and know the number of control bytes is always 1). Perhaps we should be passing the control byte in to the generic index writer as well.
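
As a cross-check on how a control byte relates to a tag table, here is a small sketch of the reading direction. It rests on an assumption about the format: the bits of the control byte selected by each tag's mask, shifted down to bit 0, give the number of value groups stored for that tag. The function name and the example tag table are hypothetical, and the "all mask bits set" case for multi-bit masks is rejected rather than interpreted.

```python
def counts_from_control_byte(control, tag_table):
    """Map tag number -> number of value groups, given one control byte.

    tag_table: iterable of (tag, values_per_entry, mask) tuples.
    """
    counts = {}
    for tag, values_per_entry, mask in tag_table:
        bits = control & mask
        if bits == 0:
            continue  # tag not present in this entry
        shift = (mask & -mask).bit_length() - 1
        if bits == mask and (mask >> shift) > 1:
            # Assumed escape case (explicit length follows); not handled here.
            raise NotImplementedError("tag %d uses the escape form" % tag)
        counts[tag] = bits >> shift
    return counts


# Illustrative tag table (an assumption), read against the skelidx byte 0x0a.
example_table = [(1, 1, 0x03), (6, 2, 0x0C)]
print(counts_from_control_byte(0x0A, example_table))  # {1: 2, 6: 2}
```

If this reading is right, a generic writer could either accept the control byte as a parameter or derive it from the tag table and the values present in each entry.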

Last edited by KevinH; 03-13-2012 at 03:43 PM. Reason: added more info