06-10-2011, 06:25 AM   #7
thydere
Member
Join Date: Nov 2007
Location: Germany
Device: Sony PRS-300
Quote:
Originally Posted by HarryT
I guess that all you can really do in that case is look for suitable points at which to split the file. You could split immediately after an image, or immediately before <Hx> tags, for example.
Along the way I will probably have to semantically analyze the HTML content and do something like this, but I'll set that issue aside for the moment, until it actually becomes a problem and I have more real-world test cases on which to base a decision.
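
For the record, a minimal sketch of what such a split pass might look like, assuming lxml and well-formed, namespaced XHTML input; the function name and the choice of split tags are placeholders of mine, not anything fixed yet:

Code:
# Sketch only: start a new chunk immediately before each <h1>/<h2>;
# adding more tags would cover variants like "split after an image".
from lxml import etree

XHTML = "{http://www.w3.org/1999/xhtml}"

def split_at_headings(xhtml_bytes, split_tags=("h1", "h2")):
    body = etree.fromstring(xhtml_bytes).find(XHTML + "body")
    chunks, current = [], []
    for el in body:
        if (isinstance(el.tag, str)              # skip comments / PIs
                and etree.QName(el).localname in split_tags
                and current):
            chunks.append(current)               # close chunk before heading
            current = []
        current.append(el)
    if current:
        chunks.append(current)
    return chunks                                # one chunk per output file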

Fortunately I do not want to create a general-purpose epub creation program, but a backend library that is intended to be glued to a frontend document parser. The difference is that while a converter like calibre or Stanza has to recreate/guess the document structure (with a little help from the user, in calibre's case), I expect to receive the already-created structure together with the sectionized content. The postprocessing work from that point on is relatively simple: just create the html/toc/stylesheet/image/whatchamacallit files making up the OEBPS part of the epub from the document structure.
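
To illustrate what I mean by "relatively simple", a rough sketch of the backend side; the Section structure and the file naming are purely hypothetical, not any existing library's API:

Code:
# Hypothetical: the frontend hands over an already-sectionized tree,
# and the backend only has to serialize it into the OEBPS files.
import os
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    xhtml: str                          # pre-rendered body content
    children: list = field(default_factory=list)

def write_oebps(sections, outdir):
    os.makedirs(outdir, exist_ok=True)
    toc, n = [], 0
    def walk(secs, depth):
        nonlocal n
        for s in secs:
            n += 1
            href = f"sect{n:03d}.xhtml"
            with open(os.path.join(outdir, href), "w", encoding="utf-8") as f:
                f.write(s.xhtml)
            toc.append((depth, s.title, href))   # later: toc.ncx / OPF spine
            walk(s.children, depth + 1)
    walk(sections, 0)
    return toc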

The big work lies mostly in the front end and the processing pipeline in the middle. It takes a text document, runs it through the appropriate parser (markdown in my case, but that's relatively exchangeable as long as there's HTML plus processing instructions at the end), then parses the resulting HTML looking for xinclude / XML preprocessing directives which describe the further processing of the document (including external sections in the text, resizing images to fit the target resolution, creating images/graphs from inline definitions, including references, running some external program and including the result, cooking coffee, whatever). This process (hopefully) generates a wealth of information about the content of the files, which essentially becomes the structural metadata used by the epub backend to create the ebook - and gives it pointers on where exactly to cut the text into pieces.
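
In Python-ish terms, the middle of that pipeline might boil down to something like this; the "markdown" package and lxml are assumptions on my part, and the handler targets are invented for illustration:

Code:
# Sketch: markdown -> HTML -> dispatch on processing instructions.
# Assumes the markdown output is well-formed enough to parse as XML.
import markdown
from lxml import etree

def run_pipeline(source_text, handlers):
    html = markdown.markdown(source_text)
    root = etree.fromstring("<root>" + html + "</root>")
    for pi in root.iter(etree.ProcessingInstruction):
        handler = handlers.get(pi.target)        # e.g. "include", "resize"
        if handler:
            handler(pi)                          # mutate the tree in place
    return root                                  # plus collected metadata

The handlers dict would be where the "cook coffee" extensibility lives.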

What that means is that I am trying to solve the issue by declaring it to be the problem of the person writing the frontend parser (uhm... which will actually be me again - I knew there was a hole in my theory).

Which doesn't mean that HTML files cannot be preprocessed and used as input - in fact, for my first prototype I used a simple XHTML frontend that works similarly to what you proposed (creating a content tree by parsing for the <hx> elements, copying over the dc: elements and adapting those that differ in their epub form, ...). I tried it on some of the HTML-ized ebooks in my collection and got some nice results out of it.
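
For completeness, the heading-tree part of that prototype reduces to something like the following (again lxml-based and simplified; the dict shape is arbitrary):

Code:
# Sketch: derive a nested content tree from <h1>..<h6>.
from lxml import html as lhtml

def heading_tree(text):
    doc = lhtml.fromstring(text)
    root, stack = [], []                # stack: (level, children-list)
    for h in doc.iter("h1", "h2", "h3", "h4", "h5", "h6"):
        level = int(h.tag[1])
        node = {"title": h.text_content().strip(), "children": []}
        while stack and stack[-1][0] >= level:
            stack.pop()                 # close deeper or same-level headings
        (stack[-1][1] if stack else root).append(node)
        stack.append((level, node["children"]))
    return root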


Once again, thank you both for your input.