MobileRead Forums - View Single Post

Trouhel · 10-21-2011, 10:15 AM

My apologies for the time it took me to answer your last posts. I will be pretty busy until next Wednesday and thus offline most of the time until then.

There's one point I didn't make clear enough in my first post, as I tried to keep it short: this is a document production tool, not a converter. That is, it does not understand just any HTML file; there are a few basic requirements that ensure PDF as well as ePUB can be compiled. The most restrictive ones are that <DIV> tags and plain (attribute-free) <CODE> tags are reserved for special use, and that <DIV> tags cannot contain other <DIV> tags. HTML attributes and CSS properties are also limited to a (fairly large) subset, but that does not prevent processing, as whatever is not understood is discarded. The reason for being this restrictive is the need to be able to produce PDF as well as ePUB from the same sources.
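To make the two structural requirements concrete, here is a minimal sketch of a pre-check in Python. It is not part of the script itself; the names `RequirementChecker` and `check` are hypothetical, and it only looks at nested <DIV> tags and attribute-free <CODE> tags, not at the attribute/CSS subset.

```python
from html.parser import HTMLParser


class RequirementChecker(HTMLParser):
    """Hypothetical pre-check for the two tag rules described above."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # how many <div> elements are currently open
        self.notes = []      # human-readable findings

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <DIV> arrives as "div".
        line = self.getpos()[0]
        if tag == "div":
            if self.div_depth > 0:
                self.notes.append(f"nested <div> at line {line}")
            self.div_depth += 1
        elif tag == "code" and not attrs:
            # Not an error as such: plain <code> is reserved for special use.
            self.notes.append(f"plain <code> (reserved) at line {line}")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def check(html_text):
    """Return a list of notes for the given HTML source."""
    checker = RequirementChecker()
    checker.feed(html_text)
    checker.close()
    return checker.notes


if __name__ == "__main__":
    for note in check("<div><div>inner</div></div>\n<p><code>x</code></p>"):
        print(note)
```

A file that produces no notes may still fail on other requirements; this only illustrates the two tag rules.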

That said, as the script can re-use its XHTML output (step 1) as input, it actually accepts any HTML+CSS files that comply with its requirements: this is what I call "expert" mode, because it requires knowledge of both.
So, to answer Toxaris: if either the "filtered" HTML files fit in, or it is fairly simple to add the code needed to process them to the script, it should work. However, from what I remember of Word-produced HTML, hrefs in general (footnotes, cross-references, etc.) will probably cause problems. The other stumbling block is that, strange as it may sound, there is no way I can get access to a computer running Word.

Anyhow, I have prepared a sample of the documentation (the part about "expert" mode, which deals with the above-mentioned requirements) as OpenOffice-generated .html and .doc files, using fonts that should be available on any Windows flavour. Toxaris, I'll be glad to send them to you, but I can't find where to attach files when sending private messages through MobileRead. Should I post them through the forum? If you can load the .doc files into Word, save them as HTML, filter them and send me back the result, I could take a look. The next step would be to repeat the same process on a sample document containing most of what my script deals with (links, pictures, etc.) to get a better idea.

To answer Doitsu now. First, credit should be given to the ConTeXt team, a professional publishing company that specializes in PDF, whose tool not only allows for producing quality documents but also includes an extended version of the Lua virtual machine that gives access to fonts by incorporating part of the FontForge code; and also to the MobileRead forums, where I picked up many ideas. Apart from that:

1) The CSS file in the ePUB is stripped of line breaks for size considerations. It is identical to the one produced at step 1, except that the latter has line breaks. I believe it was also included in the samples.

2) As far as I remember (I'll have to look back at the code to make sure), page-break attributes/properties come from the OO style, so they're up to the user. Now, there is an "auxiliary" file that describes how to render the PDF when the necessary data cannot be given via HTML, and it usually says chapters should start on a new odd page; this could easily be used to enforce page breaks on h2 tags.

3) Monospaced text in the samples is either left-aligned <PRE> tags (code samples) or <SPAN> tags with a monospaced font-family property in the associated CSS style, including stuff that does not hyphenate. Correct me if I'm wrong, but my understanding is that non-wrapping text is then a reading-system behaviour that cannot be worked around via HTML.
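Regarding point 1 above, the line-break stripping can be sketched in a few lines of Python. This is only an illustration of the idea, not the code the script uses; the function name `strip_line_breaks` is hypothetical, and lines are joined with a single space (rather than nothing) so that selectors split across lines are not glued together.

```python
def strip_line_breaks(css_text):
    """Collapse a stylesheet onto one line (illustrative sketch only)."""
    # Trim each line, drop the empty ones, and join what is left with a space.
    lines = (line.strip() for line in css_text.splitlines())
    return " ".join(line for line in lines if line)


if __name__ == "__main__":
    readable = "h1 {\n  color: red;\n}\n\np { margin: 0; }\n"
    print(strip_line_breaks(readable))
```

Since whitespace between CSS tokens is interchangeable, the one-line result should be equivalent to the readable step 1 file, only smaller.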

I'll focus on the basics at first: cleanup, debugging, and more general design, one of my concerns being how well the produced ePUB files suit reading devices. I might get one soon (I'm waiting for Bookeen to give a price tag to its Odyssey, or for the Kobo Touch to be available here), but meanwhile I'm in the dark as to how they will render on such devices.

Your comments are in any case very useful and have already given me directions for improvement. If there is actually a way to get Word-produced files to work as well that is not too costly as far as coding is concerned, I do agree this would be quite a "plus".

Now, I guess it's going to be a couple of months at least before I get to the end-user version I intend, but further comments are still welcome.
