Old 10-18-2011, 05:16 PM   #1
Trouhel
Enthusiast
Posts: 25
Karma: 10
Join Date: Oct 2011
Device: none
ePUB + PDF creation script

Hello, MobileRead members,

This is my first post here. For my own personal use, I wrote a (Lua) script that:

1. Converts, on a one-to-one basis, Open Office Writer "saved as HTML" documents to "clean" XHTML files plus a single shared CSS 2.1 style sheet. While new tags and styles are computed in the process, the emphasis is on conformity to the original documents' layout. Most Open Office Writer features, such as "simple" tables, figures, footnotes and cross-references, are handled.

2. Builds ePUB electronic books from the resulting XHTML files. XHTML components in the ePUB file may be compiled on a one-to-one basis, or so that each always begins at the same title level and/or does not exceed a user-defined maximum size. Footnotes may be gathered and later flushed at the end of same-level sections (parts, chapters, etc.) or at the end of the whole document.

3. Builds TeX-quality PDF documents from the same XHTML files, using ConTeXt's macro set and formatting engine. This printable version of the document brings in enhancements such as a table of contents, an index, a bibliography and bibliographical references, headers, footers and a better document structure, while the focus remains on "what goes into the ePUB goes into the PDF".

4. Can use the XHTML files produced by the initial processing step as further input. They may be slightly modified by hand to benefit from features, such as limited CSS Fonts Module Level 3 compatibility, that cannot be accessed using Open Office Writer alone.

While it already works and has a full French and English user's guide that samples its features, plus an installation notice, it is still home-quality software with known bugs and, even worse, documentation bugs.
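
The splitting rule described in point 2 (start each component at a given title level and keep it under a maximum size) might be sketched as follows. This is only an illustration in Python, not the script's actual Lua code: the function names, the regular-expression approach and the choice to break oversized parts after closing `</p>` tags are all assumptions. A real implementation would also have to rewrite cross-links between the resulting files.

```python
import re
from typing import List, Optional


def split_by_size(part: str, max_bytes: int) -> List[str]:
    """Greedily pack paragraph-sized blocks into chunks of at most max_bytes.

    A single block larger than max_bytes is kept whole (assumption).
    """
    # Zero-width split right after each closing </p>, so no text is lost.
    blocks = [b for b in re.split(r"(?<=</p>)", part) if b]
    chunks, current = [], ""
    for block in blocks:
        if current and len((current + block).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = block
        else:
            current += block
    if current:
        chunks.append(current)
    return chunks


def split_components(body: str, level: int = 2,
                     max_bytes: Optional[int] = None) -> List[str]:
    """Split an XHTML body so each component starts at an <hN> heading."""
    # Zero-width split just before every opening heading of the chosen level.
    pattern = rf"(?=<h{level}\b)"
    parts = [p for p in re.split(pattern, body, flags=re.IGNORECASE) if p.strip()]
    if max_bytes is None:
        return parts
    result: List[str] = []
    for part in parts:
        result.extend(split_by_size(part, max_bytes))
    return result
```

Calling `split_components(body, level=2)` gives one component per `<h2>` section (plus whatever precedes the first one); passing `max_bytes` as well applies the size limit inside each section. Zero-width splitting with `re.split` needs Python 3.7 or later.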

I now wonder whether, given the other, more commonly used ePUB and PDF creation tools, there would be enough interest from potential users in such a tool as free, open-source software to justify going to the trouble of reworking it into maintainable and extensible code. That would mean rewriting parts of it, adding comments and proofreading the documentation. Feedback and even design suggestions are thus welcome.

Thanks for reading.
Old 10-20-2011, 03:47 AM   #2
Doitsu
Wizard
Posts: 1,959
Karma: 4633966
Join Date: Dec 2010
Device: Kindle PW2
Sounds interesting. Why don't you post a beta version of the script and/or some smaller sample source files and the corresponding epub and pdf files that you created with your script as attachments?
Old 10-20-2011, 07:07 AM   #3
Toxaris
Wizard
Posts: 2,969
Karma: 3427611
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
It would be even nicer if it could handle 'filtered HTML' files from Word.
Old 10-20-2011, 09:22 AM   #4
Trouhel
Hello Doitsu,

I will certainly disappoint you, but I'm not ready to post the script yet. First of all, I found out yesterday that the splitting feature mentioned in my point 2 is broken (this explains why splitting occurs in the middle of a chapter in the attached samples, as well as the cross-linking between XHTML files), and I have no time to fix it quickly, as I have guests arriving in a few hours for a long weekend. Next, it weighs 200 kB, even stripped of comments, carriage returns and unnecessary white space, and I don't feel like turning MobileRead members into beta testers. Last, at least a good half of its documentation is required reading before you can start using it.

For now I'm more concerned with design, and I'm wondering whether, considering there are other fairly good tools to produce ePUB, it's worth investing the time to move to an end-user quality product.

Anyhow, I've attached ePUB and PDF samples from the documentation, plus an XHTML file and the CSS file produced at step 1, for easier reading. My way of testing is to open the file with ADE 1.8 (1.7 won't understand some @font-face rules whose family name is not the one in the font file) and to check the ePUB with FlightCrew. I have no idea what the result will look like on a reader, as I don't own one.

I hope this will anyhow give you a taste of what the script can do. (The samples have no images, for size reasons, but JPEG and PNG at least work; SVG is still being debugged.)

Hi, Toxaris

I don't know what a Word "filtered HTML" file is. I don't have Word and very seldom run Windows at all. The main constraint I have on input files is the CSS: it must look like the embedded CSS in OO files, which is pretty restrictive (@page, tag and tag.class rules only). It might be feasible if "filtered HTML" is close to that, and if there are no idiosyncrasies in its use of attributes within HTML tags.

The whole idea, anyhow, is to have a fully free and open-source solution to produce both ePUB and PDF. What about loading .doc files into OO first? I've never tried, so I can't tell what's lost in the process.

I hope I have at least partly answered both of you.

Regards,

Trouhel
Attached Files
File Type: zip samples.zip (382.8 KB, 125 views)
Old 10-20-2011, 09:39 AM   #5
Toxaris
I never use OO (for various reasons, not important here), so I am not sure what OO produces. However, if you want, I can convert one of the documents you use for testing to filtered HTML output from Word.
To be honest, I currently use a macro to produce clean HTML from Word. No stylesheet though, since I have a reusable stylesheet which I use for my ePUBs.
Old 10-20-2011, 12:17 PM   #6
Doitsu
I just looked at the output and it generally looks very professional. I just have some minor nitpicks. In the epub, I'd add
Code:
h2 { page-break-before: always }
to the CSS file to ensure that each new topic is displayed at the top of a page when the user clicks on a TOC entry. I also found the .css file hard to read because of the missing line breaks.
Apparently the script also has problems with code examples and mono-spaced fonts, because several long lines were not wrapped but truncated.

I also agree with Toxaris that it would be nice if the script would also be able to handle .html files created with other word processors such as MS Word.
Once you've cleaned up the code you should definitely set up a repository for your files.
Old 10-21-2011, 10:15 AM   #7
Trouhel
My apologies for the time it took me to answer your last posts. I will be pretty busy until next Wednesday and thus offline most of the time until then.

There's one point I didn't make clear enough in my first post, as I tried to be short: it's a document production tool, not a converter. That is, it does not understand just any HTML file; there are a few basic requirements that ensure PDF as well as ePUB can be compiled. The most restrictive ones are that <DIV> and the plain (attribute-less) <CODE> tags are reserved for special use, and that <DIV> tags cannot include other <DIV> tags. HTML attributes and CSS properties are also limited to a (fairly large) subset, but that does not prevent processing, as whatever is not understood is discarded. The reason for being this restrictive is the need to be able to produce PDF as well as ePUB from the same sources.
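
As an illustration of the "no <DIV> inside <DIV>" requirement, an input file could be pre-checked along these lines. This is a hypothetical Python sketch using the standard `html.parser` module, not code from the script; the class and function names are made up, and it checks only the nesting rule, not the reserved use of plain <CODE> tags.

```python
from html.parser import HTMLParser


class DivNestingChecker(HTMLParser):
    """Record the position of every <div> opened inside another <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <div> elements currently open
        self.nested_at = []   # (line, column) of each offending <div>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DIV> and <div> both match.
        if tag == "div":
            if self.depth > 0:
                self.nested_at.append(self.getpos())
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def find_nested_divs(html: str):
    """Return the positions of nested <div> tags (empty list if none)."""
    checker = DivNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.nested_at
```

An empty result means the file passes this particular check; any positions returned point at the inner <DIV> tags that would have to be flattened by hand.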

That said, as the script can re-use its XHTML output (step 1) as input, it actually accepts any HTML+CSS files that comply with its requirements: this is what I call "expert" mode, because it requires knowledge of both.
So, to answer Toxaris: if "filtered" HTML files either fit in as they are, or it proves fairly simple to add the code needed to process them, it should work. However, from what I remember of Word-produced HTML, hrefs in general (footnotes, cross-references, etc.) will probably cause problems. The other stumbling block is that, strange as it may sound, there is no way I can get access to a computer running Word.

Anyhow, I have prepared a sample of the documentation (the part about "expert" mode, which deals with the above-mentioned requirements) as Open Office generated .html and .doc files, using fonts that should be available on any Windows flavour. Toxaris, I'll be glad to send them to you, but I can't find where to attach files when sending private messages through MobileRead. Should I post them through the forum? If you can load the .docs into Word, save them as HTML, filter them and send me back the result, I could take a look. The next step would be to repeat the same process with a sample document containing most of what my script deals with (links, pictures, etc.) to get a better idea.

To answer Doitsu now. First, credit should be given to the ConTeXt team, a professional publishing company that specializes in PDF, whose tool not only produces quality documents but also includes an extended version of the Lua virtual machine that gives access to fonts by incorporating part of the FontForge code; and to the MobileRead forums, where I picked up many ideas. That said:

1) The CSS file in the ePUB is stripped of line breaks for size reasons. It is identical to the one produced at step 1, except that the latter has line breaks. I believe it was also included in the samples.

2) As far as I remember (I'll have to look back at the code to make sure), page-break attributes/properties come from the OO style, so they're up to the user. Now, there is an "auxiliary" file that describes how to render the PDF when the necessary data cannot be given via HTML, and it usually says chapters should start on a new odd page; this could easily be used to enforce a page break on h2 tags.

3) Monospaced text in the samples is either left-aligned <PRE> tags (code samples) or <SPAN> tags with a monospaced font-family property in the associated CSS style, including content that does not hyphenate. Correct me if I'm wrong, but my understanding is that non-wrapping text is then a behaviour of the reading system, which cannot be remedied via HTML.

I'll focus on the basics first: cleanup, debugging and more general design, one of my concerns being how well the ePUB files produced suit reading devices. I might get one soon (I'm waiting for Bookeen to put a price tag on its Odyssey, or for the Kobo Touch to be available here), but meanwhile I'm in the dark as to how they will render on such devices.

Your comments are in any case very useful and have already given me directions for improvement. If there is actually a way to get Word-produced files to work as well that is not too costly in terms of coding, I do agree this would be quite a "plus".

Now, I guess it's going to take a couple of months at least to get to the end-user version I have in mind, but further comments are still welcome.
Old 03-05-2012, 06:57 AM   #8
Toxaris
Any progress? I would be very interested. Perhaps it would be nice if it offered an option to build a PDF from an ePUB. You already build an ePUB from the XHTML files, so perhaps that is an option.
Old 03-05-2012, 05:11 PM   #9
Trouhel
Work in progress

Thanks for your ongoing interest despite my being very far behind schedule.
What happened is that I decided to redesign the specifications so as to accommodate any HTML file, as long as it has been converted (e.g. from .doc, .odt, ...) by Open Office. That meant dropping the requirements on HTML tagging and providing the required data through command-line options instead, which meant adding command-line options. I then added still more options and ended up with a tool badly in need of a GUI.

So, it's still a script, but it now also has a GUI module to a) define the project data (files, options), b) save that data as a "project file" and the files to a "project folder", and c) launch the script, which can still be used as a standalone command-line tool, and report its progress and completion. The HTML->XHTML->ePUB code has also been fully rewritten to facilitate later maintenance.

Add to that the fact that my phone line, and with it my access to the Internet and to much-needed documentation, was down for a month due to a storm: development thus took longer than expected. At the moment I have:

1) the GUI
2) the ePUB production process
and their interaction

(almost: it's a matter of days now to fix non-critical bugs) working on both Unix and Windows. I've built ePUB files of up to 10 MB and tested them on my new Kobo.

Next steps are:

1) thoroughly testing a) every value for the ePUB options and b) correct working of error handling and error messages
2) "plugging in" the PDF composition code (strange as it may seem, this should be the easiest part: that part of the code has a longer history, has already been tested, and should not require many changes)
3) updating the documentation, which is a time-consuming task, as English is not my native language.

So, work is well on its way, but with spring coming I'll have outdoor occupations as well, and the pace will slow a bit. As I don't want to hand out an unfinished job, and don't want to develop and maintain at the same time, it might unfortunately take, let's say (being optimistic), a couple more months.

As for building a PDF from an ePUB, here's how the tool works:

a) it converts HTML or XHTML files + internal or external CSS to clean XHTML files, on a one-to-one basis
b) it builds an ePUB from the converted XHTML files
c) it builds ConTeXt source code from the same XHTML files
d) it calls ConTeXt (TeX-based, like LaTeX) to compose them into a PDF file

Steps b), c) and d) are "on demand"; the converted XHTML + CSS + required graphics and fonts from step a) can be either archived (and thus opened with a browser) or deleted, as can the ConTeXt files from step c).
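
For readers curious about step b), the final packaging of an ePUB comes down to writing a zip archive whose first entry is an uncompressed `mimetype` file. The sketch below shows that in Python with the standard `zipfile` module; it is not the script's own code, the `build_epub` name and the `files` mapping are made up for illustration, and the caller is assumed to supply a valid `META-INF/container.xml`, package file and content documents.

```python
import zipfile
from typing import Mapping


def build_epub(epub_path: str, files: Mapping[str, bytes]) -> None:
    """Write an ePUB container from a mapping of archive path -> content.

    Only the packaging is handled here; the mapping must already contain
    META-INF/container.xml, the package (OPF) file and the XHTML/CSS/images.
    """
    with zipfile.ZipFile(epub_path, "w") as epub:
        # The mimetype entry must come first and must be stored uncompressed.
        epub.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else may be compressed.
        for name, data in files.items():
            epub.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

A checker such as FlightCrew would still be needed to validate the contents; this only guarantees the container layout.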

There are almost no requirements on HTML tags in the input (no <DIV> unless <BODY> is its parent, and that's all as far as I remember), but CSS is restricted to tag {...} and tag.class {...} rules plus @page and @font-face rules. "th p" and "td p" rules are also understood, but that's all. This is what you find in the internal CSS of Open Office Writer HTML files. There are also non-blocking requirements (i.e., what is not understood is discarded) on attributes/properties, which come from the need to translate them to PDF layout equivalents.
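
The CSS restriction can be pictured as a filter that keeps only the rule shapes listed above and silently discards the rest. The following Python sketch is an assumption about how such a filter might look, not the script's implementation: the names are hypothetical, the regular expressions are deliberately simple, nested braces are not handled, and grouped selectors such as "h1, h2" are simply rejected.

```python
import re

# tag or tag.class, e.g. "p" or "p.note"
_SIMPLE = re.compile(r"^[A-Za-z][A-Za-z0-9]*(\.[A-Za-z_][\w-]*)?$")
# @page and @font-face at-rules (an @page pseudo-class like ":first" is allowed)
_AT_RULE = re.compile(r"^@(page|font-face)(?![\w-])", re.IGNORECASE)
# the only two descendant selectors mentioned as understood
_DESCENDANT = {("th", "p"), ("td", "p")}


def is_supported(selector: str) -> bool:
    """Return True if the selector has one of the accepted shapes."""
    sel = " ".join(selector.split())  # normalise whitespace
    if _AT_RULE.match(sel):
        return True
    parts = sel.split(" ")
    if len(parts) == 2:
        return (parts[0].lower(), parts[1].lower()) in _DESCENDANT
    return len(parts) == 1 and bool(_SIMPLE.match(sel))


def filter_rules(css: str) -> list:
    """Keep supported rules; unsupported ones are discarded, not reported."""
    kept = []
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        if is_supported(selector):
            kept.append(f"{selector.strip()} {{{body.strip()}}}")
    return kept
```

Discarding rather than failing mirrors the "non-blocking" behaviour described in the post: an unsupported rule costs some styling but never stops the build.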

So, basically, unzipping the ePUB file, using its XHTML, CSS, JPEG and PNG files as input and doing nothing but composing a PDF document will work if the above requirements are met. ePUB to PDF "out of the box" could of course be added (mostly a matter of extracting files to a temporary folder), but I'd rather wait until the tool 1) is ready to be "released", 2) is found easy enough to use, and 3) I find out what use it is put to. The basic idea is of course to provide a free self-publishing tool which requires only knowledge of either HTML or a word processor and produces both ePUB and PDF of reasonable "quality", but software often ends up being used for purposes other than originally intended.

Please be patient. When the tool is "out", it'll be easier, and I'll be more than willing to discuss and work on the design of existing and new features, options, etc. In the meantime, I'll report on progress from time to time.
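
The "extract to a temporary folder" step mentioned here is small enough to sketch. The Python below is only an illustration under assumptions: the function name, the temporary-folder prefix and the exact list of extensions are made up, and the extracted files would still have to be handed to the tool as ordinary input.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List, Optional

# File types named in the post as usable input (extension list is an assumption).
WANTED = {".xhtml", ".css", ".jpg", ".jpeg", ".png"}


def extract_epub_sources(epub_path: str, dest_dir: Optional[str] = None) -> List[Path]:
    """Extract XHTML, CSS and image files from an ePUB; return their paths."""
    dest = Path(dest_dir or tempfile.mkdtemp(prefix="epub2pdf-"))
    extracted = []
    with zipfile.ZipFile(epub_path) as epub:
        for info in epub.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() in WANTED:
                # ZipFile.extract keeps the archive's folder layout under dest.
                extracted.append(Path(epub.extract(info, dest)))
    return extracted
```

Keeping the archive's folder layout matters because the XHTML files refer to the style sheet and images by relative path.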
Old 03-28-2012, 10:39 AM   #10
Trouhel
As I promised to let you know where I'm at, here is what I've now completed:

1. the GUI module (calls the command-line tool)
2. conversion to XHTML and assembly into ePUB
3. a custom installer for Windows
4. the French version of the User's Guide
5. an automated post-install test procedure (rebuilds the documentation)

The TODO list still has:

1. PDF composition
2. the English User's Guide

so the next post might well be the "release" one.
Old 06-15-2012, 02:04 PM   #11
Trouhel
It took longer than expected as I lacked time, but it's finally uploaded to GitHub at:

https://github.com/fhaby/whatever
Old 06-15-2012, 02:10 PM   #12
JSWolf
Resident Curmudgeon
Posts: 37,284
Karma: 18156082
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad, nook STR
You seriously need to find a different site to host your files; the site will not allow me to download the Windows exe.
Old 06-15-2012, 02:38 PM   #13
Trouhel
I cloned the repository (on a Unix machine) a few minutes ago, copied the .exe to Windows and tested it to make sure it had been uploaded correctly. I've never tried it, but there seems to be a Windows git tool at http://msysgit.github.com/. I'm going to give it a try and will let you know in a few minutes whether it worked.
Old 06-15-2012, 03:23 PM   #14
PeterT
Taking a break; Fed up
Posts: 6,842
Karma: 43933696
Join Date: Nov 2007
Location: Toronto
Device: Wife: Touch, Arc, Vox Me: Nexus 7, Glo
But the last thing I (and most other users) want to do is to install a GIT client to download the executable.....
Old 06-15-2012, 03:53 PM   #15
Trouhel
It works. I've downloaded Git-1.7.10-preview20120409.exe from Google Code, installed it with the default setup options, and started it successfully. It opens a bash shell window where you only have to type the following command:

git clone https://github.com/fhaby/whatever

It will download a copy of the repository. You'll need to read the "Installation" chapter in the documentation included in the "whatever-docs.zip" file before running the Windows exe (it will complain about Lua not being installed if you run it straight out-of-the-box). I've updated the README file to warn about that.

Actually, there are a few open-source tools (Lua for Windows, wxLua, TeXLua or ConTeXt, and Info-Zip) that are required, depending on whether you intend to build ePUB or PDF, so you should know about that before installing.

As for GitHub, it is a deliberate choice over, say, SourceForge, about whose copyright policy I have questions. I don't intend to move elsewhere for the time being.
All times are GMT -4. The time now is 02:54 PM.


MobileRead.com is a privately owned, operated and funded community.