ePUB + PDF creation script


Trouhel
10-18-2011, 05:16 PM
Hello, MobileRead members,

This is my first post here. For my own personal use, I wrote a (Lua) script that:

1. Converts, on a one-to-one basis, Open Office Writer "saved as HTML" documents to "clean" XHTML files plus a single shared CSS 2.1 style sheet. While new tags and styles are computed in the process, the emphasis is on conformity to the original documents' layout. Most Open Office Writer features, such as "simple" tables, figures, footnotes and cross-references, are handled.

2. Builds ePUB electronic books from the resulting XHTML files. XHTML components in the ePUB file may be compiled on a one-to-one basis, or so as to always begin at the same title level and/or not exceed a user-defined maximum size, while footnotes may be gathered and later flushed at the end of same-level sections (parts, chapters, etc.) or at the end of the whole document.

3. Builds TeX-quality PDF documents from the same XHTML files, using ConTeXt's macro-instruction set and formatting engine. This printable version of the document brings in enhancements such as a table of contents, an index, a bibliography and bibliographical references, headers, footers and a better document structure, while the focus is on "what goes into the ePUB goes into the PDF".

4. Has the ability to use XHTML files produced after the initial processing step as further input. They may be slightly modified by hand to benefit from features, such as limited CSS Font Module Level 3 compatibility, which cannot be accessed using Open Office Writer exclusively.

While it already works and has a full French and English user's guide that samples its features, and an installation notice, it is still home-quality software, has known bugs and, even worse, documentation bugs.
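
The component splitting described in point 2 could be sketched roughly as follows. This is only an illustration in Python, not the script's Lua code: the function name, the (heading level, markup) input format and the parameter names are all assumptions made for the example.

```python
def split_components(blocks, split_level=1, max_bytes=None):
    """Group (heading_level, markup) blocks into ePUB components.

    heading_level is an int for headings (1 for h1, 2 for h2, ...) and
    None for ordinary content. A new component starts at every heading
    whose level is <= split_level, or when adding a block would push the
    current component past max_bytes (if a maximum is given).
    All names here are hypothetical, chosen for this sketch only.
    """
    components, current, size = [], [], 0
    for level, markup in blocks:
        block_size = len(markup.encode("utf-8"))
        starts_section = level is not None and level <= split_level
        too_big = max_bytes is not None and size + block_size > max_bytes
        # Never emit an empty component: only close one that has content.
        if len(current) > 0 and (starts_section or too_big):
            components.append(current)
            current, size = [], 0
        current.append(markup)
        size += block_size
    if current:
        components.append(current)
    return components
```

With split_level=1 and no size limit, every h1 opens a new component; adding max_bytes also cuts a component in the middle of a chapter once it grows too large, which is the trade-off the "maximum size" option implies.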

I now wonder whether, amongst other more commonly used ePUB and PDF creation tools, there would be enough interest from potential users in such a tool, as open-source and free software, to go to the trouble of reworking it into maintainable and extendable code, which implies some work to rewrite parts of it, add comments and proofread the documentation. Feedback and even design suggestions are thus welcome.

Thanks for reading.

Doitsu
10-20-2011, 03:47 AM
Sounds interesting. Why don't you post a beta version of the script and/or some smaller sample source files and the corresponding epub and pdf files that you created with your script as attachments?

Toxaris
10-20-2011, 07:07 AM
It would be even nicer if it could handle 'filtered HTML' files from Word.

Trouhel
10-20-2011, 09:22 AM
Hello Doitsu,

I will certainly be disappointing, but I'm not ready to post the script yet. First of all, I found out yesterday that the splitting feature mentioned in my point 2 is broken (this explains why splitting occurs in the middle of a chapter in the attached samples, and cross-linking between XHTML files as well), and I have no time to fix it quickly as I have guests coming home for a prolonged weekend in a few hours. Next, it weighs 200 kB, even stripped of comments, carriage returns and unnecessary white space, and I don't feel like turning MobileRead members into beta testers. Last, at least a good half of its documentation is required reading before starting to use it.

For now I'm more into design, and wondering whether, considering there are other fairly good tools to produce ePUB, it's worth the trouble of investing time in moving to an end-user quality product.

Anyhow, I've attached ePUB and PDF samples from the documentation, plus an XHTML file and the CSS file produced at step 1, for easier reading. My way of testing is opening the file with ADE 1.8 (1.7 won't understand some @font-face rules whose family name is not the one in the font file) and checking the ePUB with FlightCrew. I have no idea what the result is going to be on a reader, as I don't own one.

I hope this will anyhow give you a taste of what the script can do (the samples have no images for size considerations, but JPEG and PNG at least work; SVG is in the process of being debugged).

Hi, Toxaris

I don't know what a Word "filtered HTML" file is. I don't have Word and very seldom run Windows at all. The main constraint I have on input files is the CSS: it must look like the embedded CSS in OO files, which is pretty restrictive (@page, tag and tag.class rules only). It might be feasible if "filtered HTML" is close to that, and if there are no idiosyncrasies in its use of attributes within HTML tags.

The whole idea, anyhow, is to have a fully free and open-source solution to produce both ePUB and PDF. What about loading .doc files into OO first? I never tried, so I can't tell what's lost in the process.

I hope I have at least partly answered both of you.

Regards,

Trouhel

Toxaris
10-20-2011, 09:39 AM
I never use OO (various reasons, not important here), so I am not sure what OO produces. However, if you want, I can convert some of the documents you use for testing to filtered HTML output from Word.
To be honest, I currently use a macro to produce clean HTML from Word. No stylesheet though, since I have a reusable stylesheet which I use for my ePUBs.

Doitsu
10-20-2011, 12:17 PM
I just looked at the output and it generally looks very professional. I just have some minor nitpicks. In the epub, I'd add h2 { page-break-before: always } to the css file to ensure that each new topic is displayed at the beginning when the user clicks on a TOC entry. I also found the .css file hard to read because of the missing line-breaks.
Apparently the script also has problems with code examples and mono-spaced fonts, because several long lines were not wrapped but truncated.

I also agree with Toxaris that it would be nice if the script would also be able to handle .html files created with other word processors such as MS Word.
Once you've cleaned up the code you should definitely set up a repository for your files.

Trouhel
10-21-2011, 10:15 AM
My apologies for the time it took me to answer your last posts. I will be pretty busy until next Wednesday and thus offline most of the time until then.

There's one point I didn't make clear enough in my first post, as I tried to be short: it's a document production tool, not a converter. That is, it does not understand just any HTML file; there are a few basic requirements that ensure PDF as well as ePUB can be compiled. The most restrictive ones are that <DIV> and the plain (without attributes) <CODE> tags are reserved for special use, and that <DIV> tags cannot include other <DIV> tags. HTML attributes and CSS properties are also limited to a (fairly large) subset, but that does not prevent processing, as what is not understood there is discarded. The reason for being this restrictive is the need to be able to produce PDF as well as ePUB from the same sources.

This said, as the script can re-use its XHTML output (step 1) as input, it actually accepts any HTML+CSS files that comply with its requirements: this is what I call "expert" mode, because it requires knowledge of both.
So, to answer Toxaris: if "filtered" HTML files either fit in, or it proves fairly simple to add the code needed to process them to the script, it should work. However, from what I remember of Word-produced HTML, hrefs in general (footnotes, cross-references, etc.) will probably mean problems. The other stumbling block is that, as strange as it may sound, there is no way I can have access to a computer running Word.

Anyhow, I have prepared a sample of the documentation (the part about "expert" mode, which deals with the above-mentioned requirements) as Open Office generated .html and .doc files, using fonts that should be available on any Windows flavour. Toxaris, I'll be glad to send them to you, but I can't find where to attach files when sending private messages through MobileRead. Should I post them through the forum? If you can load the .doc files into Word, save them as HTML, filter them and send me back the result, I could take a look. The next step would be to repeat the same process on a sample document with most of everything my script deals with (links, pictures, etc.) to get a better idea.

To answer Doitsu now. First, credit should be given to the ConTeXt team, a professional publishing company that specializes in PDF, whose tool not only allows for producing quality documents but also includes an extended version of the Lua virtual machine that gives access to fonts by incorporating part of the FontForge code; and to the MobileRead forums, where I picked up many ideas. As for your remarks:

1) The CSS file in the ePUB is stripped of line breaks for size considerations. It is identical to the one produced at step 1, except that that one has line breaks. I believe it was also included with the samples.

2) As far as I remember (I'll have to look back at the code to make sure), page-break attributes/properties come from the OO style, so they're up to the user. Now, as there is an "auxiliary" file that describes how to render the PDF when the necessary data cannot be given via HTML, and it usually says chapters should start on a new odd page, this could easily be used to enforce a page break on h2 tags.

3) Monospaced text in the samples is either left-aligned <PRE> tags (code samples) or <SPAN> tags with a monospaced font-family property in the associated CSS style, including stuff that does not hyphenate. Correct me if I'm wrong, but my understanding is that non-wrapping text is then a behaviour of the reading system, which cannot be worked around via HTML.
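
Regarding point 1), for anyone who wants to read the line-break-free CSS taken from the ePUB, line breaks can be put back with a few lines of code. The following is a minimal sketch in Python (not part of the script, and the function name is made up for the example); it assumes flat rules only, with no braces or semicolons inside strings or url(...) values.

```python
import re


def add_css_line_breaks(css):
    """Re-insert line breaks into a CSS string that has none.

    Naive on purpose: handles flat "selector { declarations }" rules
    only, and assumes no braces or semicolons occur inside strings or
    url(...) values.
    """
    out = []
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        declarations = [d.strip() for d in body.split(";") if d.strip()]
        out.append(selector.strip() + " {")
        out.extend("    " + d + ";" for d in declarations)
        out.append("}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(add_css_line_breaks("h2{page-break-before:always}p.note{margin:0;color:red}"))
```

Each declaration ends up on its own indented line, which is usually enough to make a generated style sheet readable again.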

I'll focus more on the basics at first: cleaning up, debugging, and more general design, one of my concerns being the relevance of the ePUB files produced to reading devices. I might get one soon (I'm waiting for Bookeen to give a price tag to its Odyssey, or for the Kobo Touch to be available here), but meanwhile I'm in the dark as to how they will render on such devices.

Your comments are anyhow very useful and have already given me directions for improvements. If there is actually a way, not too costly as far as coding is concerned, to get Word-produced files to work as well, I do agree this would be quite a "plus".

Now, I guess it's going to take a couple of months at least to get an end-user version as I intend it, but further comments are still welcome.

Toxaris
03-05-2012, 06:57 AM
Any progress? I would be very interested. Perhaps it would be nice if it gave an option to build a PDF from an ePUB. You already build an ePUB from the XHTML files, so perhaps that is an option.

Trouhel
03-05-2012, 05:11 PM
Thanks for your ongoing interest despite my being very late on my schedule.
What happened is that I decided to redesign the specifications so as to accommodate any HTML file, as long as it has been converted (i.e. from .doc, .odt, ...) by Open Office. That meant dropping the requirements on HTML tagging and providing the required data through command line options, which meant adding command line options. I then added still more options and ended up with a tool badly in need of a GUI.

So, it's still a script, but it also has a GUI module to a) define the project data (files, options), b) save that data as a "project file" and the files to a "project folder", c) launch the script, which can still be used as a standalone command line tool, and report its progress and completion. The HTML->XHTML->ePUB code has also been fully rewritten to facilitate later maintenance.

Add the fact that my phone line, and with it access to the Internet and to much-needed documentation, was down for a month due to a storm: development thus took longer than expected. At the moment I have:

1) the GUI
2) the ePUB production process

and their interaction working (almost: it's a matter of days now to fix non-critical bugs) on both Unix and Windows. I've built ePUB files of up to 10 Mbytes and tested them on my new Kobo.

The next steps are:

1) thoroughly testing a) every value of the ePUB options, b) that error handling and error messages work correctly
2) "plugging in" the PDF composition code (strange as it may seem, this should be the easiest part: that part of the code has a longer history, has already been tested, and should not require many changes)
3) updating the documentation, which is a time-consuming task, as English is not my native language.

So, work is well on its way, but with spring coming I'll have outdoor occupations as well and its pace will slow a bit. As I don't want to hand out an unfinished job, and don't want to develop and maintain at the same time, it unfortunately might take, let's say (being optimistic), a couple more months.

As for building a PDF from ePUB, here's how the tool works:

a) it converts HTML or XHTML files + internal or external CSS to clean XHTML files, on a one-to-one basis
b) it builds the ePUB from the converted XHTML files
c) it builds ConTeXt source code from the same XHTML files
d) it calls ConTeXt (TeX-based, like LaTeX) to compose them into a PDF file

Steps b), c) and d) are "on demand"; the converted XHTML + CSS + required graphics and fonts from step a) can be either archived (and thus opened with a browser) or deleted, as can the ConTeXt files from step c).

There are almost no requirements on HTML tags in the input (no <DIV> unless <BODY> is its parent, and that's all as far as I remember), but CSS is restricted to tag {...} and tag.class {...} rules, plus @page and @font-face rules. "th p" and "td p" rules are also understood, but that's all. This is what you find in the internal CSS of Open Office Writer HTML files. There are also non-blocking requirements (i.e., what is not understood is discarded) on attributes/properties, which come from the need to translate them to PDF layout equivalents.
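
To check beforehand whether a style sheet stays within the rule forms listed above, something like the following Python sketch could be used. It is not the tool's own parser: the function name and the regular expression are assumptions, it treats comma-separated selector groups by checking each member on its own, and it only looks at flat rules.

```python
import re

# Rule forms named in the post: tag, tag.class, "th p", "td p",
# @page and @font-face. Anything else is reported.
ACCEPTED = re.compile(
    r"@page|@font-face|t[hd]\s+p|[a-z][a-z0-9]*(?:\.[A-Za-z_][\w-]*)?",
    re.IGNORECASE,
)


def unsupported_selectors(css):
    """Return the selectors in css that fall outside the accepted subset.

    Hypothetical helper: assumes flat rules and splits comma-separated
    selector groups, checking each selector separately.
    """
    bad = []
    for selector_group in re.findall(r"([^{}]+)\{[^{}]*\}", css):
        for selector in selector_group.split(","):
            selector = " ".join(selector.split())  # normalise white space
            if selector and not ACCEPTED.fullmatch(selector):
                bad.append(selector)
    return bad
```

An empty result means every rule uses one of the accepted forms; it says nothing about whether the properties inside the rules are understood, since those are simply discarded when they are not.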

So, basically, unzipping the ePUB file, using its XHTML, CSS, JPEG and PNG files as input and doing nothing but composing a PDF document will work if the above requirements are met. ePUB to PDF "out-of-the-box" could of course be added (mostly a matter of extracting files to a temporary folder), but I'd rather wait until the tool 1) is ready to be "released", 2) is found easy enough to use, and 3) I find out to what use it is put. The basic idea is of course to provide a free self-publishing tool which requires nothing but knowledge of either HTML or a word processor and produces both ePUB and PDF of reasonable "quality", but software often ends up being used for other purposes than originally intended.
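
The "extracting files to a temporary folder" step could look like the sketch below, written in Python with the standard zipfile module (an ePUB is a ZIP archive). The function name and the list of extensions are assumptions for the example; feeding the extracted files to the tool is left out.

```python
import tempfile
import zipfile
from pathlib import Path

# File types the post names as usable input (extensions are an assumption).
WANTED = {".xhtml", ".html", ".css", ".jpg", ".jpeg", ".png"}


def extract_epub_sources(epub_path, dest=None):
    """Extract the XHTML, CSS, JPEG and PNG members of an ePUB.

    Files go to dest, or to a fresh temporary folder when dest is not
    given. Returns the paths of the extracted files.
    """
    dest = Path(dest) if dest else Path(tempfile.mkdtemp(prefix="epub-src-"))
    extracted = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if Path(name).suffix.lower() in WANTED:
                # extract() returns the path it actually wrote to.
                extracted.append(Path(archive.extract(name, dest)))
    return extracted
```

The folder structure inside the ePUB is preserved, so relative links between the XHTML, CSS and image files keep working after extraction.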

Please be patient. When the tool is "out", it'll be easier and I'll be more than willing to discuss and work on design of existing and new features, options, etc. In the meanwhile, I'll give notice on work progress from time to time.

Trouhel
03-28-2012, 10:39 AM
As I said I would let you know where I'm at: I have now completed:

1. GUI module (calls the command line tool)
2. conversion to XHTML and assembly into ePUB
3. a custom installer for Windows
4. the French version of the User's Guide
5. an automated post-install test procedure (rebuilds the documentation)

The TODO list still has:

1. PDF composition
2. English User's Guide

so that next post might well be the "release" one.

Trouhel
06-15-2012, 02:04 PM
It took longer than expected as I lacked time, but it's finally uploaded to GitHub at:

https://github.com/fhaby/whatever

JSWolf
06-15-2012, 02:10 PM
You seriously need to find a different site to host your files, the site will not allow me to download the Windows exe.

Trouhel
06-15-2012, 02:38 PM
I did clone the repository (on a Unix machine) a few minutes ago, copied the .exe to Windows and tested it to make sure it was correctly uploaded. I never tried it, but there seems to be a Windows git tool at http://msysgit.github.com/. I'm going to give it a try and let you know in a few minutes if it worked.

PeterT
06-15-2012, 03:23 PM
But the last thing I (and most other users) want to do is to install a GIT client to download the executable.....

Trouhel
06-15-2012, 03:53 PM
Works. I've downloaded Git-1.7.10-preview20120409.exe from Google Code, installed it with the default setup options, and started it successfully. It opens a bash shell window where you have but to type the following command:

git clone https://github.com/fhaby/whatever

It will download a copy of the repository. You'll need to read the "Installation" chapter in the documentation included in the "whatever-docs.zip" file before running the Windows exe (it will complain about Lua not being installed if you run it straight out-of-the-box). I've updated the README file to warn about that.

Actually, there are a few open source tools (Lua for Windows, wxLua, TeXLua or ConTeXt, and Info-Zip) that are required depending on whether you intend to build ePUB or PDF, so you should know about that before installing.

As for GitHub, this is a deliberate choice over, say, SourceForge, whose copyright policy I have questions about. I don't intend to move somewhere else for the time being.

PeterT
06-15-2012, 04:48 PM
One "trick" that did work... at the top of the https://github.com/fhaby/whatever page there is a "ZIP" option; this will download the repository as a ZIP file, so you get not only the Windows EXE file but the other 3 files as well....

Trouhel
06-15-2012, 05:23 PM
Thanks for trying.
I then took a closer look: there's actually also a "Downloads" button, which will allow files to be downloaded as soon as I've uploaded the individual files there (it takes some time).

Trouhel
06-15-2012, 06:14 PM
Upload completed.
Individual files are now available at: https://github.com/fhaby/whatever/downloads
Make sure to download the docs archive and read at least the Installation chapter before anything else.

roger64
06-16-2012, 12:11 AM
An impressive work. Congratulations!

I will try to test it if I manage to install needed dependencies (Debian).

Trouhel
06-18-2012, 02:19 PM
Uploaded a new version that:

1) Adds a short on-line help to the installer,
2) Corrects bugs that occur when the version of Open Office:

a) uses TD instead of TH in THEAD
b) uses P styles for cells instead of TH P, TD P

(both seem to be the case on Windows).

If anyone tries (or even succeeds in) using it, let me know: I won't bother keeping the repository up if it stays inactive for too long.

Toxaris
06-19-2012, 02:34 AM
Will check it out somewhere this week.

Trouhel
06-19-2012, 04:01 PM
Did you download today?

I actually found out that the last uploaded version is broken when it comes to tables (it doesn't comply with XHTML Strict any more in some cases). I fixed it but had no time to upload yet (I'll do that by tomorrow night and post when done).

I'm thinking of adding an update feature to the installer, so that an update would simply mean downloading either the zipped sources or an update zip file and launching the installer again, instead of downloading the whole 17.5M installer.

Toxaris
06-20-2012, 06:47 AM
I will download it as soon as I have time to test. I did scan through the manual, and the list of required programs is quite long.
Anyway, it will probably not be before Friday.

Trouhel
06-21-2012, 09:17 AM
Well, that's one of my main concerns: that it proves too complex and keeps people from trying. Under Windows at least, the installer should ease things a little bit.

The new version has been uploaded (fixes the tables bug).

It introduces the following changes :

a) Windows self-extracting archive is replaced by an installer/updater.
The sources and documentation zipped files must be downloaded separately.
This reduces bandwidth consumption for both uploads and upgrades.

b) Support information has been modified.
The previous one was a bad choice and proved too spam friendly.

Toxaris
06-25-2012, 03:41 AM
I tried to install it this weekend, but the installation procedure is too complex. Instead of warning you that you need to install certain applications/tools, the installer just gives an error and stops. You then need to figure out what you need additionally.
Also, I need to install the complete Lua. Isn't there a runtime or something instead of the whole suite?

Trouhel
06-25-2012, 08:53 AM
a) Yes. I figured out that bug this weekend. Actually the installer stops instead of telling you about missing software, only because of a wrongly formatted error message. Fixed, but not uploaded yet.

b) Lua for Windows is a runtime which adds the needed libraries to the basic virtual machine: the only other way to get them is to compile them yourself. Everything needed is copied by the installer, so you can uninstall Lua for Windows afterwards (Lua itself is not needed).

c) As for other software, you need:

- wxLua and the ConTeXt installer in any case
- Info-Zip if you checked the ePUB button

So that's three more zip files you need to have in your "Downloads" folder, which may also be deleted once the install is complete.

Using the tool is not going to be a "run as soon as possible" matter either, but the learning curve is really nothing compared to becoming acquainted with ConTeXt (or LaTeX for that matter).

There will also definitely be bugs showing up: that's hardly avoidable in more than 25,000 lines of software.

If the PDF button is checked, the installer will download and install ConTeXt for you, which is about 200 MB: obviously this will take some time.

This is actually the easiest way I could figure out to get the necessary software on Windows.

To sum up (it's detailed in the doc) :

a) have :

LuaForWindows_v5.1.4-45.exe
whatever-win32.exe
context-setup-mswin.zip
wxLua-2.8.10-MSW-bin.zip
zip300xn.zip (if ePUB selected)
whatever-full.zip
whatever-docs.zip (optional, but recommended as this will allow testing the installation)

in your %HOMEDRIVE%%HOMEPATH%\Downloads folder (all together that is somewhere around 50 MB)

b) install Lua for Windows where it proposes to (this is either C:\Program Files (x86)\Lua\5.1 or C:\Program Files\Lua\5.1)
c) run the installer, it should now work.
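
To verify step a) before launching the installer, a small check such as the following Python sketch can list what is still missing. The file names are the ones given above; the function names and the split into "always required" and "ePUB only" lists are assumptions for the example, and whatever-docs.zip is left out because it is optional.

```python
import os
from pathlib import Path

# File names as listed in the post; grouping is an assumption.
REQUIRED = [
    "LuaForWindows_v5.1.4-45.exe",
    "whatever-win32.exe",
    "context-setup-mswin.zip",
    "wxLua-2.8.10-MSW-bin.zip",
    "whatever-full.zip",
]
EPUB_ONLY = ["zip300xn.zip"]


def default_downloads_folder():
    """%HOMEDRIVE%%HOMEPATH%\\Downloads, falling back to ~/Downloads."""
    home = os.environ.get("HOMEDRIVE", "") + os.environ.get("HOMEPATH", "")
    return (Path(home) if home else Path.home()) / "Downloads"


def missing_downloads(folder, want_epub=True):
    """Return the names of the needed files not found in folder."""
    folder = Path(folder)
    needed = REQUIRED + (EPUB_ONLY if want_epub else [])
    return [name for name in needed if not (folder / name).is_file()]


if __name__ == "__main__":
    for name in missing_downloads(default_downloads_folder()):
        print("missing:", name)
```

If nothing is printed, every file the installer expects is in place.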

Hitch
06-29-2012, 09:13 PM
Tox:

If you try this, will you let me know?

Hitch

Toxaris
06-30-2012, 02:48 AM
I did it once, but then I stopped. I haven't tried for the second time yet. Once I get around to it, I will let you all know.

Hitch
06-30-2012, 04:47 AM
Thx. As you know, I'm always interested in new tools--but the brain-damage of installing this if it isn't the "bee's knees" makes me want to get input from someone I trust, first.

Hitch

Trouhel
07-06-2012, 04:23 PM
Uploaded a new version with a Windows installer that downloads and installs all third-party software. The absolute prerequisite is to have C:\Windows\System32\msvcrt.dll present on your computer (this is the basic Visual C++ runtime shared library, which many other programs use as well): despite it being called "redistributable", its license actually forbids redistributing it.

Getting that to work has been very time consuming, for what amounts, IMHO, to almost nothing. I've conducted enough tests to conclude that:

a) Installing, running the test script as described in the documentation, and rebuilding the documentation using the GUI works on both Windows and Unix. It takes a quarter of an hour, file downloads included, and nothing except the GUI config files, which have to be in your home folder, is ever installed or modified outside the installation folder.

b) On non-Windows OSes, probably including Mac OS X, installing dependencies from sources when binaries/packages are not available can be a drag, especially when it comes to wxLua (everything else should be almost effortless), unless you are able to tackle "configure" scripts and makefiles (I actually had to, so, given I do not even run Linux, I assume it is feasible in most cases, but I unfortunately cannot be of any help there).

c) Building ePUB and PDF from .odt, or from .(x)html compliant with the specs, works, at least in most cases, and most certainly can be made to work in any case: the documentation itself being as complex as can be, it will remain, for the time being, the unit of measurement in that matter.

d) .doc to .odt to .html using Open Office must be handled with care: it will more often than not lose or damage formatting information. That may be either harmless or break the formatting process.

This is more than enough as far as I am concerned, so I'll leave it at that for a while and give myself a break.

"provided 'as is', without warranty of any kind", as the saying goes.

Trouhel
07-28-2012, 09:02 AM
in case someone someday is interested enough to give it a try

new version uploaded that fixes previously mentioned issues:

1) works out-of-the box with MS-Word ("how-to" as sample included in the documentation)
2) provides a FreeBSD package to install the script and all dependencies via the ports collection (fairly easier than it looked at first glance). Provides every hint needed to port to any wxLua/ConTeXt compatible OS

plus (from the README file):

* support for MS Word floated graphics (and a new "sample" annex to the documentation on how to use the tool with MS Word)
* partial support for:
  * Complex Text Layout scripts (works with ConTeXt; in ePUB, relies on the reading system's ability to use "Open Type" lookups: not the case with any ADE-based ones)
  * on-demand "Open Type" lookup tables (such as "Old Style" numbers, ligatures, etc.) for other scripts (PDF only)
  * Right-to-Left scripts in paragraphs (mixed bi-directional text layout still relies on the rendering engine, i.e. it is wrong in the ConTeXt-generated PDF if directionality (i.e. "lang") is not explicitly and carefully set in the source documents, which is not possible when using Microsoft Word and/or Open Office Writer)

roadmap for the next version(s) (if any):

a) user control over microtypography via GUI

b) integration of the ICU BiDi and Layout Engine, if technically feasible; in any case it will be provided separately as (C++) sources only, if it works. It will give access to a) correct ordering of glyphs, b) the true set of glyphs used (= far more efficient font subsetting) for a few RTL and CTL scripts. It won't clear every issue (character positioning) with every CTL script anyhow: some cannot be handled except inside the rendering engines. It will, to the best of my knowledge, do the job for Arabic scripts, not Devanagari, and maybe other Indic scripts

c) Arabic as a main language (i.e. full Arabic projects, including titles, ToC, etc. + Latin as far as ConTeXt permits)

d) try to find a way to do Devanagari (resorting to ConTeXt MkII/XeTeX instead of ConTeXt MkIV/LuaTeX, while poorer when it comes to microtypography, may be one)