Prince XML for creating mobile reader-sized PDFs? - Page 2

frabjous · 09-14-2009, 06:12 PM

Quote:

Originally Posted by Jellby

This is one of the things that need some work, at the moment I process the spine with sed scripts, which rely on "correct" newlines (that's why I needed dos2unix in some cases). Ideally, the .opf file should be processed with some XML tool, do you know any?

I'm not at all experienced in any such things (I just like to pretend that I am), but maybe something like XML starlet? Of course, perl or python probably have libraries for it.

I was going to say I didn't think there was anything wrong with using sed though... but upon further reflection, I realized it is rather dangerous if some of the entries in the .opf or .ncx file have linebreaks in the middle of a tag or element.

E.g., to test this, I made an epub with an .opf that had a part that looked like this:

Code:

<item href="titlepage.xhtml" 
id="titlepage" media-type="application/xhtml+xml"/>
<item href="test.html" id="html" 
media-type="application/xhtml+xml"/>

rather than this:

Code:

<item href="titlepage.xhtml" id="titlepage" media-type="application/xhtml+xml"/>
<item href="test.html" id="html" media-type="application/xhtml+xml"/>

Running your script generated errors such as:

Code:

prince: ./:1: error: Document is empty
prince: ./:1: error: Start tag expected, '<' not found
prince: ./: error: could not load input file

I don't think a well-made .opf would look like that, however, and FWIW, ADE can choke on stuff like this too.
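As a rough illustration of why an XML tool is safer than sed here, the sketch below parses the same line-broken entries with Python's standard xml.etree.ElementTree. The wrapping <manifest> element and the function name manifest_hrefs are assumptions made for the example, and a real .opf declares an XML namespace that this fragment leaves out.

```python
import xml.etree.ElementTree as ET

# The line-broken <item> entries from the test above, wrapped in a bare
# <manifest> element (no namespace, unlike a real .opf) so they parse alone.
MANIFEST = """<manifest>
<item href="titlepage.xhtml" 
id="titlepage" media-type="application/xhtml+xml"/>
<item href="test.html" id="html" 
media-type="application/xhtml+xml"/>
</manifest>"""


def manifest_hrefs(xml_text):
    """Return {id: href} for every <item>, wherever the newlines fall."""
    root = ET.fromstring(xml_text)
    return {item.get("id"): item.get("href") for item in root.iter("item")}


hrefs = manifest_hrefs(MANIFEST)
print(hrefs)
```

Since the parser works on elements and attributes rather than on lines, it returns the same mapping for both layouts shown above.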

Quote:

The standard .epub settings (not those in the "special" pdf-style file) can be overridden by adding !important to the default.css file, at least according to the documentation. I could add another option to specify highest-priority rules (it would be just adding another .css after the book-specific one in the prince command-line).

Playing around with this, it sort of works. E.g., I only change your default.css to make:

Code:

body {
  font-size: 9.9pt;
  font-family: serif; 
  text-align: justify;
  prince-image-resolution: 166dpi;
  hyphens: auto;
}

into:

Code:

body {
  font-size: 9.9pt;
  font-family: serif !important; 
  text-align: justify;
  prince-image-resolution: 166dpi;
  hyphens: auto;
}

I then took this following simple HTML (test.html) file for testing:

Code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" />
<meta name="author" content="frabjous" />
<meta name="title" content="Prince Test" />
<title>Prince Test</title>
<style type="text/css">
body { font-family: Georgia; }
</style>
</head>
<body>
<p>The quick brown fox jumps over the lazy dog. 0123456789</p>
</body>
</html>

I then ran (calibre):

Code:

ebook-convert test.html test.epub
epub2pdf.sh test.epub test.pdf

The resulting PDF used Droid, as per default.css, not Georgia. However, if I start instead with:

Code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" />
<meta name="author" content="frabjous" />
<meta name="title" content="Prince Test" />
<title>Prince Test</title>
<style type="text/css">
.myparagraph { font-family: Georgia; }
</style>
</head>
<body>
<p class="myparagraph">The quick brown fox jumps over the lazy dog. 0123456789</p>
</body>
</html>

Then the PDF used Georgia, not Droid, so the !important flag is not "cascading down", so to speak.

(I know calibre mucks with the css in conversion to epub, but I got the same results using prince directly on the html files with the "-s ~/.epub2pdf/default.css" option.)

Again, not a huge deal since the usual place for the font specification would be under "body", and if something further down changed it, it's probably got a good reason-- I mainly worry about epubs made with WYSIWYG editors and suchlike that might place the font-family property anywhere.

Quote:

Yes, feel free to code it

When I have more free time, I might try it, though I have almost no experience with Python myself. Still, this is the best way to learn, eh?

Quote:

Oooh... a GUI, it makes me shudder

I think that's quite beyond my goal at the moment, but of course, it would be welcome.

It would really be out of altruism, since a lot more people would use a tool like this if there were a GUI running under Windows/Mac. But it might also further the cause that properly sized PDFs are an ebook format after all!

Quote:

For the moment, let's see if the introduction of this <meta name="pdf-style"> has any acceptance...

I'll cross my fingers!

Jellby · 09-15-2009, 07:00 AM

Quote:

Originally Posted by frabjous

Then the PDF used Georgia, not Droid, so the !important flag is not "cascading down", so to speak.

Yes, that's a consequence of how CSS works. To fully control the format of a (let's call it) "badly structured" book, one would need some knowledge of CSS rules and of the classes used in the document, etc. I don't see a way out of it, other than including a full-featured CSS parser/editor/creator, which, well, I'm absolutely not going to write.

This does not mean it's not possible to override anything (and everything) in the book, just that it may be not so straightforward and "drag-and-drop" as wished. In this particular case I guess you could include in your default.css:

Code:

.myparagraph { font-family: inherit !important; }

and rather than including it in ~/.epub2pdf/default.css, I'd copy it to the current directory, and delete it once the conversion is done.

Quote:

I'll cross my fingers!

Before adopting it, however, it would be desirable to discuss the name a bit. Maybe "pdf-style" is too broad, and something like "prince-style" would be better, since there might be other XHTML-to-PDF converters one could use.

frabjous · 09-15-2009, 11:57 AM

Quote:

Originally Posted by Jellby

Yes, that's a consequence of how CSS works. To fully control the format of a (let's call it) "badly structured" book, one would need some knowledge of CSS rules and of the classes used in the document, etc. I don't see a way out of it, other than including a full-featured CSS parser/editor/creator, which, well, I'm absolutely not going to write.

I know how CSS works in general; I just didn't know about the !important flag and how it worked until now.

But what I had in mind above is something that went through all the CSS of the source and just changed any "font-family: XXX" declarations to "font-family: inherit" (and stripped any obsolete <font face="XXX"> tags) or something like that. I recognize (as I admitted in an earlier post) that this might be dangerous, since there might be a good reason for changing the font in a particular portion (e.g., in a multilingual document). The idea is that this would be an optional feature of the script, one that you would have to enable with a "force font change" option or something like that.

I don't think that would require a full CSS parser/editor. A couple of regex search-and-replace passes should handle it, no? Actually, I think I could probably alter your script accordingly with a few sed lines.
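A minimal sketch of that regex idea, written in Python rather than sed. The function names force_inherit and strip_font_tags are made up for the example, and the patterns are deliberately naive: they would also rewrite the font-family descriptor inside an @font-face rule, and they drop <font> tags entirely, including any size or color attributes.

```python
import re

# One "font-family: ..." declaration, up to (not including) the next ';' or '}'.
FONT_FAMILY = re.compile(r"font-family\s*:\s*[^;}]+", re.IGNORECASE)
# Opening and closing <font> tags; the text between them is left alone.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def force_inherit(css_text):
    """Replace every font-family value with 'inherit'."""
    return FONT_FAMILY.sub("font-family: inherit", css_text)


def strip_font_tags(html_text):
    """Remove <font ...> and </font> tags, keeping their contents."""
    return FONT_TAG.sub("", html_text)


print(force_inherit("body { font-family: Georgia; }"))
print(strip_font_tags('<p><font face="Arial">Hello</font></p>'))
```

This only shows that the search-and-replace part is short. Whether it is safe on a given book is the separate question raised above.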

Quote:

This does not mean it's not possible to override anything (and everything) in the book, just that it may be not so straightforward and "drag-and-drop" as wished. In this particular case I guess you could include in your default.css:

Code:

.myparagraph { font-family: inherit !important; }

That was just an example. I'm not going to guess what class names the CSS of the book uses. If I open the source CSS to look, I might as well just alter the source CSS directly.

Quote:

Before adopting it, however, it would be desirable to discuss the name a bit. Maybe "pdf-style" is too broad, and something like "prince-style" would be better, since there might be other XHTML-to-PDF converters one could use.

prince-style is probably better, unless there's any indication that the Prince extensions are being picked up by other software, which I suppose is a possibility, since they're fairly straightforward and reasonable.

Jellby · 09-15-2009, 01:52 PM

Quote:

Originally Posted by frabjous

But what I had in mind above is something that went through all the CSS of the source and just changed any "font-family: XXX" attributes to "font-family: inherit" (and stripped any obsolete <font face="XXX"> tags) or something like that.

In my opinion, that would be the task of a different kind of ePUB processor/editor, something like Calibre or Sigil. I don't want (at least at the moment) to "modify" the ePUB, just to make the changes that can be done with CSS (i.e., what I'd expect an ePUB reader to be able to do).

Quote:

That was just an example. I'm not going to guess what class names the CSS of the book uses. If I open the source CSS to look, I might as well just alter the source CSS directly.

I know, maybe I didn't explain myself clearly. At the moment what I have in mind is something that works for "well behaved" ePUBs (those that use few classes, with relative units where possible, etc.) For other, messier books, I don't think a robust and simple solution exists other than uncompressing the .epub, having a look at the CSS code and writing an appropriate CSS stylesheet to override it (the advantage over modifying the source CSS is precisely that you don't have to modify the original book, you can distribute the stylesheet and modify it in the future). This is what I meant with "some knowledge of CSS rules and of the classes used in the document".

frabjous · 09-15-2009, 02:10 PM

From what I've read, Stanza, at any rate, allows the user to change the font on the fly, and it reaches down and changes it at every level.

It would be great if someone from the IDPF or somesuch developed a system of standard ePub class names that would get modified or added to only when necessary. I don't know, but they already have official recommendations for good ePub CSS practices, like using relative sizes rather than absolute sizes for subsidiary elements.

Calibre on the other hand, at present, generates a whole bunch of custom class names when it processes any document. Every new use of style="..." inside a tag gets turned into a new calibre class, which is kind of messy.

ahi · 09-15-2009, 02:35 PM

Quote:

Originally Posted by frabjous

From what I've read, Stanza, at any rate, allows the user to change the font on the fly, and it reaches down and changes it at every level.

It would be great if someone from the IDPF or somesuch developed a system of standard ePub class names that would get modified or added to only when necessary. I don't know, but they already have official recommendations for good ePub CSS practices, like using relative sizes rather than absolute sizes for subsidiary elements.

Calibre on the other hand, at present, generates a whole bunch of custom class names when it processes any document. Every new use of style="..." inside a tag gets turned into a new calibre class, which is kind of messy.

Yep... it's just that sort of shenanigans that may well complicate processing HTML beyond that of my heavily filtered RTF import.

- Ahi

Jellby · 09-15-2009, 02:36 PM

I've updated the script uploaded in post #11. Now it uses XMLStarlet to parse the OPF file (I believe that's more robust, and no need for dos2unix now), and I've changed the special meta@name to "prince-style". The default.css file has also been changed a bit: I moved the "@page title" style to the book-specific stylesheet, where it rather belongs, and changed my preferred fonts (grew tired of the default "st" ligatures in FreeSerif).

ahi · 09-15-2009, 02:38 PM

Quote:

Originally Posted by Jellby

I've updated the script uploaded in post #11. Now it uses XMLStarlet to parse the OPF file (I believe that's more robust, and no need for dos2unix now), and I've changed the special meta@name to "prince-style". The default.css file has also been changed a bit: I moved the "@page title" style to the book-specific stylesheet, where it rather belongs, and changed my preferred fonts (grew tired of the default "st" ligatures in FreeSerif).

Yeah... the "st" ligature really only belongs in titles and other "ornamental" or "also ornamental" text. I believe it is not considered to be one of the ligatures that belongs to body text.

- Ahi

frabjous · 09-16-2009, 09:07 AM

Thanks for the new script.

Unfortunately, I can't test it right now because I'm having trouble installing/configuring XML starlet. (I know, I know... I'm the one who recommended it... I should really try something myself before I do that...) I'm determined to get it working though, so I'll let you know.

ahi · 09-16-2009, 09:50 AM

Quote:

Originally Posted by frabjous

Thanks for the new script.

Unfortunately, I can't test it right now because I'm having trouble installing/configuring XML starlet. (I know, I know... I'm the one who recommended it... I should really try something myself before I do that...) I'm determined to get it working though, so I'll let you know.

Jellby, is the XML parsing the most complicated thing your script does?

If so, it might be fairly simple to turn it into a rather plain Python script (which then, I am given to understand, can be with reasonable ease turned into an .exe as well). The preliminary HTML parsing part of pacify should be more than up to the task of fishing a couple of attribute values out of barely structured XML.

- Ahi

Jellby · 09-16-2009, 10:26 AM

Quote:

Originally Posted by frabjous

Unfortunately, I can't test it right now because I'm having trouble installing/configuring XML starlet. (I know, I know... I'm the one who recommended it... I should really try something myself before I do that...) I'm determined to get it working though, so I'll let you know.

Well, I decided on using it mainly because it was so easy to install (it's on the official repositories) even on my oldest system, a Mandriva 2005.

Quote:

Originally Posted by ahi

Jellby, is the XML parsing the most complicated thing your script does?

Yes, it is. The rest is only checking the files exist and passing options to prince. The script boils down to:

Code:

prince -s default.css -s bookstyle.css -o output.pdf Cover.xhtml Chapter-01.xhtml Chapter-02.xhtml ...

after uncompressing the .epub and getting the correct names and paths for all these filenames.
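For anyone reading along in Python, here is a hedged sketch of assembling that command line as an argument list. The function name prince_command is invented for the example, and only the -s and -o options shown above are used.

```python
def prince_command(default_css, book_css, output_pdf, xhtml_files):
    """Build the prince argument list; book_css may be None if the book has none."""
    cmd = ["prince", "-s", default_css]
    if book_css:
        # The book-specific stylesheet goes after the defaults.
        cmd += ["-s", book_css]
    cmd += ["-o", output_pdf]
    cmd += list(xhtml_files)
    return cmd


cmd = prince_command(
    "default.css", "bookstyle.css", "output.pdf",
    ["Cover.xhtml", "Chapter-01.xhtml", "Chapter-02.xhtml"],
)
print(" ".join(cmd))
```

The list can then be handed to subprocess.run(cmd, check=True), which avoids shell quoting problems with file names that contain spaces.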

ahi · 09-16-2009, 10:29 AM

Quote:

Originally Posted by Jellby

Yes, it is. The rest is only checking the files exist and passing options to prince. The script boils down to:

Code:

prince -s default.css -s bookstyle.css -o output.pdf Cover.xhtml Chapter-01.xhtml Chapter-02.xhtml ...

after uncompressing the .epub and getting the correct names and paths for all these filenames.

Can you post what attributes of what tags need to be pulled? Or ought it be obvious from your source even if I have no in-depth knowledge of your choice of scripting language?

- Ahi

Jellby · 09-16-2009, 10:53 AM

I think it should be pretty obvious, the XML parsing is done by XMLStarlet, which uses XPath expressions (I had no knowledge of XPath until yesterday). This is what is needed:

Open the META-INF/container.xml file. There should be a <rootfile> element with a full-path attribute. The value of this attribute is the path to the main OPF file.

Open the main OPF file. There should be a <spine> element there. The <spine> contains a list of <itemref> elements, each of them with an idref attribute. Get the values of these attributes in the order they are defined.

In the OPF file there should be a <manifest> element too. For each idref obtained in the previous step, there should be an <item> element inside the <manifest> with an id attribute identical to the idref. The href attribute of each <item> has the file path and name (relative to the directory where the OPF file is located).

Now you have the ordered list of all the files in the ePUB (actually, assuming there are no fallback items).

To get the "bookstyle.css": Find, in the OPF file, the <metadata> element, and inside it a <meta> element with an attribute name with the value "prince-style". The content attribute of this element is the id that you have to look for in the <manifest>, as done above for the items in the <spine>.

"default.css" and "output.pdf" are command-line or configuration arguments, those are not read from XML.
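The steps above could be sketched in Python with the standard xml.etree.ElementTree module, roughly as follows. This is not the actual script (that one uses XMLStarlet): the function names, the sample container.xml and OPF contents, and the "pdfcss" id are all invented for illustration. Fallback items and URL-encoded hrefs are ignored.

```python
import posixpath
import xml.etree.ElementTree as ET

# Standard namespaces of META-INF/container.xml and of the OPF package file.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def opf_path(container_xml):
    """Return the full-path attribute of the <rootfile> element."""
    root = ET.fromstring(container_xml)
    return root.find(".//c:rootfile", NS).get("full-path")


def spine_files(opf_xml, opf_file):
    """Return (ordered content files, prince-style stylesheet or None)."""
    root = ET.fromstring(opf_xml)
    base = posixpath.dirname(opf_file)  # hrefs are relative to the OPF location
    hrefs = {i.get("id"): i.get("href")
             for i in root.findall("opf:manifest/opf:item", NS)}
    # Follow the <spine> order, looking each idref up in the <manifest>.
    files = [posixpath.join(base, hrefs[r.get("idref")])
             for r in root.findall("opf:spine/opf:itemref", NS)]
    meta = root.find("opf:metadata/opf:meta[@name='prince-style']", NS)
    style = None
    if meta is not None:
        style = posixpath.join(base, hrefs[meta.get("content")])
    return files, style


# Hypothetical sample documents, only to exercise the functions.
CONTAINER = """<container version="1.0"
 xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles><rootfile full-path="OEBPS/content.opf"
 media-type="application/oebps-package+xml"/></rootfiles>
</container>"""

OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
<metadata><meta name="prince-style" content="pdfcss"/></metadata>
<manifest>
<item id="ch2" href="Chapter-02.xhtml" media-type="application/xhtml+xml"/>
<item id="ch1" href="Chapter-01.xhtml" media-type="application/xhtml+xml"/>
<item id="pdfcss" href="bookstyle.css" media-type="text/css"/>
</manifest>
<spine><itemref idref="ch1"/><itemref idref="ch2"/></spine>
</package>"""

opf_file = opf_path(CONTAINER)
files, style = spine_files(OPF, opf_file)
print(opf_file, files, style)
```

Note that the file order comes from the <spine>, not from the order of the <item> entries in the <manifest>, which is why the sample lists the chapters out of order there.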

ahi · 09-16-2009, 11:09 AM

Quote:

Originally Posted by Jellby

I think it should be pretty obvious, the XML parsing is done by XMLStarlet, which uses XPath expressions (I had no knowledge of XPath until yesterday). This is what is needed:

Open the META-INF/container.xml file. There should be a <rootfile> element with a full-path attribute. The value of this attribute is the path to the main OPF file.

Open the main OPF file. There should be a <spine> element there. The <spine> contains a list of <itemref> elements, each of them with an idref attribute. Get the values of these attributes in the order they are defined.

In the OPF file there should be a <manifest> element too. For each idref obtained in the previous step, there should be an <item> element inside the <manifest> with an id attribute identical to the idref. The href attribute of each <item> has the file path and name (relative to the directory where the OPF file is located).

Now you have the ordered list of all the files in the ePUB (actually, assuming there are no fallback items).

To get the "bookstyle.css": Find, in the OPF file, the <metadata> element, and inside it a <meta> element with an attribute name with the value "prince-style". The content attribute of this element is the id that you have to look for in the <manifest>, as done above for the items in the <spine>.

"default.css" and "output.pdf" are command-line or configuration arguments, those are not read from XML.

Well, as I have time (maybe tonight, but definitely in the next few days), I'll whip up a python script for that... and will, once completed, relinquish it to you!

If you picked up XPath as quickly as you did, you'll probably get Python easily enough as well. It's a great language, albeit you might have to make peace with some of its oddities.

The CSS stuff doesn't compromise the final PDF output? In a LaTeX context, my intuition would be to assume less is more and ignore CSS clowning around, in favour of LaTeX class defaults (whether customized or not).

- Ahi

Jellby · 09-16-2009, 11:47 AM

Quote:

Originally Posted by ahi

The CSS stuff doesn't compromise the final PDF output?

Not if the CSS is well designed. The intent is not "fixing" arbitrary ePUBs, but converting good ePUBs into good PDFs.

If the CSS is so bad that one would be better off dropping it, one could pass --no-author-style to prince. I guess I could add an option for this in the script; that should address Frabjous's worries with fonts as well.
