Converting pandoc generated HTML to ePUB with Calibre

Wintermute · 04-14-2011, 07:38 PM

Hi,

I recently formatted a (very long) book with pandoc, which I find really convenient for generating HTML. I then convert the HTML to ePUB using calibre. Pandoc can generate ePUB itself, but the process is largely automatic and offers little customization. For instance, TOC levels cannot be customized (only chapters appear in the TOC). That's why I use calibre for the HTML-to-ePUB step: from the command line I can control many aspects of the generated ePUB (cover, TOC, etc.).

The problem is that the book I'm converting is in Spanish, and some of the chapter titles contain accented characters (á é í ó ú). For every section heading (h1, h2, etc.) pandoc generates an id that you can use to refer to that element in the text. For example, if a chapter is entitled "Introducción", pandoc generates this in the HTML:

Code:

<h1 id="introducción">Introducción</h1>

Calibre crashes if an hX heading contains non-ASCII characters.

Here's calibre's output.

Code:

Converting ebook with calibre
1% Converting input to HTML...
InputFormatPlugin: HTML Input running
on /home/Literature/Calibre-tests y pruebas/test-pandoc/pandoc-example.html
Language not specified
Building file list...
Normalizing filename cases
Rewriting HTML links
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
        Detected chapter: My Book
        Detected chapter: Chapter One
        Detected chapter: Chapter Two
Auto generated TOC with 12 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Cleaning up manifest...
Trimming unused files from manifest...
Creating EPUB Output...
67% Creating EPUB Output
Traceback (most recent call last):
  File "/usr/bin/ebook-convert", line 19, in <module>
    sys.exit(main())
  File "/usr/lib/calibre/calibre/ebooks/conversion/cli.py", line 279, in main
    plumber.run()
  File "/usr/lib/calibre/calibre/ebooks/conversion/plumber.py", line 1018, in run
    self.opts, self.log)
  File "/usr/lib/calibre/calibre/ebooks/epub/output.py", line 169, in convert
    split(self.oeb, self.opts)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 57, in __call__
    self.split_item(item)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 64, in split_item
    page_breaks, page_break_ids = self.find_page_breaks(item)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 123, in find_page_breaks
    page_breaks_.append((XPath('//*[@id=%r]'%id),
  File "xpath.pxi", line 446, in lxml.etree.XPath.__init__ (src/lxml/lxml.etree.c:115005)
  File "xpath.pxi", line 214, in lxml.etree._XPathEvaluatorBase._raise_parse_error (src/lxml/lxml.etree.c:112698)
lxml.etree.XPathSyntaxError: Invalid predicate

Is this normal calibre behaviour, or is it a pandoc bug?

Thanks in advance for your help.

kovidgoyal · 04-14-2011, 07:59 PM

The problem is the non-ASCII characters in the id attribute. That is illegal in XHTML, as far as I recall.
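
One possible workaround, sketched here under assumptions rather than as a tested fix, is to rewrite the ids to plain ASCII before handing the HTML to calibre. The helper names `ascii_id` and `asciify_ids` are hypothetical (they are not part of pandoc or calibre), and the sketch assumes double-quoted attributes like the ones in the pandoc output above. It uses regular expressions rather than a real HTML parser, so ids that differ only in their accents would collide and should be checked by hand.

```python
import re
import unicodedata


def ascii_id(value):
    """Drop accents from an id value, e.g. 'introducción' -> 'introduccion'."""
    # NFKD splits 'ó' into 'o' plus a combining accent; the accent is then
    # discarded because it cannot be encoded as ASCII.
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def asciify_ids(html):
    """Rewrite id="..." values and in-document href="#..." links to ASCII."""
    def fix(match):
        return match.group(1) + ascii_id(match.group(2)) + match.group(3)

    # Assumption: attributes are double-quoted, as in the example heading.
    html = re.sub(r'(\sid=")([^"]*)(")', fix, html)
    html = re.sub(r'(\shref="#)([^"]*)(")', fix, html)
    return html


if __name__ == "__main__":
    print(asciify_ids('<h1 id="introducción">Introducción</h1>'))
```

Only the attribute values change; the visible heading text keeps its accents. Links are rewritten with the same function so that cross-references inside the book keep pointing at the renamed ids.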

Wintermute · 04-15-2011, 01:25 PM

Quote:

Originally Posted by kovidgoyal

The problem is the non-ASCII characters in the id attribute. That is illegal in XHTML, as far as I recall.

Thanks Kovid.
