08-27-2011, 05:24 PM | #1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2011
Device: Kindle 3
|
TOC errors in PDF to EPUB conversion - software or user error?
Hi,
I've only been using Calibre and a Kindle for a few days. Before saying anything else, let me state how massively impressed I am by Calibre's range of features and the quality of its metadata handling. I should also mention that I have read these threads:
* Read this before Posting PDF Questions
* Chapter Detection/Table of Contents Tutorial
Like many others, I have masses of ebooks in PDF format, which I'd like to read on my Kindle. Fortunately, many of them have very simple (1-column) layouts and typography. For such ebooks, all I really need is to convert 'unflowed' PDF text into 'flowable' MOBI text (although I'd like to keep 'canonical' copies of my ebooks in epub format, for editability). I don't really need hypertext TOCs, although I'd be happy to use them if they worked.
Here are the two things that are frustrating me:
1. I'm unable to stop Calibre from inserting an HTML TOC within the main text of the epub output, i.e. within index_split_000.html. ... I really don't want this TOC at all, but I especially don't want it to look so hideous and its hypertext links to have no matching target anchors.
2. Calibre is amazingly good at detecting chapter breaks within the PDFs I've sampled, and the chapter headings listed in toc.ncx are surprisingly accurate (given the crapness of PDF 'markup') ... but ... the ids which they point to in index_split_00*.html just don't exist. Since the converter is so accurately identifying the chapter headings, why isn't it adding ids to the chapter heading elements in the epub HTML? ... even if those elements are 'p' tags rather than h2, etc.
It's mentally trivial, though a bit laborious, to add the missing ids to the html files in the epub, but surely there is no technical reason why the converter can't do this automatically? Similarly, I can delete the hideous HTML TOC at the beginning of index_split_000.html by hand, but why should I have to bother?
I suspect that I am missing some blindingly obvious settings, somewhere ...
Can someone give some clues where they might be and what they are called? |
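For anyone wanting to confirm the second symptom before editing by hand, here is a minimal sketch that lists toc.ncx entries whose target file or id is absent from the epub. It is not part of Calibre; the function name find_dangling_ncx_targets is made up for illustration, and it assumes the NCX uses double-quoted src attributes and that a regex scan for id/name attributes is good enough for a quick check.

```python
import posixpath
import re
import zipfile
from urllib.parse import unquote


def find_dangling_ncx_targets(epub_path):
    """Return (src, reason) pairs for NCX entries whose target is missing.

    Hypothetical helper: scans every .ncx file in the epub and checks that
    each <content src="file#fragment"> points at an existing file and id.
    """
    dangling = []
    with zipfile.ZipFile(epub_path) as epub:
        names = set(epub.namelist())
        for ncx_name in sorted(n for n in names if n.lower().endswith(".ncx")):
            ncx_dir = posixpath.dirname(ncx_name)
            ncx_text = epub.read(ncx_name).decode("utf-8", errors="replace")
            for src in re.findall(r'<content[^>]*\bsrc="([^"]+)"', ncx_text):
                file_part, _, fragment = unquote(src).partition("#")
                # src is relative to the directory holding the NCX file
                member = posixpath.normpath(posixpath.join(ncx_dir, file_part))
                if member not in names:
                    dangling.append((src, "file not found"))
                    continue
                if not fragment:
                    continue  # link to the top of the file, nothing to check
                html = epub.read(member).decode("utf-8", errors="replace")
                anchor = r'\b(?:id|name)\s*=\s*["\']%s["\']' % re.escape(fragment)
                if not re.search(anchor, html):
                    dangling.append((src, "anchor not found"))
    return dangling
```

Calling find_dangling_ncx_targets("book.epub") (the filename is a placeholder) returns an empty list when every toc.ncx entry resolves.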
08-27-2011, 06:11 PM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Any TOC you see at the beginning of the pdf would be a side effect of the original pdf's inline/user visible TOC. As far as Calibre is concerned that TOC is just a bunch of text indistinguishable from any other text in the book. So it just leaves it in place. While it's easy for a human to see and fix that problem, creating a heuristic for a machine to accurately guess where that text is would be quite difficult.
|
08-27-2011, 10:46 PM | #3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2011
Device: Kindle 3
|
Please ignore previous content of this comment. I was looking at the wrong test PDF.
What I now see is a user-visible, hypertext-enabled PDF ToC, i.e. exactly what you correctly identified. But I still don't really understand what is going on. Outstanding questions:
1. If the converter sees the embedded ToC as mere text, why does it put link anchors around the ToC items in the epub HTML?
2. If, as it seems, the converter recognises that the PDF ToC is hypertext, why can't it identify the link targets in exactly the same way as a non-Adobe PDF reader would do? I assume that the PDF standard is pretty explicit about how link anchors are encoded (i.e. no need for complex heuristics)?
Last edited by AhShoo5n; 08-27-2011 at 11:19 PM. |
08-28-2011, 02:12 AM | #4 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
The 'converter' consists of two components. The first is a third-party utility, Poppler's pdftohtml. Pdftohtml creates some basic html markup which is essentially unreadable as an ebook - enable the debug output to see what Poppler's raw html looks like. The links you are seeing are generated by pdftohtml - all they do is link to a page number rather than to the actual chapter heading.
Calibre then does a considerable amount of massaging of that html code to make something readable as an ebook. One of these steps is fixing paragraphs/sentences across page breaks and line breaks. Retaining the hyperlinks would break this algorithm (since every page has a hyperlink whether it's used by a TOC or not), so it's either retain your links or have broken sentences... Fixing broken sentences won out. There's always room for improvement, and patches are welcome. But odds are it won't happen with the current version of the pdf conversion code. |
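To give a feel for the kind of clean-up involved, here is a toy sketch of line unwrapping. It is not Calibre's actual algorithm; the function unwrap_lines and the short_line_ratio parameter are invented for illustration. The assumption is that a paragraph ends only on a line that is both noticeably shorter than the longest line and finished with sentence-ending punctuation; everything else, including blank lines left by page breaks, is joined.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes/brackets.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")


def unwrap_lines(lines, short_line_ratio=0.8):
    """Join hard-wrapped lines into paragraphs (illustrative heuristic only)."""
    stripped = [line.strip() for line in lines if line.strip()]
    if not stripped:
        return []
    max_len = max(len(line) for line in stripped)
    paragraphs = []
    current = ""
    for line in stripped:
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Assume a word was hyphenated at the line end. This also
            # (wrongly) merges genuine compounds such as "well-" + "known".
            current = current[:-1] + line
        else:
            current += " " + line
        is_short = len(line) < short_line_ratio * max_len
        if is_short and SENTENCE_END.search(line):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs
```

A heuristic like this only looks at text. That makes it easier to see why extra markup at every page boundary, such as per-page link anchors, gets in the way of stitching sentences back together.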
08-28-2011, 08:05 AM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2011
Device: Kindle 3
|
Thank you so much ldolse.
That really helped me understand what is going on. I'd guessed that it had something to do with the way the page-breaking heuristics work. It seems like I've been using Poppler's pdftohtml since the 1990s (probably have been), so it's about time I looked into it properly. Thanks again. |
Tags |
conversion, epub, pdf, toc, toc.ncx |