08-27-2011, 05:24 PM   #1
AhShoo5n
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2011
Device: Kindle 3
TOC errors in PDF to EPUB conversion - software or user error?

Hi,
I've only been using Calibre and a Kindle for a few days. Before saying anything else, let me state how massively impressed I am by Calibre's range of features and the quality of its metadata handling.

I should also mention that I have read these threads:
* Read this before Posting PDF Questions
* Chapter Detection/Table of Contents Tutorial

Like many others, I have masses of ebooks in PDF format, which I'd like to read on my Kindle. Fortunately, many of them have very simple (1-column) layouts and typography.
For such ebooks, all I really need is to convert 'unflowed' PDF text into 'flowable' MOBI text (although I'd like to keep 'canonical' copies of my ebooks in EPUB format, for editability).

I don't really need hypertext TOCs, although I'd be happy to use them if they worked.

Here are the two things that are frustrating me:

1. I'm unable to stop Calibre from inserting an HTML TOC within the main text of the epub output, i.e. within index_split_000.html. ... I really don't want this TOC at all, but I especially don't want it to look so hideous, or its hypertext links to have no matching target anchors.

2. Calibre is amazingly good at detecting chapter breaks within the PDFs I've sampled, and the chapter headings listed in toc.ncx are surprisingly accurate (given the crapness of PDF 'markup') ... but ... the ids which they point to in index_split_00*.html just don't exist.
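
To make problem 2 concrete, here is a rough Python sketch (my own, not anything Calibre provides) that lists the toc.ncx targets whose ids can't be found in the files they point to. It assumes toc.ncx sits at the top level of the epub zip (pass ncx_name otherwise), that the src values are not percent-encoded, and that a crude regex scan for id/name attributes is good enough; the function name find_dangling_targets is just illustrative.

```python
import posixpath
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by NCX documents.
NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"
# Crude scan for id="..." / name="..." attributes (assumption: good enough here).
ID_RE = re.compile(r'(?<![\w-])(?:id|name)\s*=\s*["\']([^"\']+)["\']')


def find_dangling_targets(epub, ncx_name="toc.ncx"):
    """Return the toc.ncx src values whose #fragment has no matching id."""
    dangling = []
    with zipfile.ZipFile(epub) as zf:
        root = ET.fromstring(zf.read(ncx_name))
        ncx_dir = posixpath.dirname(ncx_name)
        ids_by_file = {}  # cache: zip member name -> set of ids found in it
        for content in root.iter(NCX_NS + "content"):
            src = content.get("src", "")
            path, _, fragment = src.partition("#")
            if not fragment:
                continue  # points at a whole file, so no anchor is needed
            member = posixpath.normpath(posixpath.join(ncx_dir, path))
            if member not in ids_by_file:
                try:
                    html = zf.read(member).decode("utf-8", errors="replace")
                except KeyError:
                    html = ""  # target file is missing from the zip
                ids_by_file[member] = set(ID_RE.findall(html))
            if fragment not in ids_by_file[member]:
                dangling.append(src)
    return dangling
```

Calling find_dangling_targets("book.epub") (a placeholder file name) returns a list of entries such as "index_split_001.html#some_id" for every TOC entry with no matching anchor; an empty list would mean the links are fine.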

Since the converter is so accurately identifying the chapter headings, why isn't it adding ids to the chapter heading elements in the epub HTML? ... even if those elements are 'p' tags rather than h2, etc.

It's mentally trivial, though a bit laborious, to add the missing ids to the html files in the epub, but surely there is no technical reason why the converter can't do this automatically?

Similarly, I can delete the hideous HTML TOC at the beginning of index_split_000.html by hand, but why should I have to bother?

I suspect that I am missing some blindingly obvious settings, somewhere ...

Can someone give some clues where they might be and what they are called?