

Old 08-27-2011, 06:24 PM   #1
AhShoo5n
Junior Member
AhShoo5n began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Aug 2011
Device: Kindle 3
TOC errors in PDF to EPUB conversion - software or user error?

Hi,
I've only been using Calibre and a Kindle for a few days. And before saying anything else let me state how massively impressed I am by Calibre's range of features and the quality of its metadata handling.

I should also mention that I have read these threads:
* Read this before Posting PDF Questions
* Chapter Detection/Table of Contents Tutorial

Like many others, I have masses of ebooks in PDF format, which I'd like to read on my Kindle. Fortunately, many of them have very simple (1-column) layouts and typography.
For such ebooks, all I really need is to convert 'unflowed' PDF text into 'flowable' MOBI text (although I'd like to keep 'canonical' copies of my ebooks in epub format, for editability).

I don't really need hypertext TOCs, although I'd be happy to use them if they worked.

Here are the two things that are frustrating me:

1. I'm unable to stop Calibre from inserting an HTML TOC within the main text of the epub output, i.e. within index_split_000.html. ... I really don't want this TOC at all, but I especially don't want it to look so hideous, or its hypertext links to have no matching target anchors.

2. Calibre is amazingly good at detecting chapter breaks within the PDFs I've sampled, and the chapter headings listed in toc.ncx are surprisingly accurate (given the crapness of PDF 'markup') ... but ... the ids which they point to in index_split_00*.html just don't exist.

Since the converter is so accurately identifying the chapter headings, why isn't it adding ids to the chapter heading elements in the epub HTML? ... even if those elements are 'p' tags rather than h2, etc.

It's mentally trivial, though a bit laborious, to add the missing ids to the html files in the epub, but surely there is no technical reason why the converter can't do this automatically?
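Before adding ids by hand, it may help to list exactly which toc.ncx targets are dangling. The following is only a rough sketch, not anything Calibre provides: the function name `missing_toc_targets` and the dict-of-strings input are hypothetical, and the id check is a crude regex rather than a real HTML parse.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by NCX documents.
NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"


def missing_toc_targets(ncx_xml, html_files):
    """Return (file, fragment) pairs from toc.ncx whose id is absent from the HTML.

    ncx_xml:    text of toc.ncx
    html_files: dict mapping file name -> HTML text (hypothetical input shape)
    """
    missing = []
    root = ET.fromstring(ncx_xml)
    for content in root.iter(NCX_NS + "content"):
        src = content.get("src", "")
        file_name, _, fragment = src.partition("#")
        if not fragment:
            continue  # link to a whole file, nothing to check
        html = html_files.get(file_name, "")
        # Crude attribute match; a real tool would use an HTML parser.
        pattern = r'\bid\s*=\s*["\']' + re.escape(fragment) + r'["\']'
        if not re.search(pattern, html):
            missing.append((file_name, fragment))
    return missing
```

Feeding it the unzipped epub's toc.ncx and the index_split_00*.html files should give a short list of the ids that would need to be added to the heading elements.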

Similarly, I can delete the hideous HTML TOC at the beginning of index_split_000.html by hand, but why should I have to bother?

I suspect that I am missing some blindingly obvious settings, somewhere ...

Can someone give some clues where they might be and what they are called?
Old 08-27-2011, 07:11 PM   #2
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Any TOC you see at the beginning of the pdf would be a side effect of the original pdf's inline/user visible TOC. As far as Calibre is concerned that TOC is just a bunch of text indistinguishable from any other text in the book. So it just leaves it in place. While it's easy for a human to see and fix that problem, creating a heuristic for a machine to accurately guess where that text is would be quite difficult.
Old 08-27-2011, 11:46 PM   #3
AhShoo5n
Please ignore previous content of this comment. I was looking at the wrong test PDF.

What I now see is a user-visible, hypertext-enabled, PDF ToC, i.e. exactly what you correctly identified. But I still don't really understand what is going on.

Outstanding questions:

1. If the converter sees the embedded ToC as mere text, why does it put link anchors around the ToC items in the epub HTML?

2. If, as it seems, the converter recognises that the PDF ToC is hypertext, why can't it identify the link targets in exactly the same way as a non-Adobe PDF reader would do? I assume that the PDF standard is pretty explicit about how link anchors are encoded (i.e. no need for complex heuristics)?

Last edited by AhShoo5n; 08-28-2011 at 12:19 AM.
Old 08-28-2011, 03:12 AM   #4
ldolse
The 'converter' consists of two components - one is a third-party utility, Poppler's pdftohtml. Pdftohtml creates some basic html markup which is essentially completely unreadable as an ebook - enable the debug output to see what Poppler's raw html looks like. The links are generated by this code - all they do is link to a page number rather than the actual chapter heading.

Calibre then does a considerable amount of massaging of that html code to make something readable as an ebook. One of these steps is fixing paragraphs/sentences across page breaks and line breaks. Retaining the hyperlinks would break this algorithm (since every page has a hyperlink whether it's used by a TOC or not), so it's either retain your links or have broken sentences... Fixing broken sentences won out.
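To make the trade-off concrete, here is a deliberately simplified sketch of the idea. It is not Calibre's actual code: the function names are hypothetical, the `<a name=N></a>` per-page anchor format is an assumption about what the raw html contains, and the "line ends in punctuation" test is a stand-in for a much more involved heuristic.

```python
import re

# Characters that plausibly end a sentence or paragraph (an assumption of this sketch).
TERMINATORS = (".", "!", "?", ":", '"', "\u201d")

# Assumed shape of the per-page anchors in the raw html, e.g. <a name=2></a>.
PAGE_ANCHOR = re.compile(r'<a name="?\d+"?></a>', re.IGNORECASE)


def strip_page_anchors(html_lines):
    """Drop the per-page anchors so they do not interrupt sentence joining."""
    return [PAGE_ANCHOR.sub("", line) for line in html_lines]


def unwrap_lines(lines):
    """Join hard-wrapped lines into paragraphs.

    A line that does not end in terminal punctuation is assumed to continue
    on the next line, even if a page break falls between the two.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        current = f"{current} {line}" if current else line
        if current.endswith(TERMINATORS):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)  # trailing text with no terminator
    return paragraphs
```

Once the page anchors are gone, a sentence that straddles two pages joins cleanly, but any link that pointed at one of those anchors no longer has a target, which is the behaviour described above.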

There's always room for improvement, and patches are welcome. But odds are it won't happen with the current version of the pdf conversion code.
Old 08-28-2011, 09:05 AM   #5
AhShoo5n
Thank you so much ldolse.

That really helped me understand what is going on. I'd guessed that it had something to do with the way the page-breaking heuristics work.

It seems like I've been using Poppler's pdftohtml since the 1990s (probably have been), so it's about time I looked into it properly.

Thanks again.

Tags
conversion, epub, pdf, toc, toc.ncx

All times are GMT -4.

