Where to look for conversion problems?

Claghorn · 08-18-2012, 04:18 PM

I'm trying to learn how to perfect the epub files I generate, and I started with the largest and most complex ebook I own to maximize my problems :-).

This is a Kindle edition of the four-book bundle containing the first four volumes of Game of Thrones.

It is an "old" format mobi file, not the new K8 version, but I can unpack it with the mobi unpack plugin, and run the massive html file through tidy -xml -indent to make it easier to read.

As near as I can tell, the .ncx file is a perfectly correct table of contents, and in the html itself there is an mbp pagebreak tag before every chapter, with an anchor right after the pagebreak which the TOC points at. In other words, all the structure seems to be correctly defined.
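A check like this can be automated rather than done by eye. The following is only a minimal sketch, assuming the unpacked book is a single html file, that the TOC entries use ordinary `file#fragment` targets, and that page breaks appear literally as `<mbp:pagebreak .../>`; the function names and the 200-character window are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

NCX_URI = "http://www.daisy.org/z3986/2005/ncx/"
NCX_NS = {"ncx": NCX_URI}

# Regexes are used for the html because mbp:pagebreak uses an undeclared
# namespace prefix, which a strict XML parser would reject.
ANCHOR_RE = re.compile(r'<a\b[^>]*\b(?:id|name)\s*=\s*"([^"]+)"', re.IGNORECASE)
PAGEBREAK_RE = re.compile(r'<mbp:pagebreak\b[^>]*>', re.IGNORECASE)


def ncx_targets(ncx_text):
    """Return (label, fragment) pairs for every navPoint in the NCX."""
    root = ET.fromstring(ncx_text)
    pairs = []
    for point in root.iter("{%s}navPoint" % NCX_URI):
        label = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        src = content.get("src", "") if content is not None else ""
        pairs.append((label.strip(), src.partition("#")[2]))
    return pairs


def check_targets(ncx_text, html_text, window=200):
    """List TOC entries whose anchor is missing or not right after a pagebreak.

    `window` is an arbitrary allowance (in characters) for whitespace or
    wrapper markup between the pagebreak and the anchor.
    """
    anchors = {m.group(1): m.start() for m in ANCHOR_RE.finditer(html_text)}
    break_ends = [m.end() for m in PAGEBREAK_RE.finditer(html_text)]
    problems = []
    for label, fragment in ncx_targets(ncx_text):
        pos = anchors.get(fragment)
        if pos is None:
            problems.append((label, fragment, "anchor missing"))
        elif not any(0 <= pos - end <= window for end in break_ends):
            problems.append((label, fragment, "no pagebreak just before"))
    return problems
```

An empty result would support the claim that the source structure is sound, which points the finger at the conversion settings rather than the book.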

But when I run it through a MOBI to EPUB conversion, I get lots of broken TOC entries and missing page breaks. All the front sections with the cover, title page, copyright, etc. are jammed together with no page breaks.

The biggest problem, though, is the chapters. In the book, each chapter starts with a little decorative image. The html anchors all come after the pagebreak and before the image, yet sometimes (in about 20% of the chapters) the resulting epub file has the image at the end of the previous page and the text of the chapter starts on the next page. It is always the same chapters that do this, yet I can't see anything in the html that makes those chapters different.

These are all massive files and also copyrighted, so there isn't really anything I can post anywhere, so I'm just wondering if anyone has any advice about how to proceed?

Is there some option I'm not noticing in the conversion that says "Trust the .ncx file"?

Or some other option that says "Always split the epub at the mbp pagebreak tags"?

Claghorn · 08-18-2012, 08:57 PM

I've been fooling with editing the input stage of the debug output, and wrote a Perl script that modifies the html files to put header markup with class="chapter" around the anchors the toc.ncx file points at. Then I do the conversion, matching h1 through h3 headers with class="chapter", and tell it to force replace the TOC. This works nearly flawlessly, and leaves me even more confused about why the manual says the conversion will use the existing TOC if the file has one. All I've done is move the existing TOC into the html, so the result should be the same.
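The wrapping step might look roughly like the sketch below. It is written in Python rather than Perl and is not the script described above: the function name, the choice of h2, and the assumption that the TOC anchors are empty `<a>` elements (self-closed or paired) are all illustrative.

```python
import re

# Matches an empty anchor, either self-closed (<a id="x"/>) or paired
# (<a id="x"></a>). Anchors that enclose text are deliberately not matched.
EMPTY_ANCHOR_RE = re.compile(
    r'<a\b[^>]*\b(?:id|name)\s*=\s*"([^"]+)"[^>]*?(?:/>|>\s*</a>)',
    re.IGNORECASE,
)


def wrap_toc_anchors(html_text, fragments, tag="h2", css_class="chapter"):
    """Wrap each anchor named in `fragments` in <tag class="css_class">.

    `fragments` is the collection of fragment ids the toc.ncx points at.
    """
    wanted = set(fragments)

    def repl(match):
        if match.group(1) not in wanted:
            return match.group(0)  # not a TOC target, leave untouched
        return '<%s class="%s">%s</%s>' % (tag, css_class, match.group(0), tag)

    return EMPTY_ANCHOR_RE.sub(repl, html_text)
```

With the markers in place, a chapter-detection XPath along the lines of `//*[(name()='h1' or name()='h2' or name()='h3') and @class='chapter']` would select them; that expression is an example of the form, not necessarily the one used here.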

Anyway, this does seem to solve my problems, and the few errors left appear to be actual bugs in the original doc which I can fix by hand.

Next I need to finish my script to convert all the opaque grayscale .jpeg files into .png files with transparency so they'll look natural when you change the background color.
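One way to think about that conversion, as a sketch only: treat each gray value as black ink whose opacity equals its darkness, so white becomes fully transparent and black stays fully opaque. The helper names below are hypothetical, and the code works on bare pixel values; reading the .jpeg and writing the .png would need an imaging library, which is left out.

```python
def gray_to_ink(gray):
    """Map an 8-bit gray value to RGBA: black with alpha equal to darkness."""
    if not 0 <= gray <= 255:
        raise ValueError("gray must be in 0..255")
    return (0, 0, 0, 255 - gray)


def composite(rgba, background):
    """Alpha-blend one RGBA pixel over an opaque RGB background pixel.

    Uses integer arithmetic, rounding to the nearest value.
    """
    r, g, b, a = rgba
    return tuple(
        (c * a + bg * (255 - a) + 127) // 255
        for c, bg in zip((r, g, b), background)
    )


def convert_row(gray_row):
    """Convert one row of gray pixels to a row of RGBA pixels."""
    return [gray_to_ink(g) for g in gray_row]
```

Under this mapping the image composited over a white page reproduces the original gray values, while over a tinted page the light areas take on the page color instead of showing a white box.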


