Old 02-20-2012, 01:59 PM   #1
nimblebooks
Enthusiast
nimblebooks began at the beginning.
 
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
HTML input plugin stripping text within toc tags in child html file

Hi,

Same problem as a while ago but have done some more testing. Files attached.

ebook-convert tmp/temptoc.html $mediatargetpath$sku".epub" --max-levels=1 --toc-threshold=6 --cover=$imagedir$sku$cover_image_extension --book-producer="Nimble Combinatorial Publishing" --publisher="Nimble Combinatorial Publishing" --max-toc-links=20 --preserve-cover-aspect-ratio -vv --debug-pipeline="debug" --duplicate-links-in-toc --chapter="/"

From the debug output, I can tell that the conversion is getting messed up in the input plugin stage. The following HTML in the source file safe1.html, generated from the API call to http://en.wikipedia.org/w/index.php?...eship_Bismarck

Code:
<table id="toc" class="toc">
<tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"><span class="tocnumber">1</span> <span class="toctext">Construction and characteristics</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#Service_history"><span class="tocnumber">2</span> <span class="toctext">Service history</span></a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"><span class="tocnumber">2.1</span> <span class="toctext">Operation Rheinübung</span></a>
<ul>
<li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"><span class="tocnumber">2.1.1</span> <span class="toctext">Battle of the Denmark Strait</span></a></li>
<li class="toclevel-3 tocsection-5"><a href="#The_chase"><span class="tocnumber">2.1.2</span> <span class="toctext">The chase</span></a></li>
<li class="toclevel-3 tocsection-6"><a href="#Sinking"><span class="tocnumber">2.1.3</span> <span class="toctext">Sinking</span></a></li>
</ul>
</li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"><span class="tocnumber">3</span> <span class="toctext">Media portrayals of sinking</span></a></li>
<li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"><span class="tocnumber">4</span> <span class="toctext">Discovery of the wreck</span></a>
<ul>
<li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"><span class="tocnumber">4.1</span> <span class="toctext">Discovery by Robert Ballard</span></a></li>
<li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"><span class="tocnumber">4.2</span> <span class="toctext">Subsequent expeditions</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"><span class="tocnumber">5</span> <span class="toctext">References in the Wehrmachtbericht</span></a></li>
<li class="toclevel-1 tocsection-12"><a href="#Footnotes"><span class="tocnumber">6</span> <span class="toctext">Footnotes</span></a></li>
<li class="toclevel-1 tocsection-13"><a href="#References"><span class="tocnumber">7</span> <span class="toctext">References</span></a></li>
<li class="toclevel-1 tocsection-14"><a href="#Further_Reading"><span class="tocnumber">8</span> <span class="toctext">Further Reading</span></a></li>
</ul>
</td>
</tr>
</table>
becomes in debug/input/1safe.html:

Code:
<table class="toc" id="toc">
<tbody><tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"> </a></li>
<li class="toclevel-1 tocsection-2"><a href="#Service_history"> </a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"> </a>
<ul>
<li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"> </a></li>
<li class="toclevel-3 tocsection-5"><a href="#The_chase"> </a></li>
<li class="toclevel-3 tocsection-6"><a href="#Sinking"> </a></li>
</ul>
</li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"> </a></li>
<li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"> </a>
<ul>
<li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"> </a></li>
<li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"> </a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"> </a></li>
<li class="toclevel-1 tocsection-12"><a href="#Footnotes"> </a></li>
<li class="toclevel-1 tocsection-13"><a href="#References"> </a></li>
<li class="toclevel-1 tocsection-14"><a href="#Further_Reading"> </a></li>
</ul>
</td>
</tr>
</tbody></table>
after it is passed through the input plugin.
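
One way to confirm exactly which links lost their text is to scan both files for anchors whose visible text is empty. The following is only a minimal sketch using Python's standard-library `html.parser`; the names `LinkTextCollector` and `empty_links` are hypothetical and are not part of Calibre.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, visible text) for every <a href=...> element."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: entities arrive as text
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text inside nested <span> elements is collected too.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def empty_links(html):
    """Return the hrefs of all links whose text is empty or whitespace only."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if not text]
```

Running `empty_links` on the contents of the source file and then on the file under debug/input should show no empty links in the first case and every TOC entry in the second, if the behaviour matches the two listings above.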

I simplified the TOC HTML as much as possible and wrapped the simplest possible HTML around the API HTML found in source file/safe1.html.

What's happening here?

Any help disentangling this "messy" HTML would be much appreciated!

Fred
Attached Files
File Type: zip input.zip (33.1 KB, 334 views)
File Type: zip source_files.zip (71.6 KB, 366 views)
Old 02-21-2012, 01:09 AM   #2
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Once again, fix the broken html.
Old 02-21-2012, 07:59 AM   #3
nimblebooks
When you say "broken html", what do you mean?
Old 02-21-2012, 03:24 PM   #4
nimblebooks
Actually, Calibre is doing remarkably well in parsing the HTML from the MediaWiki API, with the sole exception of the HTML for this "Contents" infobox, so I don't really see any need for additional changes to the HTML. Since no additional explanation of how the HTML input plugin works or what it expects seems to be available, I am now simply stripping out the problematic chunk of HTML from the original document before sending it to Calibre, which is handling the rest of the document very nicely!
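
For anyone wanting to try the same workaround, stripping the "Contents" box before conversion could look roughly like the sketch below. It is not the script actually used here: it relies only on Python's standard-library `html.parser`, the names `TocStripper` and `strip_toc` are hypothetical, and it assumes the box is always a `<table id="toc">` as in the sample in post #1.

```python
from html.parser import HTMLParser


class TocStripper(HTMLParser):
    """Re-emit HTML as parsed, skipping the <table id="toc"> subtree."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through as written
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside the TOC table

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "table":       # nested table inside the TOC
                self.skip_depth += 1
            return
        if tag == "table" and dict(attrs).get("id") == "toc":
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "table":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")


def strip_toc(html):
    """Return html with the <table id="toc"> element removed."""
    parser = TocStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string can then be written to the file that is passed to ebook-convert. Counting only `table` tags while skipping keeps the sketch short, at the cost of assuming the tables inside the TOC are properly closed.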