MobileRead Forums

nimblebooks 02-20-2012 02:59 PM

HTML input plugin stripping text within toc tags in child html file
 
2 Attachment(s)
Hi,

Same problem as a while ago, but I have done some more testing. Files attached.

ebook-convert tmp/temptoc.html $mediatargetpath$sku".epub" --max-levels=1 --toc-threshold=6 --cover=$imagedir$sku$cover_image_extension --book-producer="Nimble Combinatorial Publishing" --publisher="Nimble Combinatorial Publishing" --max-toc-links=20 --preserve-cover-aspect-ratio -vv --debug-pipeline="debug" --duplicate-links-in-toc --chapter="/"

From the debug output, I can tell that the conversion is going wrong at the input plugin stage. The following HTML in the source file safe1.html, generated from the API call to http://en.wikipedia.org/w/index.php?...eship_Bismarck

Code:

<table id="toc" class="toc">
<tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"><span class="tocnumber">1</span> <span class="toctext">Construction and characteristics</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#Service_history"><span class="tocnumber">2</span> <span class="toctext">Service history</span></a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"><span class="tocnumber">2.1</span> <span class="toctext">Operation Rheinübung</span></a>
<ul>
<li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"><span class="tocnumber">2.1.1</span> <span class="toctext">Battle of the Denmark Strait</span></a></li>
<li class="toclevel-3 tocsection-5"><a href="#The_chase"><span class="tocnumber">2.1.2</span> <span class="toctext">The chase</span></a></li>
<li class="toclevel-3 tocsection-6"><a href="#Sinking"><span class="tocnumber">2.1.3</span> <span class="toctext">Sinking</span></a></li>
</ul>
</li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"><span class="tocnumber">3</span> <span class="toctext">Media portrayals of sinking</span></a></li>
<li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"><span class="tocnumber">4</span> <span class="toctext">Discovery of the wreck</span></a>
<ul>
<li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"><span class="tocnumber">4.1</span> <span class="toctext">Discovery by Robert Ballard</span></a></li>
<li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"><span class="tocnumber">4.2</span> <span class="toctext">Subsequent expeditions</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"><span class="tocnumber">5</span> <span class="toctext">References in the Wehrmachtbericht</span></a></li>
<li class="toclevel-1 tocsection-12"><a href="#Footnotes"><span class="tocnumber">6</span> <span class="toctext">Footnotes</span></a></li>
<li class="toclevel-1 tocsection-13"><a href="#References"><span class="tocnumber">7</span> <span class="toctext">References</span></a></li>
<li class="toclevel-1 tocsection-14"><a href="#Further_Reading"><span class="tocnumber">8</span> <span class="toctext">Further Reading</span></a></li>
</ul>
</td>
</tr>
</table>

becomes in debug/input/1safe.html:

Code:

<table class="toc" id="toc">
<tbody><tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"> </a></li>
<li class="toclevel-1 tocsection-2"><a href="#Service_history"> </a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"> </a>
<ul>
<li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"> </a></li>
<li class="toclevel-3 tocsection-5"><a href="#The_chase"> </a></li>
<li class="toclevel-3 tocsection-6"><a href="#Sinking"> </a></li>
</ul>
</li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"> </a></li>
<li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"> </a>
<ul>
<li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"> </a></li>
<li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"> </a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"> </a></li>
<li class="toclevel-1 tocsection-12"><a href="#Footnotes"> </a></li>
<li class="toclevel-1 tocsection-13"><a href="#References"> </a></li>
<li class="toclevel-1 tocsection-14"><a href="#Further_Reading"> </a></li>
</ul>
</td>
</tr>
</tbody></table>

after it is passed through the input plugin.
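One way to confirm which links lost their text is to run the source file and the debug output through the same small check. The sketch below is only an illustration using Python's standard library; the class and function names (AnchorTextCollector, empty_links) are made up for this example and are not part of Calibre.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, visible text) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text) tuples
        self._href = None    # href of the <a> currently open, if any
        self._parts = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Text inside nested <span> elements also arrives here.
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._parts).strip()))
            self._href = None


def empty_links(html):
    """Return the hrefs of links whose visible text is empty or whitespace."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if not text]
```

Calling empty_links() on the contents of the source file and then on the file under debug/input should show whether only the TOC links are affected.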

I simplified the TOC html as much as possible and wrapped the simplest possible html around the API html found in source file/safe1.html.

What's happening here?

Any help disentangling this "messy" HTML would be much appreciated!

Fred

kovidgoyal 02-21-2012 02:09 AM

Once again, fix the broken html.

nimblebooks 02-21-2012 08:59 AM

When you say "broken html", what do you mean?

nimblebooks 02-21-2012 04:24 PM

Actually, Calibre is doing remarkably well in parsing the html from the Mediawiki API, with the sole exception of the HTML for this "Contents" infobox, so I don't really see any need for additional changes to the html. Since no additional explanation of how the HTML input plugin works or what it expects seems to be available, I am now simply stripping out the problematic chunk of html from the original document before sending it to Calibre -- which is handling the rest of the document very nicely!
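For anyone wanting to try the same workaround, here is a minimal sketch of stripping the "Contents" table before conversion. It assumes the block is always a table with id="toc" and contains no nested tables, as in the snippet posted above; the names TOC_TABLE and strip_toc are hypothetical, and a regular expression is a shortcut here rather than a general HTML parser.

```python
import re

# Matches <table ... id="toc" ...> through the first closing </table>.
# Assumption: the TOC table never contains another <table> inside it.
TOC_TABLE = re.compile(
    r'<table\b[^>]*\bid="toc"[^>]*>.*?</table>',
    re.DOTALL | re.IGNORECASE,
)


def strip_toc(html):
    """Return html with the MediaWiki "Contents" table removed."""
    return TOC_TABLE.sub("", html)
```

The cleaned string can then be written to the temporary file that is passed to ebook-convert.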


All times are GMT -4. The time now is 10:50 PM.