Old 02-20-2012, 02:59 PM   #1
nimblebooks
Enthusiast
nimblebooks began at the beginning.
 
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
HTML input plugin stripping text within toc tags in child html file

Hi,

This is the same problem I reported a while ago, but I have done some more testing. Files are attached.

ebook-convert tmp/temptoc.html $mediatargetpath$sku".epub" --max-levels=1 --toc-threshold=6 --cover=$imagedir$sku$cover_image_extension --book-producer="Nimble Combinatorial Publishing" --publisher="Nimble Combinatorial Publishing" --max-toc-links=20 --preserve-cover-aspect-ratio -vv --debug-pipeline="debug" --duplicate-links-in-toc --chapter="/"

From the debug output, I can tell that the conversion is getting messed up at the input plugin stage. The following HTML in the source file safe1.html, generated from the API call to http://en.wikipedia.org/w/index.php?...eship_Bismarck

Code:
<table id="toc" class="toc">
<tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"><span class="tocnumber">1</span> <span class="toctext">Construction and characteristics</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#Service_history"><span class="tocnumber">2</span> <span class="toctext">Service history</span></a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"><span class="tocnumber">2.1</span> <span class="toctext">Operation Rheinübung</span></a>
<ul>
<li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"><span class="tocnumber">2.1.1</span> <span class="toctext">Battle of the Denmark Strait</span></a></li>
<li class="toclevel-3 tocsection-5"><a href="#The_chase"><span class="tocnumber">2.1.2</span> <span class="toctext">The chase</span></a></li>
<li class="toclevel-3 tocsection-6"><a href="#Sinking"><span class="tocnumber">2.1.3</span> <span class="toctext">Sinking</span></a></li>
</ul>
</li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"><span class="tocnumber">3</span> <span class="toctext">Media portrayals of sinking</span></a></li>
<li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"><span class="tocnumber">4</span> <span class="toctext">Discovery of the wreck</span></a>
<ul>
<li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"><span class="tocnumber">4.1</span> <span class="toctext">Discovery by Robert Ballard</span></a></li>
<li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"><span class="tocnumber">4.2</span> <span class="toctext">Subsequent expeditions</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"><span class="tocnumber">5</span> <span class="toctext">References in the Wehrmachtbericht</span></a></li>
<li class="toclevel-1 tocsection-12"><a href="#Footnotes"><span class="tocnumber">6</span> <span class="toctext">Footnotes</span></a></li>
<li class="toclevel-1 tocsection-13"><a href="#References"><span class="tocnumber">7</span> <span class="toctext">References</span></a></li>
<li class="toclevel-1 tocsection-14"><a href="#Further_Reading"><span class="tocnumber">8</span> <span class="toctext">Further Reading</span></a></li>
</ul>
</td>
</tr>
</table>
becomes in debug/input/1safe.html:

Code:
<table class="toc" id="toc">
<tbody><tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"> </a></li>
<li class="toclevel-1 tocsection-2"><a href="#Service_history"> </a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"> </a>
<ul>
<li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"> </a></li>
<li class="toclevel-3 tocsection-5"><a href="#The_chase"> </a></li>
<li class="toclevel-3 tocsection-6"><a href="#Sinking"> </a></li>
</ul>
</li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"> </a></li>
<li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"> </a>
<ul>
<li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"> </a></li>
<li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"> </a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"> </a></li>
<li class="toclevel-1 tocsection-12"><a href="#Footnotes"> </a></li>
<li class="toclevel-1 tocsection-13"><a href="#References"> </a></li>
<li class="toclevel-1 tocsection-14"><a href="#Further_Reading"> </a></li>
</ul>
</td>
</tr>
</tbody></table>
after it is passed through the input plugin.
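
The before/after comparison above can also be checked mechanically. The following is a minimal sketch, using only the Python standard library, that reports which links had text in the source HTML but are empty in the debug output. The names `AnchorTextCollector`, `anchor_texts` and `lost_anchor_text` are hypothetical helpers made up for this illustration; they are not part of calibre.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the visible text inside each <a href="..."> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.texts = {}        # href -> accumulated link text
        self._current = None   # href of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._current = href
                self.texts.setdefault(href, "")

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None

    def handle_data(self, data):
        # Text inside nested <span> elements also arrives here.
        if self._current is not None:
            self.texts[self._current] += data


def anchor_texts(html):
    """Map each href to its link text, with whitespace normalised."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return {href: " ".join(text.split()) for href, text in parser.texts.items()}


def lost_anchor_text(before_html, after_html):
    """Return hrefs whose link text was non-empty before and is empty after."""
    before = anchor_texts(before_html)
    after = anchor_texts(after_html)
    return sorted(h for h, t in before.items() if t and not after.get(h, ""))
```

Feeding it the contents of the source file and of the file under debug/input should list every TOC entry shown above, which narrows the problem down to the spans inside the links rather than the links themselves.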

I simplified the TOC HTML as much as possible and wrapped the simplest possible HTML around the API HTML found in source file/safe1.html.

What's happening here?

Any help disentangling this "messy" HTML would be much appreciated!

Fred
Attached Files
File Type: zip input.zip (33.1 KB, 39 views)
File Type: zip source_files.zip (71.6 KB, 48 views)
Old 02-21-2012, 02:09 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 26,469
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Once again, fix the broken html.
Old 02-21-2012, 08:59 AM   #3
nimblebooks
When you say "broken html", what do you mean?
Old 02-21-2012, 04:24 PM   #4
nimblebooks
Actually, Calibre is doing remarkably well at parsing the HTML from the MediaWiki API, with the sole exception of the HTML for this "Contents" infobox, so I don't really see any need for additional changes to the HTML. Since no additional explanation of how the HTML input plugin works or what it expects seems to be available, I am now simply stripping out the problematic chunk of HTML from the original document before sending it to Calibre, which is handling the rest of the document very nicely!
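
For anyone wanting to do the same, here is a minimal sketch of that pre-processing step in Python. It assumes the block to drop is the `<table id="toc" ...>` element shown in the first post and that the table tags in the document are balanced. The function name `strip_toc_table` is made up for this example, and this is not necessarily how my own script does it.

```python
import re

# Matches any opening or closing <table> tag; group(1) is "/" for a close.
TABLE_TAG = re.compile(r"<(/?)table\b[^>]*>", re.IGNORECASE)

# Matches the opening tag of the MediaWiki contents table (id="toc").
TOC_OPEN = re.compile(
    r"""<table\b[^>]*\bid\s*=\s*["']toc["'][^>]*>""", re.IGNORECASE
)


def strip_toc_table(html):
    """Return html with the <table id="toc"> element removed.

    Nested tables inside the TOC are handled by counting open and
    close tags. If no TOC table is found, or the tags never balance,
    the input is returned unchanged.
    """
    start = TOC_OPEN.search(html)
    if start is None:
        return html
    depth = 0
    for tag in TABLE_TAG.finditer(html, start.start()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            # tag is the </table> that closes the TOC table.
            return html[:start.start()] + html[tag.end():]
    return html
```

The cleaned string can then be written to a temporary file and passed to ebook-convert in place of the original HTML.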