#1 |
Enthusiast
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
|
HTML input plugin stripping text within toc tags in child html file
Hi,
Same problem as a while ago, but I have done some more testing. Files attached. The command I am running:

ebook-convert tmp/temptoc.html $mediatargetpath$sku".epub" --max-levels=1 --toc-threshold=6 --cover=$imagedir$sku$cover_image_extension --book-producer="Nimble Combinatorial Publishing" --publisher="Nimble Combinatorial Publishing" --max-toc-links=20 --preserve-cover-aspect-ratio -vv --debug-pipeline="debug" --duplicate-links-in-toc --chapter="/"

From the debug output, I can tell that the conversion is getting messed up in the input plugin stage. This is the HTML in the source file safe1.html, generated from the API call to http://en.wikipedia.org/w/index.php?...eship_Bismarck:
Code:
<table id="toc" class="toc">
  <tr>
    <td>
      <div id="toctitle">
        <h2>Contents</h2>
      </div>
      <ul>
        <li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"><span class="tocnumber">1</span> <span class="toctext">Construction and characteristics</span></a></li>
        <li class="toclevel-1 tocsection-2"><a href="#Service_history"><span class="tocnumber">2</span> <span class="toctext">Service history</span></a>
          <ul>
            <li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"><span class="tocnumber">2.1</span> <span class="toctext">Operation Rheinübung</span></a>
              <ul>
                <li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"><span class="tocnumber">2.1.1</span> <span class="toctext">Battle of the Denmark Strait</span></a></li>
                <li class="toclevel-3 tocsection-5"><a href="#The_chase"><span class="tocnumber">2.1.2</span> <span class="toctext">The chase</span></a></li>
                <li class="toclevel-3 tocsection-6"><a href="#Sinking"><span class="tocnumber">2.1.3</span> <span class="toctext">Sinking</span></a></li>
              </ul>
            </li>
          </ul>
        </li>
        <li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"><span class="tocnumber">3</span> <span class="toctext">Media portrayals of sinking</span></a></li>
        <li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"><span class="tocnumber">4</span> <span class="toctext">Discovery of the wreck</span></a>
          <ul>
            <li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"><span class="tocnumber">4.1</span> <span class="toctext">Discovery by Robert Ballard</span></a></li>
            <li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"><span class="tocnumber">4.2</span> <span class="toctext">Subsequent expeditions</span></a></li>
          </ul>
        </li>
        <li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"><span class="tocnumber">5</span> <span class="toctext">References in the Wehrmachtbericht</span></a></li>
        <li class="toclevel-1 tocsection-12"><a href="#Footnotes"><span class="tocnumber">6</span> <span class="toctext">Footnotes</span></a></li>
        <li class="toclevel-1 tocsection-13"><a href="#References"><span class="tocnumber">7</span> <span class="toctext">References</span></a></li>
        <li class="toclevel-1 tocsection-14"><a href="#Further_Reading"><span class="tocnumber">8</span> <span class="toctext">Further Reading</span></a></li>
      </ul>
    </td>
  </tr>
</table>

After the input plugin stage, the same table comes out with the link text stripped:
Code:
<table class="toc" id="toc">
  <tbody>
    <tr>
      <td>
        <div id="toctitle">
          <h2>Contents</h2>
        </div>
        <ul>
          <li class="toclevel-1 tocsection-1"><a href="#Construction_and_characteristics"> </a></li>
          <li class="toclevel-1 tocsection-2"><a href="#Service_history"> </a>
            <ul>
              <li class="toclevel-2 tocsection-3"><a href="#Operation_Rhein.C3.BCbung"> </a>
                <ul>
                  <li class="toclevel-3 tocsection-4"><a href="#Battle_of_the_Denmark_Strait"> </a></li>
                  <li class="toclevel-3 tocsection-5"><a href="#The_chase"> </a></li>
                  <li class="toclevel-3 tocsection-6"><a href="#Sinking"> </a></li>
                </ul>
              </li>
            </ul>
          </li>
          <li class="toclevel-1 tocsection-7"><a href="#Media_portrayals_of_sinking"> </a></li>
          <li class="toclevel-1 tocsection-8"><a href="#Discovery_of_the_wreck"> </a>
            <ul>
              <li class="toclevel-2 tocsection-9"><a href="#Discovery_by_Robert_Ballard"> </a></li>
              <li class="toclevel-2 tocsection-10"><a href="#Subsequent_expeditions"> </a></li>
            </ul>
          </li>
          <li class="toclevel-1 tocsection-11"><a href="#References_in_the_Wehrmachtbericht"> </a></li>
          <li class="toclevel-1 tocsection-12"><a href="#Footnotes"> </a></li>
          <li class="toclevel-1 tocsection-13"><a href="#References"> </a></li>
          <li class="toclevel-1 tocsection-14"><a href="#Further_Reading"> </a></li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>

The tocnumber and toctext spans, and the text inside them, are gone from every link.

I simplified the TOC HTML as much as possible and wrapped the simplest possible HTML around the API HTML found in source file/safe1.html. What's happening here? Any help disentangling this "messy" HTML would be much appreciated!

Fred
|
#2 |
creator of calibre
Posts: 45,110
Karma: 27110892
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Once again, fix the broken html.
|
#3 |
Enthusiast
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
|
When you say "broken html", what do you mean?
|
#4 |
Enthusiast
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
|
Actually, Calibre is doing remarkably well in parsing the HTML from the MediaWiki API, with the sole exception of the HTML for this "Contents" infobox, so I don't really see any need for additional changes to the HTML. Since no additional explanation of how the HTML input plugin works or what it expects seems to be available, I am now simply stripping out the problematic chunk of HTML from the original document before sending it to Calibre -- which is handling the rest of the document very nicely!
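For anyone wanting to do the same, here is a minimal sketch of that stripping step. It uses only the Python standard library, not anything from Calibre. It assumes the Contents box is always a `<table id="toc">` element, as in the snippet in the first post. The names `TocStripper` and `strip_toc` are made up for illustration.

```python
from html.parser import HTMLParser


class TocStripper(HTMLParser):
    """Re-emit the input HTML, skipping the <table id="toc"> element.

    Assumption: the Contents box is a <table> whose id is "toc".
    """

    def __init__(self, toc_id="toc"):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.toc_id = toc_id
        self.out = []
        self.depth = 0  # > 0 while inside the TOC table (counts nested tables)

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "table":
                self.depth += 1
            return
        if tag == "table" and dict(attrs).get("id") == self.toc_id:
            self.depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "table":
                self.depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.depth:
            self.out.append(f"<!{decl}>")


def strip_toc(html, toc_id="toc"):
    """Return html with the TOC table (and everything inside it) removed."""
    parser = TocStripper(toc_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string can then be written back to the HTML file and passed to ebook-convert exactly as before. A parser-based approach is used here instead of a regular expression because the Contents table contains nested lists and could in principle contain nested tables, which a simple pattern match would handle badly.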
|