MobileRead Forums


p3aul 08-22-2010 06:20 AM

can't generate a toc from an html file
 
I am trying to convert an HTML file (Gideon's Band) to an EPUB from the command line or the GUI, but it won't generate a TOC. The file is in the same format as one I did convert successfully (Bonaventure). Both files are on Project Gutenberg. Here are the URLs:

Gideon's Band: http://www.gutenberg.org/files/19348...-h/19348-h.htm

Bonaventure: http://www.gutenberg.org/files/24078...-h/24078-h.htm

To reiterate, Bonaventure was fine, but Gideon's Band would not generate the TOC. Both files are stated by PG to be in the public domain.

TIA,
Paul

jackie_w 08-22-2010 08:43 AM

Hi Paul,

I've had a look inside the HTML of Gideon's Band. The chapter tagging looks like this:

Code:

  <div class="c1">
    <h1>GIDEON'S BAND</h1>

    <h2>I</h2>

    <h3>THE STEAMBOAT LEVEE</h3>
  </div>

... ...

  <div class="c1">
    <h2>II</h2>

    <h3>THE "VOTARESS"</h3>
  </div>

The easiest way to get a TOC in your EPUB is to

Set [Convert] - [Structure Detection] - 'Detect chapters at' to //h:h2
Set [Convert] - [Table of Contents] - 'Level 1 TOC' to //h:h2 (or you could leave it blank in this particular case)

or you may prefer
Set [Convert] - [Structure Detection] - 'Detect chapters at' to //h:div[re:test(@class, "c1", "i")]
Set [Convert] - [Table of Contents] - 'Level 1 TOC' to //h:h2

or even
Set [Convert] - [Structure Detection] - 'Detect chapters at' to //h:div[re:test(@class, "c1", "i")]
Set [Convert] - [Table of Contents] - 'Level 1 TOC' to //h:div[re:test(@class, "c1", "i")]

All of these worked for me. Good Luck. :)
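For anyone curious what those two expressions actually select, here is a minimal sketch using Python's standard-library ElementTree rather than calibre itself. It assumes the snippet above sits inside an XHTML document, and it stands in a plain `[@class='c1']` test for `re:test(@class, "c1", "i")`, because ElementTree supports only a subset of XPath. The `matched_text` helper is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the content is namespaced XHTML, so "h" maps to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <div class="c1">
    <h1>GIDEON'S BAND</h1>
    <h2>I</h2>
    <h3>THE STEAMBOAT LEVEE</h3>
  </div>
  <p>...</p>
  <div class="c1">
    <h2>II</h2>
    <h3>THE "VOTARESS"</h3>
  </div>
</body></html>"""


def matched_text(root, xpath):
    """Return the whitespace-normalised text of every element the XPath selects."""
    return [" ".join("".join(el.itertext()).split()) for el in root.findall(xpath, NS)]


root = ET.fromstring(SAMPLE)
h2_hits = matched_text(root, ".//h:h2")                 # analogue of //h:h2
div_hits = matched_text(root, ".//h:div[@class='c1']")  # stand-in for the re:test() form

print(h2_hits)
print(div_hits)
```

The h2 expression picks up only the chapter numbers, while the div expression picks up the whole heading block, including the book title in the first chapter's div.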

p3aul 08-23-2010 01:09 AM

Well it did generate all the chapter titles, but not the correct pages! :(

Paul

jackie_w 08-23-2010 07:24 AM

For me to help you further, you need to be more specific about what isn't working.

p3aul 08-25-2010 02:14 AM

OK, I don't know in detail. All I know is that on the command line I type: ebook-convert gideon.html gideon epub. This is supposed to convert the file to EPUB, right? I understand calibre looks for the <h1>, <h2> markup tags to create a TOC, and they are there, but it doesn't create one. I've also tried converting in the GUI, with the same result.

I tried your suggestions, replacing the default settings with yours. I get a complete listing of chapters, but the pages are wrong: most of the entries just take you back to page 3 for some reason.

I converted the book to rtf. This time all I got for a TOC was links to illustrations. Tomorrow I will delete the illustrations entirely and try again.

jackie_w 08-25-2010 10:51 AM

2 Attachment(s)
Firstly, sticking with the GUI for the moment. I've converted the HTML to epub using option 2 of the three I listed above.

I have attached a screencap of the resulting epub when viewed on the PC using the calibre ebook viewer.

When I open the TOC panel (left-hand side 7th button from top), I see a list of all 63 chapters. If I click on one it takes me straight there. Screencap shows Chapter 2 selected.

Once sent to the PRS505 using GUI 'Send to Device', I select the book and press my 505's TOC button (button 5). It lists all the chapters and I can select whichever I want. I have also attached a screencap of the 505's TOC.

Which of these differs from your own experience?

Secondly, if you are trying to use the 'inline TOC' (i.e. the one with hyperlinks which is actually contained in the early pages of the book) then you will find that the HTML has coded the labels BEFORE the <div> and <h2> tags. Consequently when you press a hyperlink it will take you to a point just before your chapter heading and you will need to turn to next page to get to the selected chapter heading. Personally, I find these inline TOCs more trouble than they're worth.

Thirdly, I don't use the commandline version of ebook-convert myself but I do know that it has a large number of options which need to be set to customise your conversion. Here's a link to the relevant part of the User Manual.

... and finally... I'm not sure how removing the images will solve your problem. They show up fine on my 505.

p3aul 08-25-2010 07:30 PM

I confess, I haven't tried your option 2 yet, only 1.

The second time I tried, option 1 produced all the chapter headings, but when I pressed the appropriate key on the 505, it almost always sent me to page 3.

I use ebook-convert because calibre leaves so many child processes running when it exits that it slows down my computer. I tried to just copy the EPUB to the external card on my 505, but that leaves the metadata behind, so I have to use the GUI to copy the EPUB to the 505.

I refer to the manual (ebook-convert) so much that I have a link to it on my Chrome toolbar! Also, using the command line, it's easier to troubleshoot when things go wrong.

I only tried removing the "links" to the images in the html, not the images themselves. I thought if I remove the links, it might fall through to the chapter-headings.
IMPORTANT:
From the Calibre manual:
--level1-toc
XPath expression that specifies all tags that should be added to the Table of Contents at level one. If this is specified, it takes precedence over other forms of auto-detection.

Does this mean a complete XPath expression, as in "Structure Detection" in the GUI's Convert books dialog, or just a partial one like "//h1"?


Thanks,
Paul

jackie_w 08-25-2010 08:26 PM

Quote:

Originally Posted by p3aul (Post 1075305)
I confess, I haven't tried your option 2 yet, only 1.

All 3 options give very similar results. The only reason I used option 2 was that it centred the chapter headings and opt 1 didn't. (Opt 3 adds the chapter name to the TOC - which I thought was a bit cluttered) but it's personal preference.

Quote:

Originally Posted by p3aul (Post 1075305)
The second time I tried, option 1 produced all the chapter headings, but when I pressed the appropriate key on the 505, it almost always sent me to page 3.

I cannot reproduce this problem. It works perfectly for me.

Quote:

Originally Posted by p3aul (Post 1075305)
I use ebook-convert because calibre leaves so many child processes running when it exits that it slows down my computer. I tried to just copy the EPUB to the external card on my 505, but that leaves the metadata behind, so I have to use the GUI to copy the EPUB to the 505.

I refer to the manual (ebook-convert) so much that I have a link to it on my Chrome toolbar! Also, using the command line, it's easier to troubleshoot when things go wrong.

I only tried removing the "links" to the images in the html, not the images themselves. I thought if I remove the links, it might fall through to the chapter-headings.
IMPORTANT:
From the Calibre manual:
--level1-toc
XPath expression that specifies all tags that should be added to the Table of Contents at level one. If this is specified, it takes precedence over other forms of auto-detection.

Does this mean a complete XPath expression, as in "Structure Detection" in the GUI's Convert books dialog, or just a partial one like "//h1"?

As I said, I don't use this method myself, but I tried this as a no-bells-or-whistles commandline approximation to opt 1 and it seems to work:
Code:

ebook-convert "Gideon's Band - George W Cable.zip" gb2.epub --chapter "//h:h2" --level1-toc "//h:h2"
where "Gideon's Band - George W Cable.zip" is the resulting file in my calibre library after drag-drop of the source html file into calibre.
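If you would rather drive that same command from a script, here is a minimal Python sketch. The `--chapter` and `--level1-toc` options and the file names are the ones from the command above; the helper names `build_convert_command` and `run_conversion` are made up for illustration. Passing the arguments as a list means the apostrophe and spaces in the file name need no shell quoting.

```python
import shlex
import shutil
import subprocess


def build_convert_command(source, target, chapter_xpath, toc_xpath):
    """Assemble the ebook-convert argument list used in the post above."""
    return [
        "ebook-convert", source, target,
        "--chapter", chapter_xpath,
        "--level1-toc", toc_xpath,
    ]


def run_conversion(command):
    """Run the conversion, but only if ebook-convert is on the PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("ebook-convert not found on PATH")
    subprocess.run(command, check=True)


command = build_convert_command(
    "Gideon's Band - George W Cable.zip", "gb2.epub", "//h:h2", "//h:h2"
)

# Show the equivalent shell line; call run_conversion(command) to actually convert.
print(shlex.join(command))
```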

DoctorOhh 08-26-2010 05:52 AM

1 Attachment(s)
Quote:

Originally Posted by p3aul (Post 1075305)
I use ebook-convert because calibre leaves so many child processes running when it exits that it slows down my computer. I tried to just copy the EPUB to the external card on my 505, but that leaves the metadata behind, so I have to use the GUI to copy the EPUB to the 505.

Calibre doesn't leave any processes running if you exit the program.

Of course, if you have the "Enable system tray icon" feature checked under Preferences - Interface, then you have to use Ctrl-Q to exit the program. Clicking on the big red X just minimizes calibre to the system tray. See attached.

You can also go to Preferences - Advanced and lower the number of worker processes.

p3aul 08-26-2010 06:30 PM

Jackie:

Quote:

ebook-convert "Gideon's Band - George W Cable.zip" gb2.epub --chapter "//h:h2" --level1-toc "//h:h2"
Now that one did the trick! Just what I was looking for.

I'm curious, though. In all the stuff I've read here, I thought the XPath thingy was just "//h2", not "//h:h2". Is there a reason for typing "//h:h2" and NOT "//h2"? Just curious.

Thanks,
Paul

Walt:
Well, it's impolite to argue, but I know for a fact it does, either way. If Adobe can ignore memory leaks in every version of PS up to CS3, I guess I'll have to put up with the processes. It's the only game in town, and besides, I only use the GUI to transfer the books to my reader. I could use Sony for that, I guess. When neither program is perfect, you have to use a bit of each. It's no secret; Kovid knows the way I feel about the GUI. I'm just thankful for his command-line programs. If you can remember MS-DOS 3.2, the command line isn't so bad.

jackie_w 08-26-2010 08:57 PM

Quote:

Originally Posted by p3aul (Post 1077240)
Jackie:
Now that one did the trick! Just what I was looking for.

Hurrah! :)

Quote:

Originally Posted by p3aul (Post 1077240)
I'm curious, though. In all the stuff I've read here, I thought the XPath thingy was just "//h2", not "//h:h2". Is there a reason for typing "//h:h2" and NOT "//h2"? Just curious.

Er... I have no real understanding of XPath; I have to let the GUI Wizard (the Harry Potter magic wand button) generate my XPath for me. If you select a heading tag it always puts the //h: in front of it. If you select a div tag it becomes //h:div. I guess one would have to set to with an XPath manual to understand it fully.

DoctorOhh 08-26-2010 09:59 PM

Quote:

Originally Posted by p3aul (Post 1077240)
Walt:
Well, it's impolite to argue, but I know for a fact it does, either way. If Adobe can ignore memory leaks in every version of PS up to CS3, I guess I'll have to put up with the processes. It's the only game in town, and besides, I only use the GUI to transfer the books to my reader. I could use Sony for that, I guess. When neither program is perfect, you have to use a bit of each. It's no secret; Kovid knows the way I feel about the GUI. I'm just thankful for his command-line programs. If you can remember MS-DOS 3.2, the command line isn't so bad.

Ok, I guess I can state it leaves no processes running on my Win XP machine. What OS do you run?

p3aul 08-26-2010 11:56 PM

You know, I guess the "h:" is just a signal to the pre-processor that an h tag is coming. I should play with the GUI more, I guess; I had forgotten he had included the Wizard! It really is a wonderful program. The problem with so many of the PG books is that they just use the original's way of marking chapters. If the first word in a chapter heading is "Chapter", it's easy: I just load the text file in Word and do a search and replace, replacing every "Chapter" with "##Chapter", and then convert.

Anyway, Thanks for helping me,
Paul
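The Word search-and-replace described above can also be scripted. What follows is a minimal sketch, not Paul's actual procedure: it assumes the chapter headings start at the beginning of a line, and it deliberately marks only those, so a "Chapter" in the middle of a sentence is left alone. The `mark_chapters` name is made up, and what the converter does with the ## marker depends on its settings and is not shown here.

```python
import re


def mark_chapters(text, marker="##"):
    """Prefix each line that begins with the word 'Chapter' with the marker."""
    # (?m) makes ^ match at the start of every line; the match is case-sensitive.
    return re.sub(r"(?m)^Chapter\b", marker + "Chapter", text)


sample = (
    "Chapter I\n"
    "It was a dark night.\n"
    "The next Chapter of his life began.\n"
    "Chapter II\n"
    "Morning came.\n"
)

marked = mark_chapters(sample)
print(marked)
```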

p3aul 08-27-2010 06:44 AM

Quote:

I'm curious, though. In all the stuff I've read here, I thought the XPath thingy was just "//h2", not "//h:h2". Is there a reason for typing "//h:h2" and NOT "//h2"? Just curious.
From the calibre XPath Tutorial:
Quote:

The h: prefix in the above examples is needed to match XHTML tags. This is because internally, calibre represents all content as XHTML. In XHTML tags have a namespace, and h: is the namespace prefix for HTML tags.
Well, I guess that cleared that up! :smack:
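The effect of the namespace is easy to see outside calibre. Here is a minimal sketch using Python's standard-library ElementTree (not calibre's own XPath engine) on a tiny made-up XHTML document: the unprefixed path finds nothing, while the prefixed one, with h bound to the XHTML namespace, finds both headings.

```python
import xml.etree.ElementTree as ET

# "h" is bound to the XHTML namespace, as the tutorial quote describes.
NS = {"h": "http://www.w3.org/1999/xhtml"}

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h2>I</h2><h2>II</h2>"
    "</body></html>"
)

unprefixed = doc.findall(".//h2")       # looks for h2 in no namespace
prefixed = doc.findall(".//h:h2", NS)   # looks for h2 in the XHTML namespace

print(len(unprefixed), len(prefixed))
print(prefixed[0].tag)  # the tag name carries the namespace URI
```

So "//h2" is not wrong XPath; it simply asks for an h2 element that is in no namespace, and a namespaced XHTML document contains none.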


All times are GMT -4.