Old 02-19-2011, 11:21 PM   #1
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Chapter Detection Tutorial

Since this seems to come up over and over again, I think it might be good to have a tutorial. Please provide feedback and once it looks good I'll sticky it.

Getting Calibre to detect chapters properly and build a TOC requires some relatively simple examination of your book's html source code. This is required, though, if you want to be able to navigate chapters on your reader - e.g. the TOC viewer on epub readers, or the Kindle's 5-way controller.

Under the conversion options, go to Search and Replace. Click one of the magic wands on the right half of the screen. If you have multiple source formats Calibre will ask you to choose one - be sure to choose the correct one. Your book's html code will pop up in a new window.

Start scanning through the html code for your chapter headings. You can generally find one quite easily, but if you're having trouble, try searching for the plain text that you see when viewing the chapter heading in an ebook reader or web browser.

There are two basic situations you'll run into at this point - the book has clearly defined chapter headings, or it doesn't. There are different ways of handling each case.



Well defined chapter headings:
A well defined chapter heading will typically have code that looks something like this:
Code:
<div class="chapter"></div><div><h3><a name="ch05" id="ch05">5</a> <br /><br/><br /></h3></div><p class="fl1">My nagging got the better o
In this case the chapter heading is just the number '5'. In this example, all the book's headings are just numbers like this. When you look through the html code you can see these are wrapped with '<h3>' heading tags:
Code:
<h3 class="calibre6"><a name="ch05" class="calibre9" id="ch05">5</a> <br class="calibre3"/><br class="calibre3"/><br class="calibre3"/></h3>
Other books could use <h1>, <h2>, <h4>, etc - this is why the source code needs to be examined - to figure out what's being used.

There is a box in the structure detection panel of the conversion options where you can configure an xpath to detect chapters. The default is this:
Code:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
Note that this only looks for h1 or h2 tags, but in our example we need h3 tags. It also has a regex that looks for the words chapter, book, section, or part, but we need numbers, which can be represented as '\d+'. If your book's chapters just use varying words then you could use '.*'
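
To see why the default regex misses a purely numeric heading, here is a minimal sketch in Python. It only exercises the two regex patterns quoted above with Python's `re` module; it is not Calibre's own code, and the helper name `heading_matches` is made up for illustration. It assumes the regex test is an unanchored, case-insensitive search.

```python
import re

# Patterns copied from the xpath expressions in this tutorial.
DEFAULT_PATTERN = r"chapter|book|section|part\s+"
NUMERIC_PATTERN = r"\d+"


def heading_matches(pattern, heading_text):
    """Return True if the pattern is found anywhere in the heading text.

    Assumption: the 'i' flag in the xpath corresponds to re.IGNORECASE.
    """
    return re.search(pattern, heading_text, re.IGNORECASE) is not None


if __name__ == "__main__":
    for text in ("5", "Chapter 2", "Seventeen"):
        print(text,
              heading_matches(DEFAULT_PATTERN, text),
              heading_matches(NUMERIC_PATTERN, text),
              heading_matches(r".*", text))
```

Keep in mind that '.*' matches any heading text at all, including an empty one, so it is only safe when the tag itself is used for nothing but chapter headings.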

So we can just change that xpath to this:
Code:
//*[((name()='h1' or name()='h3') and re:test(., '\d+', 'i')) or @class = 'chapter']
And now Calibre will create a TOC. If your book uses <h4>, <h5> or something else, change the xpath appropriately.

If all the chapter tags in the book are h3 tags or similar, you could also click on the little magic wand icon next to the xpath, and just type 'h3' or its equivalent into the first box - even simpler.
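
If you want to check what the modified expression would select before running a conversion, the logic can be roughly emulated in plain Python. This is only a sketch using the standard library, not how Calibre evaluates the xpath; the function name `find_chapters` and the placeholder paragraph text are made up, and the sample is a tidied, well-formed version of the heading shown earlier.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed sample based on the heading above; the paragraph text is a placeholder.
SAMPLE = """<body>
<div class="chapter"></div>
<div><h3><a name="ch05" id="ch05">5</a> <br /><br/><br /></h3></div>
<p class="fl1">Body text goes here.</p>
</body>"""


def find_chapters(xml_text, tags=("h1", "h3"), pattern=r"\d+"):
    """Emulate: (tag is h1 or h3 AND text matches regex) OR class == 'chapter'."""
    root = ET.fromstring(xml_text)
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for el in root.iter():
        text = "".join(el.itertext())  # full text content, like '.' in the xpath
        if (el.tag in tags and regex.search(text)) or el.get("class") == "chapter":
            hits.append((el.tag, text.strip()))
    return hits


if __name__ == "__main__":
    print(find_chapters(SAMPLE))
```

On this sample both the empty <div class="chapter"> and the <h3> are selected, because the expression also accepts anything with class 'chapter'. If that gives you duplicate or blank entries, dropping the "or @class = 'chapter'" part of the expression is one thing to try.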


Poorly defined chapter headings:
Here's an example of a poorly defined chapter heading:
Code:
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;; color:black"></span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;; color:black">Chapter 2</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; line-height:normal"><span style="font-size:14.0pt;font-family:&quot;Times New Roman&quot;; mso-fareast-font-family:&quot;Times New Roman&quot;;color:black">The incredulous look must have been plain on my face. As she realized how her offer sounded, her
In this case, the chapter is just in a <p> tag, which is the same way plain text is treated in most ebooks. Getting Calibre to create a TOC with the same technique we used before won't work.

Generally the best solution for this type of chapter heading is to go into the Heuristic Processing panel of Calibre's conversion options and enable Heuristics. Heuristics will search for common types of chapter headings and wrap them with <h2> tags.

Now you can go into structure detection, click the magic wand next to the Chapter detection xpath, and just type 'h2' into the first box. Calibre should create a table of contents for this type of scenario.
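
To give a feel for what "wrapping a heading in <h2> tags" means, here is a very rough Python sketch. It is not Calibre's Heuristics code, which handles many more heading styles; the regex and the function name `promote_headings` are assumptions made up for this example, and it only handles a paragraph containing nothing but "Chapter" plus a number.

```python
import re

# Assumption: a heading paragraph holds only "Chapter N", optionally inside one <span>.
CHAPTER_P = re.compile(
    r"<p\b[^>]*>\s*(?:<span\b[^>]*>)?\s*(Chapter\s+\d+)\s*(?:</span>)?\s*</p>",
    re.IGNORECASE,
)


def promote_headings(html):
    """Replace matching <p> chapter lines with <h2> headings."""
    return CHAPTER_P.sub(r"<h2>\1</h2>", html)


if __name__ == "__main__":
    sample = (
        '<p class="MsoNormal" align="center"><span style="color:black">Chapter 2</span></p>'
        '<p class="MsoNormal"><span>The incredulous look</span></p>'
    )
    print(promote_headings(sample))
```

Once the headings are real <h2> tags, detecting them is the same job as in the well defined case.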


Nothing worked, I'm getting Desperate
If neither of the above solutions works for you, convert to epub and edit your book in Sigil. Using Sigil you can mark your chapter headings manually (or possibly using Sigil's search and replace). Once you've finished, use Calibre to convert your new epub to your desired destination format - Calibre will preserve the TOC that was created by Sigil when it converts to the new format.

Last edited by ldolse; 02-19-2011 at 11:32 PM.
Old 02-20-2011, 01:06 PM   #2
theducks
Well trained by Cats
Posts: 29,689
Karma: 54369090
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
The above examples at least had the key word "Chapter", which I believe Calibre's factory defaults will trap correctly.

Other common chapter headings: just digits or Roman numerals, which are fairly easy to spot and trap with a regex.

If JUST the numbers are spelled out ("Seventeen"), or the chapters just have words ("A Golden Harvest"), things get tricky. Hopefully you will find a unique pattern before and after that you can trap on.

(I use trap, to mean a Search pattern match)
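
As an illustration of trapping digit or Roman numeral headings, here is a small Python sketch. The pattern is an assumption of mine, not something taken from Calibre, and it is deliberately loose: it accepts any run of the letters I, V, X, L, C, D, M, so a one-word heading like "MIX" would also be trapped.

```python
import re

# Either a run of digits or a run of (upper case) Roman numeral letters.
NUMERIC_HEADING = re.compile(r"(?:\d+|[IVXLCDM]+)")


def is_numeric_heading(text):
    """True if the whole heading is digits or Roman numeral letters."""
    return NUMERIC_HEADING.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for heading in ("12", "XIV", "Seventeen", "A Golden Harvest"):
        print(heading, is_numeric_heading(heading))
```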
Old 02-21-2011, 02:59 AM   #3
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
The hardest case seems to be p tags around upper case ONE, TWO, THREE etc., progressing up to TWENTY-ONE ....
I worked out the letter set one time, which is a subset of the upper case alphabet, then used find & replace (stepping through manually to be safe). The "can't be bothered to do that properly" set is [-E-Y]
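
The letter set can be worked out mechanically. The following Python sketch is my own illustration, with made-up names such as `NUMBER_WORDS` and `is_number_heading`: it collects the letters used in ONE through TWENTY-ONE and builds both an exact character class and a lazy one that simply spans from the lowest to the highest letter.

```python
import re

NUMBER_WORDS = [
    "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN", "EIGHT", "NINE",
    "TEN", "ELEVEN", "TWELVE", "THIRTEEN", "FOURTEEN", "FIFTEEN", "SIXTEEN",
    "SEVENTEEN", "EIGHTEEN", "NINETEEN", "TWENTY", "TWENTY-ONE",
]

# Every distinct letter that appears in the words above, sorted.
letters = sorted({ch for word in NUMBER_WORDS for ch in word if ch.isalpha()})

# Exact class: hyphen plus only the letters actually used.
exact_class = "[-" + "".join(letters) + "]"
# Lazy class: hyphen plus the whole range between the lowest and highest letter.
lazy_class = f"[-{letters[0]}-{letters[-1]}]"


def is_number_heading(text, char_class=exact_class):
    """True if the text consists only of characters from the class."""
    return re.fullmatch(char_class + "+", text) is not None


if __name__ == "__main__":
    print(exact_class, lazy_class)
    print(is_number_heading("TWENTY-ONE"), is_number_heading("CHAPTER"))
```

Either class will still match ordinary short upper case words built from the same letters (for example "SHE"), which is a good reason to step through the replacements manually.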
Old 02-21-2011, 03:26 AM   #4
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Maybe the tutorial didn't make handling that kind of situation clear. If your chapters use words describing numbers, like your example or theducks' example, then you can use '.*' or '[A-Z' ]+' in the chapter detection xpath.

Alternatively, if all chapters use the exact same heading tag, and that heading isn't used elsewhere, you can just configure Calibre to build the TOC based 'only' on the heading tag itself, regardless of the contents of the tag.

I'll try to make those cases a bit more obvious.

If the chapter headings are all lower case numbers in <p> tags you're sort of out of luck, e.g.:
<p>Seventeen</p>

There isn't any good pattern there except for the fact that it's a short word/phrase without punctuation. When all else fails Heuristics will actually look for points like that and add page breaks, but it won't wrap them in <h2> tags because of a higher chance of false positives.

However if it's like this:
<h3>Seventeen</h3>

Then you can just use '.*' as your regex, or bypass regex altogether and just use '//h:h3' in the xpath box.
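
For anyone curious what a tag-only expression like '//h:h3' selects, here is a sketch using Python's standard library. It assumes the 'h' prefix is bound to the XHTML namespace; the sample document and the function name `h3_headings` are made up for illustration, and this is not Calibre's own code.

```python
import xml.etree.ElementTree as ET

# Assumption: 'h' stands for the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>Intro</p><h3>Seventeen</h3><p>Text</p><h3>Eighteen</h3>"
    "</body></html>"
)


def h3_headings(xhtml_text):
    """Return the text of every h3 element, regardless of its contents."""
    root = ET.fromstring(xhtml_text)
    return ["".join(el.itertext()).strip() for el in root.findall(".//h:h3", NS)]


if __name__ == "__main__":
    print(h3_headings(SAMPLE))
```

Because no regex is involved, every h3 becomes a TOC entry, so this only works well when h3 is used for chapter headings and nothing else.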

Last edited by ldolse; 02-21-2011 at 03:33 AM.
Old 02-21-2011, 03:44 AM   #5
Piper_
~~~~~
Posts: 761
Karma: 1278391
Join Date: Aug 2010
Location: USA
Device: Kindle 3, Sony 350
I was just about to crash when I saw this thread, ldolse. I'm too brain dead to think straight right now, but I look forward to reading it tomorrow, and thank you for doing it!
Old 02-21-2011, 03:48 AM   #6
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
can you add the case of images being used as chapter headers to your excellent tutorial also please ( as per my current other thread on this )
Old 02-21-2011, 04:41 AM   #7
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I'll add images, though I have a hunch that there isn't any really clean way to handle that case. You can build a TOC exactly as I described in the other thread, but the entries themselves in the TOC will be empty. I don't know if Calibre will look at 'title' or 'alt' attributes if they exist; I sort of doubt it. This might be enough for a lot of Kindle users though, as the 5-way controller will still work. In Sigil it's possible by using 'title' attributes, but of course that requires hand editing.
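
Purely to illustrate the idea of falling back to 'title' or 'alt' when a heading contains only an image, here is a Python sketch. This is a hypothetical helper (`toc_label`), not something Calibre or Sigil is known to do, and the file name in the example is made up.

```python
import xml.etree.ElementTree as ET


def toc_label(heading):
    """Pick a TOC label: heading text first, then an image's title or alt."""
    text = " ".join("".join(heading.itertext()).split())
    if text:
        return text
    for img in heading.iter("img"):
        label = img.get("title") or img.get("alt")
        if label:
            return label
    return ""  # nothing usable: the TOC entry would be empty


if __name__ == "__main__":
    heading = ET.fromstring('<h2><img src="ch1.png" alt="Chapter One"/></h2>')
    print(toc_label(heading))
```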
Old 02-25-2011, 04:20 PM   #8
qxlooper
Member
Posts: 12
Karma: 10
Join Date: Feb 2011
Newbie here and have tried following this post. Using a PDF, I am trying to get a TOC.
I looked and I have <h2> </h2>, but there is a span in there also.

Any help would greatly be appreciated!
Old 02-25-2011, 04:24 PM   #9
qxlooper
Member
Posts: 12
Karma: 10
Join Date: Feb 2011
Here is what I am looking at. Please help!

<p><font size="+1"><span class="bold">Chapter 1 Flight Crew Duties and Responsibilities Section 1 Normal Operations</span> Volume 2</font></p>
<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>
Old 02-25-2011, 04:39 PM   #10
qxlooper
Member
Posts: 12
Karma: 10
Join Date: Feb 2011
Also this from Sigil:

<h2 class="calibre5" id="calibre_pb_15">Chapter 1A Station Information</h2>
Old 02-25-2011, 04:54 PM   #11
theducks
Well trained by Cats
Posts: 29,689
Karma: 54369090
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by qxlooper View Post
Here is what I am looking at. Please help!

<p><font size="+1"><span class="bold">Chapter 1 Flight Crew Duties and Responsibilities Section 1 Normal Operations</span> Volume 2</font></p>
<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>
I am not sure what you want.

If you edit in Sigil:
The first is just a paragraph
If you want to make it a Level 1, replace:
Code:
<p><font size="+1"><span class="bold">
with
Code:
<h1>
and change the trailing </p> to </h1>, deleting the now-unmatched </span> and </font> as well.

The second is ALREADY a valid second-level heading (the level is 2 if it comes after an H1 somewhere, otherwise it is top level).
Sigil will build your TOC from the h1, h2 and h3 tags.
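
The "level 2 if after an H1, otherwise top level" rule can be sketched in Python as below. This is only my illustration of that rule using a simple stack; it is not Sigil's actual TOC code, the function name `outline` is made up, and the regex-based heading search assumes headings are not nested inside each other.

```python
import re

HEADING = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def outline(html):
    """Return (depth, text) pairs; depth depends on the headings seen before."""
    stack = []    # heading levels currently "open"
    entries = []
    for match in HEADING.finditer(html):
        level = int(match.group(1))
        # Strip inner tags and collapse whitespace to get the entry text.
        text = " ".join(re.sub(r"<[^>]+>", "", match.group(2)).split())
        while stack and stack[-1] >= level:
            stack.pop()
        entries.append((len(stack) + 1, text))
        stack.append(level)
    return entries


if __name__ == "__main__":
    sample = (
        '<h2 id="filepos167405"><span class="bold">Chapter 1 '
        '<a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>'
    )
    print(outline(sample))
```

With no h1 anywhere before it, the h2 from the post above comes out as a top-level entry; put an h1 in front of it and the same h2 becomes level 2.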
Old 02-25-2011, 04:55 PM   #12
theducks
Well trained by Cats
Posts: 29,689
Karma: 54369090
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by qxlooper View Post
Also this from Sigil:

<h2 class="calibre5" id="calibre_pb_15">Chapter 1A Station Information</h2>
Valid for a level 2 TOC
Old 02-25-2011, 06:22 PM   #13
qxlooper
Member
Posts: 12
Karma: 10
Join Date: Feb 2011
I guess what I am asking is how do I do this? I have no clue how to make it happen in either of the programs. In Calibre, I do the test function and it finds every h in the document. What do I put, and where?

The things I posted are from the file. Don't know how to capture the chapter parts. I posted what I was seeing, not knowing how to capture it to make a TOC.

Thanks for helping the newbie!

Keith

I want to just capture the chapters. There are subsections, but just want the chapters to start, maybe the sub sections later. But for now just the chapters.

Last edited by qxlooper; 02-25-2011 at 06:25 PM.
Old 02-25-2011, 09:08 PM   #14
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Your other posts are a bit confusing, as you're talking about using Sigil and Calibre but you're not saying where any of the examples you're pasting in came from - i.e. 'exactly' how you got that text and pasted it into these forums.

If this line:
<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>

Was created by/seen using Calibre, then the default chapter detection xpath will work - you shouldn't have to make any changes from defaults. However if you did that using Sigil then you need to finish what you started and do that for all the chapters.


You're not going to be able to easily avoid the sub-chapters - are you saying that you're successfully getting a TOC from Calibre but it's the sub-chapters you don't want?

Last edited by ldolse; 02-26-2011 at 12:17 AM.
Old 02-25-2011, 11:54 PM   #15
qxlooper
Member
Posts: 12
Karma: 10
Join Date: Feb 2011
I am getting nothing! I get a TOC with a start page and nothing else. That is why I am lost as to what I should do. The first is from Calibre, the second from Sigil. I tried the default in Calibre and it bookmarked everything except what I wanted it to, and created 2000+ bookmarks because anything that had an h in it was bookmarked.

I guess I just need an idiot's guide to making a TOC from a pdf/epub. If anyone wants the challenge, I would be willing to pay to have it done!

Keith