02-19-2011, 11:21 PM   #1
ldolse
Chapter Detection Tutorial

Since this seems to come up over and over again, I think it might be good to have a tutorial. Please provide feedback and once it looks good I'll sticky it.

Getting Calibre to detect chapters correctly and build a TOC requires some relatively simple examination of your book's html source code. This is required, though, if you want to be able to navigate chapters on your reader - e.g. with the TOC viewer on epub readers, or the Kindle's 5-way controller.

Under the conversion options, go to Search and Replace. Click one of the magic wands on the right half of the screen. If you have multiple source formats Calibre will ask you to choose one - be sure to choose the correct one. Your book's html code will pop up in a new window.

Start scanning through the html code for your chapter headings. You can generally find one quite easily, but if you're having trouble, try searching for the plain text that you see when viewing the chapter heading in an ebook reader/web browser.

There are two basic situations you'll run into at this point - the book has clearly defined chapter headings, or it doesn't. There are different ways of handling each case.



Well defined chapter headings:
A well defined chapter heading will typically have code that looks something like this:
Code:
<div class="chapter"></div><div><h3><a name="ch05" id="ch05">5</a> <br /><br/><br /></h3></div><p class="fl1">My nagging got the better o
In this case the chapter heading is just the number '5'. In this example, all the book's headings are just numbers like this. When you look through the html code you can see these are wrapped with '<h3>' heading tags:
Code:
<h3 class="calibre6"><a name="ch05" class="calibre9" id="ch05">5</a> <br class="calibre3"/><br class="calibre3"/><br class="calibre3"/></h3>
Other books could use <h1>, <h2>, <h4>, etc - this is why the source code needs to be examined - to figure out what's being used.
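
If the book is long and you'd rather not eyeball the whole source, a short script can tally which heading tags actually appear. This is only a sketch using Python's standard library; it is not part of Calibre, and the names HeadingCounter and count_headings are made up for illustration. The sample string is the chapter heading snippet from above (with a closing </p> added so it is complete).

```python
from collections import Counter
from html.parser import HTMLParser


class HeadingCounter(HTMLParser):
    """Count how often each heading tag (h1..h6) opens in a document."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag in self.HEADINGS:
            self.counts[tag] += 1


def count_headings(html_text):
    """Return a dict such as {'h3': 12} for the given html source."""
    parser = HeadingCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)


# Hypothetical sample: the chapter heading snippet shown earlier.
sample = (
    '<div class="chapter"></div><div><h3><a name="ch05" id="ch05">5</a>'
    ' <br /><br/><br /></h3></div><p class="fl1">My nagging got the better o</p>'
)
print(count_headings(sample))
```

A tag that shows up roughly once per chapter is a good candidate for the chapter xpath. A tag that shows up hundreds of times is probably being used for something else.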

There is a box in the structure detection panel of the conversion options where you can configure an xpath to detect chapters. The default is this:
Code:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
Note that this only looks for h1 or h2 tags, but in our example we need h3 tags. It also has a regex that looks for the words chapter, book, section, or part, but we need numbers, which can be represented as '\d+'. If your book's chapter headings just use varying words, then you could use '.*'

So we can just change that xpath to this:
Code:
//*[((name()='h1' or name()='h3') and re:test(., '\d+', 'i')) or @class = 'chapter']
And now Calibre will create a TOC. If your book uses <h4>, <h5> or something else, change the xpath appropriately.
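
To see why the regex part had to change, you can try the patterns against a few heading texts outside of Calibre. This is a rough sketch that assumes re:test with the 'i' flag behaves like a case-insensitive Python re.search. The helper name heading_matches and the sample headings are made up for illustration.

```python
import re

# The regex from the default xpath, and the replacements discussed above.
DEFAULT_PATTERN = r"chapter|book|section|part\s+"
NUMERIC_PATTERN = r"\d+"
ANYTHING_PATTERN = r".*"


def heading_matches(pattern, heading_text):
    """Approximate re:test(., pattern, 'i') with a case-insensitive search."""
    return re.search(pattern, heading_text, re.IGNORECASE) is not None


# Hypothetical heading texts: a bare number, a classic heading, a plain word.
for text in ["5", "Chapter 2", "Epilogue"]:
    print(
        repr(text),
        "default:", heading_matches(DEFAULT_PATTERN, text),
        "numeric:", heading_matches(NUMERIC_PATTERN, text),
        "anything:", heading_matches(ANYTHING_PATTERN, text),
    )
```

The bare '5' is only picked up by '\d+' and '.*', which is why the default xpath finds nothing in the example book. 'Epilogue' is only picked up by '.*'.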

If all the chapter tags in the book are h3 tags or similar, you could also click on the little magic wand icon next to the xpath, and just type 'h3' or its equivalent into the first box - even simpler.


Poorly defined chapter headings:
Here's an example of a poorly defined chapter heading:
Code:
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;; color:black"></span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;; color:black">Chapter 2</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; line-height:normal"><span style="font-size:14.0pt;font-family:&quot;Times New Roman&quot;; mso-fareast-font-family:&quot;Times New Roman&quot;;color:black">The incredulous look must have been plain on my face. As she realized how her offer sounded, her
In this case, the chapter is just in a <p> tag, which is the same way plain text is treated in most ebooks. Getting Calibre to create a TOC with the same technique we used before won't work.

Generally the best solution for this type of chapter heading is to go into the Heuristic Processing panel of Calibre's conversion options and enable Heuristics. Heuristics will search for common types of chapter headings and wrap them with <h2> tags.
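
To give a feel for what that wrapping step means, here is a very rough sketch. It is not Calibre's actual heuristics code, which I'd expect to handle many more heading styles. The pattern and the function name wrap_chapter_headings are my own assumptions, and it only handles a paragraph whose entire text is 'Chapter <number>', optionally inside a single span. The sample is a shortened version of the markup above.

```python
import re

# Hypothetical pattern: a <p> whose only visible text is "Chapter <number>",
# optionally wrapped in a single <span>.
CHAPTER_PARAGRAPH = re.compile(
    r"<p\b[^>]*>\s*(?:<span\b[^>]*>)?\s*(Chapter\s+\d+)\s*(?:</span>)?\s*</p>",
    re.IGNORECASE,
)


def wrap_chapter_headings(html_text):
    """Replace matching paragraphs with an <h2> holding just the heading text."""
    return CHAPTER_PARAGRAPH.sub(r"<h2>\1</h2>", html_text)


# Shortened, hypothetical version of the Word-style markup shown above.
sample = (
    '<p class="MsoNormal" align="center"><span style="color:black">Chapter 2</span></p>'
    '<p class="MsoNormal"><span style="color:black">The incredulous look must '
    'have been plain on my face.</span></p>'
)
print(wrap_chapter_headings(sample))
```

The heading paragraph becomes an <h2> while the ordinary paragraph is left alone, so an xpath that looks for h2 tags then has something to find.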

Now you can go into structure detection, click the magic wand next to the Chapter detection xpath, and just type 'h2' into the first box. Calibre should create a table of contents for this type of scenario.


Nothing worked, I'm getting Desperate
If neither of the above solutions is working for you, convert to epub and edit your book in Sigil. Using Sigil you can mark your chapter headings manually (or possibly using Sigil's search and replace). Once you've finished, use Calibre to convert your new epub to your desired destination format - Calibre will preserve the TOC that was created by Sigil when it converts to the new format.

Last edited by ldolse; 02-19-2011 at 11:32 PM.