View Full Version : Chapter Detection Tutorial
ldolse 02-19-2011, 11:21 PM Since this seems to come up over and over again, I think it might be good to have a tutorial. Please provide feedback and once it looks good I'll sticky it.
Getting Calibre to appropriately detect Chapters and build a TOC requires some relatively simple examination of your book's html source code. This is required, though, if you want to be able to navigate Chapters on your Reader - e.g. the TOC viewer on epub readers, or the Kindle's 5-way controller.
Under the conversion options, go to Search and Replace. Click one of the magic wands on the right half of the screen. If you have multiple source formats Calibre will ask you to choose one - be sure to choose the correct one. Your book's html code will pop up in a new window.
Start scanning through the html code for your chapter headings. You can generally find one quite easily, but if you're having trouble try searching for the plain text that you see when viewing the chapter heading in an ebook reader/web browser.
There are two basic situations you'll run into at this point - the book has clearly defined chapter headings, or it doesn't. There are different ways of handling each case.
Well defined chapter headings:
A well defined chapter heading will typically have code that looks something like this:
<div class="chapter"></div><div><h3><a name="ch05" id="ch05">5</a> <br /><br/><br /></h3></div><p class="fl1">My nagging got the better o
In this case the chapter heading is just the number '5'. In this example, all the book's headings are just numbers like this. When you look through the html code you can see these are wrapped with '<h3>' heading tags:
<h3 class="calibre6"><a name="ch05" class="calibre9" id="ch05">5</a> <br class="calibre3"/><br class="calibre3"/><br class="calibre3"/></h3>
Other books could use <h1>, <h2>, <h4>, etc - this is why the source code needs to be examined - to figure out what's being used.
There is a box in the structure detection panel of conversion where you can configure an xpath to detect chapters. The default is this:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
Note that this only looks for h1 or h2 tags, but in our example we need h3 tags. It also has a regex that looks for the words chapter, book, section, or part, but we need numbers, which can be represented as '\d+'. If your book's chapters just use varying words then you could use '.*'.
So we can just change that xpath to this:
//*[((name()='h1' or name()='h3') and re:test(., '\d+', 'i')) or @class = 'chapter']
And now Calibre will create a TOC. If your book uses <h4>, <h5> or something else, change the xpath appropriately.
If all the chapter tags in the book are h3 tags or similar, you could also click on the little magic wand icon next to the xpath, and just type 'h3' or its equivalent into the first box - even simpler.
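To see why the default misses the numeric h3 headings while the modified expression catches them, here is a rough Python sketch that imitates the logic of the two xpaths. It is not Calibre's code: ChapterFinder, find_chapters and the SAMPLE markup are made up for illustration, and it assumes re:test behaves like a case-insensitive search over the element's text.

```python
import re
from html.parser import HTMLParser

DEFAULT_TAGS = {'h1', 'h2'}
DEFAULT_REGEX = r'chapter|book|section|part\s+'
VOID_TAGS = {'br', 'img', 'hr', 'meta', 'link', 'input'}


class ChapterFinder(HTMLParser):
    """Collects (tag, text) for elements selected by the imitated rule:
    (tag in tags and regex found in the element's text) or class == 'chapter'."""

    def __init__(self, tags, regex):
        super().__init__()
        self.tags = tags
        self.regex = re.compile(regex, re.IGNORECASE)  # imitates the 'i' flag
        self.stack = []      # open elements as [tag, class attribute, text parts]
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # elements such as <br/> never hold text
        self.stack.append([tag, dict(attrs).get('class'), []])

    def handle_data(self, data):
        # Text counts towards every element that is currently open.
        for entry in self.stack:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return
        name, css_class, parts = self.stack.pop()
        text = ''.join(parts)
        if (name in self.tags and self.regex.search(text)) or css_class == 'chapter':
            self.chapters.append((name, text.strip()))


def find_chapters(html, tags, regex):
    finder = ChapterFinder(tags, regex)
    finder.feed(html)
    finder.close()
    return finder.chapters


# Hypothetical markup modelled on the numeric h3 headings in the tutorial.
SAMPLE = (
    '<h3 class="calibre6"><a id="ch05">5</a> <br class="calibre3"/></h3>'
    '<p class="fl1">Body text.</p>'
    '<h3 class="calibre6"><a id="ch06">6</a></h3>'
)

print(find_chapters(SAMPLE, DEFAULT_TAGS, DEFAULT_REGEX))  # default rule
print(find_chapters(SAMPLE, {'h1', 'h3'}, r'\d+'))         # modified rule
```

With the default tags and regex nothing is selected from this sample; switching to h3 and '\d+' picks up both headings.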
Poorly defined chapter headings:
Here's an example of a poorly defined chapter heading:
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:"Times New Roman";mso-fareast-font-family:"Times New Roman"; color:black"></span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:"Times New Roman";mso-fareast-font-family:"Times New Roman"; color:black">Chapter 2</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; line-height:normal"><span style="font-size:14.0pt;font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";color:black">The incredulous look must have been plain on my face. As she realized how her offer sounded, her
In this case, the chapter is just in a <p> tag, which is the same way plain text is treated in most ebooks. Getting Calibre to create a TOC with the same technique we used before won't work.
Generally the best solution for this type of chapter heading is to go into the Heuristic Processing panel of Calibre's conversion options and enable Heuristics. Heuristics will search for common types of chapter headings and wrap them with <h2> tags.
Now you can go into structure detection, click the magic wand next to the Chapter detection xpath, and just type 'h2' into the first box. Calibre should create a table of contents for this type of scenario.
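As a very rough illustration of the idea (this is not how Calibre's Heuristics are actually implemented), the following Python sketch promotes a paragraph whose only text is "Chapter <number>" to an <h2> heading. The pattern CHAPTER_P, the function promote_chapter_paragraphs and the sample markup are all hypothetical.

```python
import re

# Hypothetical pattern: a <p>, optionally holding one <span>, whose only
# visible text is "Chapter <number>".
CHAPTER_P = re.compile(
    r'<p\b[^>]*>\s*(?:<span\b[^>]*>)?\s*(Chapter\s+\d+)\s*(?:</span>)?\s*</p>',
    re.IGNORECASE,
)


def promote_chapter_paragraphs(html):
    """Rewrite matching paragraphs as <h2> headings, dropping the inline styling."""
    return CHAPTER_P.sub(r'<h2>\1</h2>', html)


sample = (
    '<p class="MsoNormal" align="center"><span style="color:black">Chapter 2</span></p>'
    '<p class="MsoNormal"><span style="color:black">Body text.</span></p>'
)

print(promote_chapter_paragraphs(sample))
```

Once the headings are real <h2> tags, the 'h2' setting in structure detection has something to latch on to.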
Nothing worked, I'm getting Desperate
If neither of the above solutions is working for you, convert to epub and edit your book in Sigil. Using Sigil you can mark your Chapter headings manually (or possibly using Sigil's search and replace). Once you've finished, use Calibre to convert your new epub to your desired destination format - Calibre will preserve the TOC that was created by Sigil when it converts to the new format.
theducks 02-20-2011, 01:06 PM The above examples at least had the key word "Chapter", which I believe Calibre's factory defaults will trap correctly :thumbsup:
Other common chapter headings: just digits or Roman numerals, which are fairly easy to spot and trap with a regex.
If JUST the numbers are spelled out ("Seventeen"), or the chapters just have words ("A Golden Harvest"), things get tricky. Hopefully you will find a unique pattern before and after that you can trap on.
(I use trap, to mean a Search pattern match)
cybmole 02-21-2011, 02:59 AM the hardest case seems to be p tags around upper case ONE TWO THREE etc, progressing up to TWENTY-ONE ....
I worked out the letter set one time, which is a subset of the upper case alphabet, then used find & replace ( stepping through manually to be safe) the "can't be bothered to do that properly" set is [-E_Y]
ldolse 02-21-2011, 03:26 AM Maybe the tutorial didn't make handling that kind of situation clear. If your chapters use words describing numbers, as in your example or theducks' example, then you can use '.*' or '[A-Z' ]+' in the chapter detection xpath.
Alternatively, if all chapters use the exact same heading tag, and that heading isn't used elsewhere, you can just configure Calibre to build the TOC based 'only' on the heading tag itself, regardless of the contents of the tag.
I'll try to make those cases a bit more obvious.
If the chapter headings are all lower case numbers in <p> tags you're sort of out of luck, e.g.:
<p>Seventeen</p>
There isn't any good pattern there except for the fact that it's a short word/phrase without punctuation. When all else fails Heuristics will actually look for points like that and add page breaks, but it won't wrap them in <h2> tags because of a higher chance of false positives.
However if it's like this:
<h3>Seventeen</h3>
Then you can just use '.*' as your regex, or just bypass regex altogether and just use '//h:h3' in the xpath box.
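A rough sketch of how these patterns behave, assuming re:test succeeds when the pattern matches anywhere in the heading text and that the 'i' flag makes it case-insensitive (Python's re.search stands in for it here). Under that assumption both '.*' and '[A-Z' ]+' accept nearly any heading, so the heading tag itself does most of the filtering. The anchored, case-sensitive STRICT pattern is a hypothetical tightening for upper case headings like TWENTY-ONE, not something from the tutorial.

```python
import re

headings = ['SEVENTEEN', 'TWENTY-ONE', 'Seventeen', 'A Golden Harvest']


def detected(text, pattern):
    # Assumption: a match anywhere in the text counts, ignoring case.
    return re.search(pattern, text, re.IGNORECASE) is not None


# Hypothetical stricter pattern: the whole heading must consist of upper case
# letters, apostrophes, spaces or hyphens (case-sensitive, anchored at both ends).
STRICT = re.compile(r"^[A-Z' -]+$")

for text in headings:
    print(
        text,
        detected(text, r'.*'),
        detected(text, r"[A-Z' ]+"),
        STRICT.search(text) is not None,
    )
```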
Piper_ 02-21-2011, 03:44 AM I was just about to crash when I saw this thread, ldolse. I'm too brain dead to think straight right now, but I look forward to it tomorrow, and :beer: to you for doing it!
cybmole 02-21-2011, 03:48 AM can you add the case of images being used as chapter headers to your excellent tutorial also please ( as per my current other thread on this )
ldolse 02-21-2011, 04:41 AM I'll add images, though I have a hunch that there isn't any really clean way to handle that case - you can build a TOC exactly as I described in the other thread, but the entries themselves in the TOC will be empty - I don't know if Calibre will look at 'title' or 'alt' tags if they exist, I sort of doubt it. This might be enough for a lot of Kindle users though, as the 5 way controller will still work. In Sigil it's possible by using 'title' tags, but of course that requires hand editing.
qxlooper 02-25-2011, 04:20 PM Newbie here and have tried following this post. Using a PDF, I am trying to get a TOC.
I looked and I have <h2> </h2>, but there is a span in there also.
Any help would greatly be appreciated!
qxlooper 02-25-2011, 04:24 PM Here is what I am looking at. Please help!
<p><font size="+1"><span class="bold">Chapter 1 Flight Crew Duties and Responsibilities Section 1 Normal Operations</span> Volume 2</font></p>
<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>
qxlooper 02-25-2011, 04:39 PM Also this from Sigil:
<h2 class="calibre5" id="calibre_pb_15">Chapter 1A Station Information</h2>
theducks 02-25-2011, 04:54 PM Here is what I am looking at. Please help!
<p><font size="+1"><span class="bold">Chapter 1 Flight Crew Duties and Responsibilities Section 1 Normal Operations</span> Volume 2</font></p>
<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>
I am not sure what you want.
If you edit in Sigil:
The first is just a paragraph
If you want to make it a Level 1, replace:
<p><font size="+1"><span class="bold">
with
<h1>
and change the trailing </p> to </h1>
the second is ALREADY a valid second level (the level is 2 if after a H1 somewhere, otherwise it is top level)
Sigil will build your TOC from the H1, h2,h3 tags
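To picture what a heading-based TOC looks like, here is a small Python sketch that collects h1-h3 headings and indents them by level. It only illustrates the idea and says nothing about how Sigil does it internally: HeadingCollector and outline are made-up names, the h1 line in the sample is hypothetical, and the sketch simply indents by heading number rather than checking whether an h1 came earlier.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects (level, text) for every h1, h2 and h3 element."""

    def __init__(self):
        super().__init__()
        self.level = None   # level of the heading currently open, if any
        self.parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if self.level is None and re.fullmatch(r'h[1-3]', tag):
            self.level = int(tag[1])
            self.parts = []

    def handle_data(self, data):
        if self.level is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if self.level is not None and tag == f'h{self.level}':
            text = ' '.join(''.join(self.parts).split())  # collapse whitespace
            self.headings.append((self.level, text))
            self.level = None


def outline(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return ['  ' * (level - 1) + text for level, text in collector.headings]


# The h1 is hypothetical; the two h2 lines are the ones posted above.
sample = (
    '<h1>Volume 2</h1>'
    '<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>'
    'Flight Crew Duties and Responsibilities</span></h2>'
    '<p>Some text</p>'
    '<h2 class="calibre5" id="calibre_pb_15">Chapter 1A Station Information</h2>'
)

for line in outline(sample):
    print(line)
```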
theducks 02-25-2011, 04:55 PM Also this from Sigil:
<h2 class="calibre5" id="calibre_pb_15">Chapter 1A Station Information</h2>
Valid for a level 2 TOC
qxlooper 02-25-2011, 06:22 PM I guess what I am asking is how do I do this? I have no clue in either of the programs to make it happen. In Calibre, I do the test function and it finds every h in the document. What, and where do I put where?
The things I posted are from the file. Don't know how to capture the chapter parts. I posted what I was seeing, not knowing how to capture it to make a TOC.
Thanks for helping the newbie!
Keith
I want to just capture the chapters. There are subsections, but just want the chapters to start, maybe the sub sections later. But for now just the chapters.
ldolse 02-25-2011, 09:08 PM Your other posts are a bit confusing as you're talking about using Sigil and Calibre, but you're not saying where any of the examples you're pasting in came from - i.e. 'exactly' how you got that text and pasted it into these forums.
If this line:
<h2 id="filepos167405"><span class="bold">Chapter 1 <a id="filepos167422"/>Flight Crew Duties and Responsibilities</span></h2>
Was created by/seen using Calibre, then the default chapter detection xpath will work - you shouldn't have to make any changes from defaults. However if you did that using Sigil then you need to finish what you started and do that for all the chapters.
You're not going to be able to easily avoid the sub-chapters - are you saying that you're successfully getting a TOC from Calibre but it's the sub-chapters you don't want?
qxlooper 02-25-2011, 11:54 PM I am getting nothing! I get a TOC with a start page. Nothing else. that is why I am lost as to what I should do. The first is from calibre, the second from sigil. I tried the default in calibre and it bookmarked everything except what I wanted it to and created over 2000+ bookmarks because anything that had an h in it was book marked.
I guess I just need an idiots guide to making a toc from a pdf/epub. If anyone wants the challenge, I would be willing to pay to have it done!
Keith
theducks 02-26-2011, 12:13 AM
Quit bouncing back and forth :blink:. You are probably undoing any Sigil work if you convert again in Calibre.
:offtopic: (because this belongs in Sigil)
Sigil makes an EPUB TOC easy. Place the cursor on the line (in book view) you want to appear in the TOC. Set the Heading Level using the pull-down box provided.
Repeat on the next location.
Press F7 (the TOC editor) to see what you have. Remove the ticks from the lines you DO NOT want in the TOC, then click OK.
Finish up. Save
qxlooper 02-26-2011, 05:01 PM Thanks for the help. Got the TOC in sigil that i wanted. Took out all that I didn't. Saved it, opened in calibre, it was there, converted to mobi, GONE. :angry: How can I keep it in the conversion to mobi?
Thanks for all the help!
Keith
DoctorOhh 02-26-2011, 09:09 PM
A better question is what are you setting that caused it to disappear?
I just tried an ePub to Mobi conversion. I ensured everything in the TOC area of conversion was blank or unchecked and left the Number of links at 50 and chapter threshold at 6. I did not check anything in the mobi output. I kept heuristic processing off. The TOC ended up fine in the Mobi.
If your ePub has a properly formatted TOC then when converting to Mobi it should have a good TOC, unless you attempted to adjust things under the TOC or Mobi settings area during conversion.
I don't use Mobi, maybe others can jump in with a plausible reason for what you're experiencing.
theducks 02-26-2011, 09:41 PM (NB I don't use Mobi)
:chinscratch: It almost sounds like qxlooper is not converting the Sigil edited version of the EPUB (or is converting a different format to Mobi. Calibre remembers the LAST convert IN and OUT settings ) :smack: Be sure to drop the 'fixed' EPUB into the meta-data edit window, then pay close attention to the Source format setting, before clicking 'Convert'
DoctorOhh 02-26-2011, 09:47 PM
You may be right. I always use the Open With GUI plugin (http://www.mobileread.com/forums/showthread.php?t=118761) and right click to open the epub in Sigil. Doing this ensures I always save to my existing ePub in the calibre folder and never have to worry about a separate version of the ePub.
cybmole 02-27-2011, 04:57 AM it can be very frustrating when the TOC fails to show in mobi but it's usually ( in my case anyway) a user error.
my standard sequence is fix header tags with calibre heuristics, open + save in sigil to force a new , header based, toc creation ( and I delete any hardcoded one at the same time), then epub to mobi on default settings. check with calibre viewer at each stage.
qxlooper 02-28-2011, 02:55 PM I will try again!
Thanks for the replies!
qxlooper 03-01-2011, 01:56 PM Still no luck! I get all the chapters in the toc, save as a new epub(sigil), open in calibre, see that it is there, then without changing any settings click convert. It does its thing, reopen the mobi, toc, gone. Guess I am doing something wrong. Is doing it in word easier?
Keith
cybmole 03-02-2011, 02:00 AM do it again - this time click reset defaults on convert screen. check / post all of your epub to mobi conversion settings if it still does not work
ldolse 03-02-2011, 03:37 AM Reset defaults, and also import the fixed epub from Sigil to a new book record just to make sure none of your customized conversion options for that book are getting in the way.
cybmole 03-02-2011, 10:45 AM any suggestions on handling/tweaking an img alt tag please e.g.
<h2 class="calibre1 sgc-2" id="heading_id_2"><img alt="[Chapter 1]" class="calibre9" src="../Images/001.jpg" /></h2>
i see nothing in TOC, in viewer
ldolse 03-02-2011, 11:09 AM I suspect that there is a way to do it with XPATH, but you could just use search and replace:
Search:(?P<hopen><h2[^>]*)><img alt="(?P<title>.*?)"
Replace:\g<hopen> title="\g<title>"><img
After that just open it in Sigil and it will create a TOC, save and you're done.
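For anyone who wants to try the expression outside Calibre first, here is a small Python sketch. It assumes the search and replace strings are taken as ordinary Python regular expressions (the syntax above is valid for Python's re module) and runs them against the heading cybmole posted. The variable names are only for illustration.

```python
import re

# The search and replace expressions from the post, unchanged.
SEARCH = r'(?P<hopen><h2[^>]*)><img alt="(?P<title>.*?)"'
REPLACE = r'\g<hopen> title="\g<title>"><img'

heading = (
    '<h2 class="calibre1 sgc-2" id="heading_id_2">'
    '<img alt="[Chapter 1]" class="calibre9" src="../Images/001.jpg" /></h2>'
)

# The alt text is moved from the <img> onto the <h2> as a title.
result = re.sub(SEARCH, REPLACE, heading)
print(result)
```

The <h2> ends up carrying title="[Chapter 1]" and the alt is gone from the <img>.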
cybmole 03-02-2011, 11:30 AM i'm struggling to follow that code. as both hopen & \g are new to me.
but you are replacing the img alt tag with a title tag, I think ?
i picked out the chapter number with a regex and inserted it as text, in a small font, before the image. toc was OK then. not ideal but I can live with it.
so my code became
<h2 class="calibre10" id="heading_id_2">1 <img alt="[Chapter 1]" class="calibre11" src="../Images/001.jpg" /></h2>
and I set font size in calibre10 style to 0.5em
is img alt not supported in epub ?
theducks 03-02-2011, 11:55 AM
Image Alt is for non-visual displays (TTS)
<H2 title="[Chapter 1]" ...
gets that label into the TOC that uses headings (Sigil ;) )
ldolse 03-02-2011, 11:56 AM
Putting ?P<stuff> just after an opening parenthesis lets you use names instead of \1, \2 or whatever syntax numeric backreferences use - I don't use numeric backreferences much, as names are more readable and don't require counting parentheses.
<hopen> was a name I just made up, as was <title>. \g<groupname> is how you refer to the variables in the replacement expression.
As alt tags have no purpose on an ebook the search and replace just deleted the alt tag and moved its contents into the title tag in the header. Sigil reads title tags when it creates a TOC, they have priority over any text that may be inside the <h2> tags (which is none in this case).
ldolse 03-02-2011, 11:58 AM
Didn't think about TTS, but I imagine they might read title tags as well? Anyway it's easy enough to modify the regex/replacement to keep the alt tag if desired.
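One possible way to keep the alt tag, sketched in Python and assuming the expressions behave as ordinary Python regular expressions: the replacement writes the captured text back into the <img> as well as into the title. The second pair does the same thing with numbered groups, to show the difference from named ones. All variable names here are made up for the example.

```python
import re

heading = (
    '<h2 class="calibre1 sgc-2" id="heading_id_2">'
    '<img alt="[Chapter 1]" class="calibre9" src="../Images/001.jpg" /></h2>'
)

# Named groups: copy the alt text into a title and keep the alt itself.
SEARCH_NAMED = r'(?P<hopen><h2[^>]*)><img alt="(?P<title>.*?)"'
REPLACE_NAMED = r'\g<hopen> title="\g<title>"><img alt="\g<title>"'

# The same expression with numbered groups, which means counting parentheses.
SEARCH_NUMBERED = r'(<h2[^>]*)><img alt="(.*?)"'
REPLACE_NUMBERED = r'\1 title="\2"><img alt="\2"'

named = re.sub(SEARCH_NAMED, REPLACE_NAMED, heading)
numbered = re.sub(SEARCH_NUMBERED, REPLACE_NUMBERED, heading)

print(named)
print(named == numbered)
```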
theducks 03-02-2011, 12:03 PM Here is an example <h3 id="heading_id_2" title="Chapter 1"> <img alt="Chapter 1" class="ch" src="../Images/the%20alton%20gift-2.jpg" /></h3>
that passes 'Flightcrew'
cybmole 03-02-2011, 01:14 PM got it - thanks - you told me once before about using title tags to work around images as chapter headers I think, so if I find another example I'll try it your way.
maybe img alt works in some other ebook formats - otherwise I cannot think why anyone would include it in the 1st place ?
ah - google says tts is text to speech so maybe kindle could "read" out loud chapter 1, from that alt tag, as it has a built-in computerised-voice reader that I never use. so does the MS reader for Lit format.
I have a lit version & so try it in MSreader on my pc - , the "voice" says "graphic chapter 1" as it encounters the original code....
theducks 03-02-2011, 03:45 PM
Using the ALT tag is good practice.
Having a meaningful phrase there is better practice. :D
I use a simple , but terse description "Map", "scene breaker", "Publisher Logo"
Chrysanthemum 01-11-2012, 06:32 PM Feeling like a complete idiot here, but I cannot understand any of this. I create my ePubs using OpenOffice because I can use the Navigator to drag and drop the chapters to create clickable chapter links in the Table of Contents that works perfectly with my Sony eReaders. But when I added such an ePub to iBooks on my iPad, the chapters were not there. The 'Contents' on the book's iPad version strangely had the chapters all there by page number but none of the chapter titles. Enter Calibre. I tried converting my OpenOffice ePub to a Calibre ePub hoping the chapter auto detection would work and solve my problem. But instead, the Calibre ePub had no chapters at all, resulting in less functionality than I started with.
I have no idea how to discover what the "code" of my text is. Or how to figure out why Calibre does not recognize the chapters I painstakingly created. I have tried to understand this many times before. I have tried using Sigil, and it's even less understandable than Calibre. I really need help here.
Thanks
Below is the ePub created by Calibre, and below that is the original epub I created with Open Office.