generate TOC duplicates puzzle

cybmole · 01-03-2011, 03:30 AM

I am converting from epub to mobi and trying to generate a TOC.
The epub is split by chapter, and the text CHAPTER occurs only as a chapter heading, i.e. the epub HTML files start with CHAPTER 1, CHAPTER 2, etc.

So I set a chapter detection regex of CHAPTER \d and forced TOC generation.

In the MOBI I do then get a TOC, but each chapter entry is there 4 times. Each of the 4 links to the chapter start correctly, but I can't figure out what is creating the duplicates.
I get e.g. 2 entries saying CHAPTER 1 (followed by the 1st line of text)
plus 2 entries saying only CHAPTER 1.

please see uploaded screen grab of TOC in calibre viewer

Manichean · 01-03-2011, 04:53 AM

As usual, use the HTML interstage to create the chapter descriptions. It may just be that in the source, the word "chapter" occurs four times at the beginning of each chapter. Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.

Edit: Oh, and I assume you do know that

Code:

CHAPTER \d

only works for single-digit chapter headings? Use

Code:

CHAPTER \d+

if you have more than 9 chapters.

cybmole · 01-03-2011, 05:07 AM

Quote:

Originally Posted by Manichean

As usual, use the HTML interstage to create the chapter descriptions.

...don't understand that, sorry.

It may just be that in the source, the word "chapter" occurs four times at the beginning of each chapter.

...definitely not - I used Sigil's Find option to confirm that the word appears only at the start of each chapter, e.g.

Code:

</head>

<body class="calibre" style="">
  <div class="Section">
    <p class="MsoPlainText"></p>

    <p class="MsoPlainText"><span>CHAPTER 1</span></p>

    <p class="MsoPlainText"><span>"I say we should stake him to an anthill and throw little pickles at him."</span></p>

Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.

...correct. I used the XPath wizards - I tried both the look & feel screen option and the structure detection screen option, one at a time.

Edit: Oh, and I assume you do know that

Code:

CHAPTER \d

only works for single-digit chapter headings? Use

Code:

CHAPTER \d+

if you have more than 9 chapters.

...I figured that [chapter n] was a subset of [chapter nn], so it would be detected...

...and it seemed to work - the exact wizard-generated expression is

Code:

 //*[re:test(., "CHAPTER \d", "i")]

and that is finding all 18 chapters but four times each!

Could it be that there's an additional epub file that is being searched? The epub (in the calibre viewer) has nothing except START in its TOC.

cybmole · 01-03-2011, 05:19 AM

update:

I tried again using simply CHAPTER, not CHAPTER \d

Similar results. The TOC also contains a per-chapter entry for NIGHT, which seems to match this piece of code?

Code:

  <title>NIGHT</title>
  <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
  <style type="text/css">
/*<![CDATA[*/
                @page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }
  /*]]>*/

That stuff about margin-bottom also appears in the TOC???

PS Is there a tutorial on XPath that I should study, rather than just winging it?

PPS I know I can fix this manually by editing all the CHAPTER tags in Sigil, from normal to H1 or H2, but that is tedious, and devising a regex for that is a bit beyond me. The manual route is: open the book in Sigil, find CHAPTER, use the drop-down menu to change normal to H2, rinse & repeat... calibre will create a TOC once the H2 tags are in place.

So at present I'm trying to find an alternative way to derive a TOC, based on recognition of the book's chapter labelling, which is usually CHAPTER ONE, CHAPTER TWO, etc. or CHAPTER 1, CHAPTER 2, etc.
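The manual Sigil edit described above could, in principle, be scripted. The following is only a minimal Python sketch, assuming every chapter heading has exactly the markup quoted in this thread (`<p class="MsoPlainText"><span>CHAPTER n</span></p>`); the name `promote_headings` is made up for illustration and is not part of calibre or Sigil.

```python
import re

# Matches a paragraph whose only content is a span reading
# "CHAPTER 1", "CHAPTER 12", "CHAPTER ONE", "CHAPTER TWENTY-ONE", etc.
# Assumption: the class name and nesting are exactly as in the sample.
HEADING = re.compile(
    r'<p class="MsoPlainText">\s*<span>\s*'
    r'(CHAPTER\s+(?:\d+|[A-Z-]+))'
    r'\s*</span>\s*</p>'
)

def promote_headings(xhtml: str) -> str:
    """Rewrite matching paragraphs as <h2> headings; leave everything else alone."""
    return HEADING.sub(r"<h2>\1</h2>", xhtml)

sample = (
    '<p class="MsoPlainText"><span>CHAPTER 2</span></p>\n'
    '<p class="MsoPlainText"><span>Some ordinary body text.</span></p>'
)
converted = promote_headings(sample)
print(converted)
```

Only the first paragraph is rewritten, since the body text does not start with CHAPTER followed by a number or an upper-case word. A regex like this is fragile against markup variations, so it is worth checking the result in Sigil before converting.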

cybmole · 01-03-2011, 05:37 AM

Quote:

Originally Posted by Manichean

Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.

In the calibre XPath wizard screen there are 3 boxes to complete, and it tells you to use a regex in the value box.
The line of code I want to detect is e.g.

Code:

<p class="MsoPlainText"><span>CHAPTER 2</span></p>

That class is used throughout the book, so focusing on the word CHAPTER seems to be the way to go,
so I used tag=* (default), attribute BLANK (default), value CHAPTER \D

Trying again with tag = span... no difference.

PS The "night" entry in the TOC reads

Code:

night /*/@page{margin-bottom  etc

Now NIGHT is in the title HTML tags, as per the extract posted earlier, but how is that also getting into the TOC, once per chapter? I can suppress it by putting the word NIGHT into the TOC FILTER box, but I don't understand the need for that.

Manichean · 01-03-2011, 06:53 AM

There's an XPath tutorial in the usual place.

Did you try using \d+ instead of \d? Looking at what you copied as a chapter heading

Code:

<p class="MsoPlainText"><span>CHAPTER 2</span></p>

I suspect that something like

Code:

//h:p/h:span[re:test(.,'CHAPTER \d+','')]

ought to work. If that still generates four entries per chapter, I'd have a look at the source code if I were you. In that case, I'd suspect there really are four chapter headings hidden somewhere (think inline TOC if there is one...).

cybmole · 01-03-2011, 08:07 AM

Well, your code worked fine. I put it in structure detection.

Now I just need to understand how you did it!

Testing your code without the plus sign now... that also worked (so my "subset of" argument is valid).

Both times I get exactly 18 TOC entries corresponding to 18 chapters.

So is it some other difference between your code & my 1st attempt that is taking care of the duplicates?
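The "subset" observation can be checked outside calibre. This is a small sketch using Python's own `re` module, on the assumption that the detection regex is applied as an unanchored search (which is how the results reported in this thread behave); the 18 headings are generated here purely for illustration.

```python
import re

# Illustrative headings: CHAPTER 1 ... CHAPTER 18, as in the book discussed.
headings = [f"CHAPTER {n}" for n in range(1, 19)]

single = re.compile(r"CHAPTER \d")    # one digit
plus = re.compile(r"CHAPTER \d+")     # one or more digits

# An unanchored search succeeds as soon as "CHAPTER " plus one digit is found,
# so both patterns detect all 18 headings.
single_count = sum(1 for h in headings if single.search(h))
plus_count = sum(1 for h in headings if plus.search(h))
print(single_count, plus_count)                    # 18 18

# The difference is only in how much text each pattern consumes.
print(single.search("CHAPTER 12").group())         # CHAPTER 1
print(plus.search("CHAPTER 12").group())           # CHAPTER 12
```

So for plain detection the two patterns are equivalent; `\d+` only matters when the matched text itself is used, or when the pattern is anchored at the end.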

Manichean · 01-03-2011, 08:18 AM

You used

Code:

//*[re:test(., "CHAPTER \d", "i")]

which tests for any occurrence of the word "chapter" in upper- or lowercase followed by a single digit and inside any tag. My XPath expression tests for an uppercase "CHAPTER" followed by one or more digits inside a span tag, which itself is inside a p tag. Your expression, for whatever reason, fits four instances per chapter. Look at the source and figure it out. And look at the XPath tutorial I linked to earlier.
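One plausible explanation for the repeated matches, offered as a sketch and not as a confirmed diagnosis: in XPath, `.` inside `re:test(., ...)` is the element's string value, meaning all the text of the element and its descendants. Every ancestor of the heading span therefore contains "CHAPTER 1" too. The Python below emulates that with the standard library only (it does not call calibre or a real XPath engine with `re:test`); the sample markup is a trimmed, namespace-free version of the chapter file quoted earlier, and `matching_tags` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<html>
<head><title>NIGHT</title></head>
<body class="calibre">
  <div class="Section">
    <p class="MsoPlainText"></p>
    <p class="MsoPlainText"><span>CHAPTER 1</span></p>
    <p class="MsoPlainText"><span>I say we should stake him to an anthill.</span></p>
  </div>
</body>
</html>"""

def matching_tags(elements, pattern, flags=0):
    """Return the tag of every element whose full text content matches pattern."""
    rx = re.compile(pattern, flags)
    return [el.tag for el in elements if rx.search("".join(el.itertext()))]

root = ET.fromstring(SAMPLE)

# Emulates //*[re:test(., "CHAPTER \d", "i")]: every element is a candidate.
broad = matching_tags(root.iter(), r"CHAPTER \d", re.IGNORECASE)
print(broad)    # ['html', 'body', 'div', 'p', 'span']

# Emulates //p/span[re:test(., 'CHAPTER \d+', '')]: only spans directly inside a p.
narrow = matching_tags(root.iterfind(".//p/span"), r"CHAPTER \d+")
print(narrow)   # ['span']
```

Under this reading, `body` and `div` would give the two entries that show CHAPTER 1 followed by the first line of text, `p` and `span` the two that show only CHAPTER 1, and `html` (whose text begins with the title and style content) the stray NIGHT entry.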

ldolse · 01-03-2011, 08:23 AM

You could also try enabling preprocessing under structure detection for this sort of scenario. Aside from fixing hard line breaks it also looks for common chapter headings which aren't properly marked up and wraps them in <h2> tags. The scenario you're describing here is covered. In this particular case the default chapter detection xpath would work when combined with the preprocess option.

Edit - just saw this was epub source - in that case it's more complicated as epub is assumed to already be 'good'. You'd need to rename the file from .epub to .zip and add a new format as zipped html to be able to use the preprocess option. In most cases where this occurs it would be worth the effort though, since the file was more than likely originally converted from Lit to epub by someone else using an older version of Calibre, and since they didn't mark it up correctly it means that there are probably also page breaks scattered randomly throughout the book. I'd convert from zip back to epub using Calibre, then edit in Sigil and re-join the broken chapters together before trying to convert to mobi.

cybmole · 01-03-2011, 08:31 AM

Quote:

Originally Posted by Manichean

You used

Code:

//*[re:test(., "CHAPTER \d", "i")]

which tests for any occurrence of the word "chapter" in upper- or lowercase followed by a single digit and inside any tag. My XPath expression tests for an uppercase "CHAPTER" followed by one or more digits inside a span tag, which itself is inside a p tag. Your expression, for whatever reason, fits four instances per chapter. Look at the source and figure it out. And look at the XPath tutorial I linked to earlier.

Thanks for the extra explanation, but I still can't figure out how I get exactly 4 in every chapter. FWIW all 4 TOC entries jump to the same chapter start.
Let me post one sample chapter as zipped XHTML source; please tell me what I'm missing here. There are no strange files in the EPUB when I explode it, so my code is somehow deriving 4 instances from this source file.

ldolse · 01-03-2011, 08:33 AM

Quote:

Originally Posted by cybmole

Thanks for the extra explanation, but I still can't figure out how I get exactly 4 in every chapter. FWIW all 4 TOC entries jump to the same chapter start.
Let me post one sample chapter as XHTML source; please tell me what I'm missing here.

Your regex for whatever reason is matching four times. Each time it matches it generates a new TOC entry (to the same place). This is standard behavior for multiple xpath matches. Why it's matching multiple times is not clear, but since you've got a working solution I'd just stay away from what didn't work.

cybmole · 01-03-2011, 08:42 AM

Quote:

Originally Posted by ldolse

Edit - just saw this was epub source -

It began as a lit source, but I had all the structure detection / preprocess options OFF as default (thinking they were only needed for PDF, and that I could do any needed work with Sigil).

I have now tried a similar book, going lit to mobi, this time with structure detection ON and force TOC on, and yes, a TOC is created - ditto for lit to epub.

So the moral here is that I should go with the default preprocessing when working with .LIT sources, I guess.

Lesson learnt - don't change defaults without understanding the consequences.

The 4 x duplicates remain a (less important) puzzle though.

PS I did not "get" the value of a TOC in novels until I found that the Kindle will jump to the next / previous chapter via a single click of its right or left arrow buttons, if there's a chapter-based TOC in the book.
That makes a good TOC worth having.

cybmole · 01-09-2011, 07:54 AM

I have another challenging book - there are chapters within PARTS, and the defaults only added the PARTS to the TOC.

I tried multiple hacks & eventually got chapter numbers without parts by telling XPath to look only for \d+ values, but lost / reformatted the PARTS code somewhere along the way.

When first converted (from a LIT source), both the parts & the chapter numbers had h2 tags, but detecting all h2 stubbornly refused to work.
But I cannot get both parts & chapters whatever I try???

The relevant code lines in my epub - after much to & fro between epub & mobi - now look like:

Code:

 <p class="calibre2"><span class="calibre4"><span class="bold">Part I</span></span></p><a class="calibre3"></a>
<p class="calibre2"><span class="calibre4"><span class="bold">9</span></span></p>

That is for part 1 and for chapter 9 as examples. What do I need in the chapter detection XPath to get both?

I have tried a regex of \d*|Part I in the XPath wizard, with class bold, but no joy. It finds all the chapter numbers but not the Part I, Part II, etc.

If I go all the way back to lit, what can I add to the default chapter detection so that it looks for stand-alone numbers as well as for part, section, chapter, etc.?
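A side note on the `\d*|Part I` attempt, shown with Python's `re` module (whether calibre evaluates the wizard's regex in exactly the same way is an assumption here): `\d*` can match the empty string, so an unanchored search with that alternation succeeds on any text at all. The stricter pattern below is only an illustrative alternative, not the default calibre expression, and the sample labels are made up.

```python
import re

labels = ["Part I", "Part II", "9", "12", "He was born in 1999."]

# \d* matches zero digits, so the search succeeds at position 0 of any string.
loose = re.compile(r"\d*|Part I")

# Anchored alternative: the whole text must be a bare number
# or "Part" followed by a Roman numeral.
strict = re.compile(r"^\s*(?:\d+|Part\s+[IVXLC]+)\s*$")

loose_hits = [s for s in labels if loose.search(s)]
strict_hits = [s for s in labels if strict.search(s)]

print(loose_hits)    # every label, including the ordinary sentence
print(strict_hits)   # ['Part I', 'Part II', '9', '12']
```

With `\d+` in place of `\d*`, and anchors at both ends, stand-alone chapter numbers and part headings are picked up without also matching ordinary paragraphs that merely contain a digit.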

cybmole · 01-09-2011, 08:21 AM

Update - it's OK - I got it - I went all the way back to the original lit source and added \d*| to the default detection formula - I now have a good TOC with both parts and chapters within parts.

I do also find that lit to epub adds a lot of extra white space between paragraphs but I can fix that.

ldolse · 01-09-2011, 07:01 PM

Quote:

Originally Posted by cybmole

I do also find that lit to epub adds a lot of extra white space between paragraphs but I can fix that.

A lot of lit files have large spaces between paragraphs encoded in them - use the 'remove paragraph spacing' option under look and feel to try and remove it.

