Old 01-03-2011, 03:30 AM   #1
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
generate TOC duplicates puzzle

I am converting from epub to mobi and trying to generate a TOC.
The epub is split by chapter, and the text CHAPTER occurs only as a chapter heading, i.e. the epub html files start with CHAPTER 1, CHAPTER 2, etc.

So I set a chapter detection regex of CHAPTER \d and forced TOC generation.

In the MOBI I do then get a TOC, but each chapter entry is there 4 times. Each of the 4 links to the chapter start correctly, but I can't figure out what is creating the duplicates.
I get e.g. 2 entries saying CHAPTER 1 (followed by the 1st line of text)
plus 2 entries saying only CHAPTER 1.

please see uploaded screen grab of TOC in calibre viewer
Attached Thumbnails: New Picture.jpg (38.2 KB)

Last edited by cybmole; 01-03-2011 at 03:59 AM.
Old 01-03-2011, 04:53 AM   #2
Manichean
Wizard
Manichean is the 'tall, dark, handsome stranger' all the fortune-tellers are referring to.
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
As usual, use the HTML interstage to create the chapter descriptions. It may just be that in the source, the word "chapter" occurs four times at the beginning of each chapter. Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.

Edit: Oh, and I assume you do know that
Code:
CHAPTER \d
only works for single-digit chapter headings? Use
Code:
CHAPTER \d+
if you have more than 9 chapters.
Old 01-03-2011, 05:07 AM   #3
cybmole
Quote:
Originally Posted by Manichean
As usual, use the HTML interstage to create the chapter descriptions.

...don't understand that, sorry.

It may just be that in the source, the word "chapter" occurs four times at the beginning of each chapter.


...definitely not - I used the Sigil FIND option to confirm that the word is only at the start of each chapter, e.g.
Code:
</head>

<body class="calibre" style="">
  <div class="Section">
    <p class="MsoPlainText"></p>

    <p class="MsoPlainText"><span>CHAPTER 1</span></p>

    <p class="MsoPlainText"><span>"I say we should stake him to an anthill and throw little pickles at him."</span></p>
Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.

...correct. I used the XPath wizard - I tried both the look & feel screen option and the structure detection screen option, one at a time.


Edit: Oh, and I assume you do know that
Code:
CHAPTER \d
only works for single-digit chapter headings? Use
Code:
CHAPTER \d+
if you have more than 9 chapters.
...I figured that [chapter n] was a subset of [chapter nn], so it would be detected...

...and it seemed to work - the exact wizard-generated expression is
Code:
 //*[re:test(., "CHAPTER \d", "i")]
and that is finding all 18 chapters but four times each!

Could it be that there's an additional epub file that is being searched? The epub (in the calibre viewer) has nothing except START in its TOC.

Last edited by cybmole; 01-03-2011 at 05:10 AM.
Old 01-03-2011, 05:19 AM   #4
cybmole
update:

I tried again using simple CHAPTER, not CHAPTER \d

similar results. TOC also contains a per chapter entry for NIGHT, which seems to match this piece of code ?
Code:
  <title>NIGHT</title>
  <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
  <style type="text/css">
/*<![CDATA[*/
                @page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }
  /*]]>*/
that stuff about margin bottom also appears in the toc ???

PS Is there a tutorial on XPath that I should study, rather than just winging it?

PPS I know I can fix this manually by editing all the CHAPTER tags in Sigil, from normal to H1 or H2, but that is tedious, and devising a regex for that is a bit beyond me. The manual route is: open the book in Sigil, find CHAPTER, use the drop-down menu to change normal to H2, rinse & repeat... calibre will create a TOC once the H2 tags are in place.

So at present I'm trying to find an alternative way to derive a TOC, based on recognition of the book's chapter labelling, which is usually CHAPTER ONE, CHAPTER TWO, etc. or CHAPTER 1, CHAPTER 2, etc.
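
For anyone who would rather script the manual Sigil step described above, here is a minimal sketch in Python. It assumes the chapter headings look exactly like the `<p class="MsoPlainText"><span>CHAPTER 1</span></p>` line quoted earlier in this thread; the helper name `promote_chapter_headings` is made up for illustration and is not part of calibre or Sigil. It would need adjusting for any other markup.

```python
import re

# Hypothetical pattern: a MsoPlainText paragraph whose only content is a span
# reading "CHAPTER <number or word>". Based solely on the markup quoted in this thread.
HEADING_RE = re.compile(
    r'<p class="MsoPlainText">\s*<span>\s*(CHAPTER\s+\w+)\s*</span>\s*</p>'
)


def promote_chapter_headings(xhtml: str) -> str:
    """Rewrite matching heading paragraphs as <h2> elements (illustrative helper)."""
    return HEADING_RE.sub(r"<h2>\1</h2>", xhtml)


sample = (
    '<p class="MsoPlainText"><span>CHAPTER 1</span></p>\n'
    '<p class="MsoPlainText"><span>"I say we should stake him to an anthill."</span></p>'
)
converted = promote_chapter_headings(sample)
print(converted)
```

Only the heading paragraph is rewritten; ordinary text paragraphs that share the same class are left alone because they do not start with CHAPTER.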

Last edited by cybmole; 01-03-2011 at 05:21 AM.
Old 01-03-2011, 05:37 AM   #5
cybmole
Quote:
Originally Posted by Manichean
Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.

In the calibre XPath wizard screen there are 3 boxes to complete, and it tells you to use a regex in the value box.
The line of code I want to detect is, e.g.
Code:
<p class="MsoPlainText"><span>CHAPTER 2</span></p>
That class is used throughout the book, so focusing on the word CHAPTER seems to be the way to go.
So I used tag = * (default), attribute BLANK (default), value CHAPTER \d

Trying again with tag = span... no difference...

PS the "night" entry in toc reads
Code:
night /*/@page{margin-bottom  etc
Now NIGHT is in the title html tags as per the extract posted earlier, but how is that also getting into the TOC, once per chapter? I can suppress it by putting the word NIGHT into the TOC FILTER box, but I don't understand the need for that.

Last edited by cybmole; 01-03-2011 at 05:47 AM.
Old 01-03-2011, 06:53 AM   #6
Manichean
There's an XPath tutorial in the usual place.

Did you try using \d+ instead of \d? Looking at what you copied as a chapter heading
Code:
<p class="MsoPlainText"><span>CHAPTER 2</span></p>
I suspect that something like
Code:
//h:p/h:span[re:test(.,'CHAPTER \d+','')]
ought to work. If that still generates four entries per chapter, I'd have a look at the source code if I were you. In that case, I'd suspect there really are four chapter headings hidden somewhere (think inline TOC if there is one...).
Old 01-03-2011, 08:07 AM   #7
cybmole
Well, your code worked fine. I put it in structure detection.

Now I just need to understand how you did it!

Testing your code without the plus sign now... that also worked (so my "subset of" argument is valid).
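
The "subset" argument can be checked outside calibre with a small Python sketch. It assumes that `re:test` behaves like an unanchored search, which is consistent with the result reported above but is not confirmed in this thread. Python's `re.search` stands in for it here.

```python
import re

# 18 chapter headings, matching the chapter count mentioned in this thread.
headings = [f"CHAPTER {n}" for n in range(1, 19)]

single = re.compile(r"CHAPTER \d")
multi = re.compile(r"CHAPTER \d+")

# search() succeeds if the pattern occurs anywhere in the text, so
# "CHAPTER \d" still finds "CHAPTER 1" inside "CHAPTER 12".
found_single = [h for h in headings if single.search(h)]
found_multi = [h for h in headings if multi.search(h)]
print(len(found_single), len(found_multi))

# The two patterns differ only in how much text they capture.
print(single.search("CHAPTER 12").group(0))
print(multi.search("CHAPTER 12").group(0))
```

For a yes/no test both patterns flag every heading, so `\d+` only matters when the matched text itself is used, or when the pattern is anchored at the end.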

both times I get exactly 18 TOC entries corresponding to 18 chapters.

so it's some other difference between your code & my 1st attempt that is taking care of the duplicates ?
Old 01-03-2011, 08:18 AM   #8
Manichean
You used
Code:
//*[re:test(., "CHAPTER \d", "i")]
which tests for any occurrence of the word "chapter" in upper- or lowercase followed by a single digit and inside any tag. My XPath expression tests for an uppercase "CHAPTER" followed by one or more digits inside a span tag, which itself is inside a p tag. Your expression, for whatever reason, matches four times per chapter. Look at the source and figure it out. And look at the XPath tutorial I linked to earlier.
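
One plausible reading of the duplicates, offered as a sketch rather than a confirmed diagnosis: in XPath, `.` is the string value of an element, which includes the text of all its descendants. With `//*`, every ancestor of the heading span therefore also passes the test. The Python below imitates this with the standard library on a cut-down sample modelled on the markup posted in this thread. Namespaces are omitted, and `re.search` stands in for `re:test`. How calibre turns each match into a TOC entry is not shown here.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down sample modelled on the markup quoted earlier in the thread.
SAMPLE = """<html>
<head><title>NIGHT</title></head>
<body class="calibre">
  <div class="Section">
    <p class="MsoPlainText"><span>CHAPTER 1</span></p>
    <p class="MsoPlainText"><span>"I say we should stake him to an anthill."</span></p>
  </div>
</body>
</html>"""


def string_value(elem):
    # Approximates the XPath string value: own text plus all descendant text.
    return "".join(elem.itertext())


root = ET.fromstring(SAMPLE)
pattern = re.compile(r"CHAPTER \d", re.IGNORECASE)

# In the spirit of //*[re:test(., "CHAPTER \d", "i")]: test every element.
broad = [e.tag for e in root.iter() if pattern.search(string_value(e))]

# In the spirit of //p/span[re:test(., 'CHAPTER \d+', '')]: only spans inside p.
narrow = [
    span.tag
    for p in root.iter("p")
    for span in p.findall("span")
    if re.search(r"CHAPTER \d+", string_value(span))
]

print(broad)
print(narrow)
```

In this sample the broad test matches html, body, div, p and span, while the narrow one matches only the span. The html element's text begins with the title (NIGHT), body and div carry the heading plus the following text, and p and span carry the heading alone, which would fit the entries described earlier in the thread.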
Old 01-03-2011, 08:23 AM   #9
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You could also try enabling preprocessing under structure detection for this sort of scenario. Aside from fixing hard line breaks it also looks for common chapter headings which aren't properly marked up and wraps them in <h2> tags. The scenario you're describing here is covered. In this particular case the default chapter detection xpath would work when combined with the preprocess option.

Edit - just saw this was epub source - in that case it's more complicated as epub is assumed to already be 'good'. You'd need to rename the file from .epub to .zip and add a new format as zipped html to be able to use the preprocess option. In most cases where this occurs it would be worth the effort though, since the file was more than likely originally converted from Lit to epub by someone else using an older version of Calibre, and since they didn't mark it up correctly it means that there are probably also page breaks scattered randomly throughout the book. I'd convert from zip back to epub using Calibre, then edit in Sigil and re-join the broken chapters together before trying to convert to mobi.

Last edited by ldolse; 01-03-2011 at 08:30 AM.
Old 01-03-2011, 08:31 AM   #10
cybmole
Quote:
Originally Posted by Manichean
You used
Code:
//*[re:test(., "CHAPTER \d", "i")]
which tests for any occurrence of the word "chapter" in upper- or lowercase followed by a single digit and inside any tag. My XPath expression tests for an uppercase "CHAPTER" followed by one or more digits inside a span tag, which itself is inside a p tag. Your expression, for whatever reason, matches four times per chapter. Look at the source and figure it out. And look at the XPath tutorial I linked to earlier.
Thanks for the extra explanation, but I still can't figure out how I get exactly 4 in every chapter. FWIW, all 4 TOC entries jump to the same chapter start.
Let me post one sample chapter as zipped XHTML source; please tell me what I'm missing here. There are no strange files in the EPUB when I explode it, so my code is somehow deriving 4 instances from this source file.
Attached Files
File Type: zip Section0014.zip (5.8 KB, 173 views)

Last edited by cybmole; 01-03-2011 at 08:33 AM.
Old 01-03-2011, 08:33 AM   #11
ldolse
Quote:
Originally Posted by cybmole
thanks for the extra explanation, but I still can't figure out how I get exactly 4 in every chapter. FWIW all 4 TOC entries jump to the same chapter start.
let me post one sample chapter as xhtml source & tell me what I'm missing here please
Your regex for whatever reason is matching four times. Each time it matches it generates a new TOC entry (to the same place). This is standard behavior for multiple xpath matches. Why it's matching multiple times is not clear, but since you've got a working solution I'd just stay away from what didn't work.
Old 01-03-2011, 08:42 AM   #12
cybmole
Quote:
Originally Posted by ldolse

Edit - just saw this was epub source -
It began as a lit source, but I had all the structure detection / preprocess options OFF as default (thinking they were only needed for pdf, and that I could do any needed work with Sigil).

I have now tried a similar book, going lit to mobi, this time with structure detection ON and force TOC on, and yes, a TOC is created - ditto for lit to epub.

so the moral here is that I should go with the default preprocessing when working with .LIT sources I guess.

Lesson learnt - don't change defaults without understanding the consequences.

The 4x duplicates remain a (less important) puzzle though.

PS I did not "get" the value of TOC in novels, until I found that Kindle will jump to next / prev chapter via single click of its right or left arrow buttons, if there's a chapter based TOC in the book.
That makes a good TOC worth having.

Last edited by cybmole; 01-03-2011 at 08:49 AM.
Old 01-09-2011, 07:54 AM   #13
cybmole
I have another challenging book - there are chapters within PARTS, and the defaults only added PARTS to the TOC.

I tried multiple hacks and eventually got chapter numbers without parts by telling XPath to look only for \d+ values, but lost / reformatted the PARTS code somewhere along the way.

When first converted (from a LIT source) both the parts and the chapter numbers had h2 tags, but detecting all h2 stubbornly refused to work.
I cannot get both parts & chapters, whatever I try.

The relevant code lines in my epub, after much to & fro between epub & mobi, now look like:
Code:
 <p class="calibre2"><span class="calibre4"><span class="bold">Part I</span></span></p><a class="calibre3"></a>
<p class="calibre2"><span class="calibre4"><span class="bold">9</span></span></p>
That is for Part I and for chapter 9 as examples. What do I need in the chapter detection XPath to get both?

I have tried a regex of \d*|Part I in the XPath wizard, with class bold, but no joy. It finds all the chapter numbers but not Part I, Part II, etc.

If I go all the way back to lit, what can I add to the default chapter detection so that it looks for stand-alone numbers as well as for part, section, chapter, etc.?
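
As a sketch of a stricter pattern for this kind of book, the Python below matches a heading that is either a stand-alone number or "Part" followed by a Roman numeral. It is based only on the two sample lines quoted above. How calibre's wizard combines a regex with the class attribute is not covered, and whether its regex engine treats anchors the same way as Python's is an assumption.

```python
import re

# Assumed heading texts: "Part I", "Part II", ... or a bare chapter number.
HEADING_RE = re.compile(r"^\s*(?:Part\s+[IVXLC]+|\d+)\s*$")

candidates = ["Part I", "Part II", "9", "12", "", "He counted 9 sheep.", "Partly cloudy"]
matches = [text for text in candidates if HEADING_RE.match(text)]
print(matches)

# General regex point: \d* can match the empty string, so an unanchored
# search with \d*|Part I succeeds on any text at all, including "".
loose = re.compile(r"\d*|Part I")
print(all(loose.search(text) is not None for text in candidates))
```

Anchoring with `^` and `$` and using `\d+` rather than `\d*` keeps body text that merely contains a digit or the word "Part" out of the results.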

Last edited by cybmole; 01-09-2011 at 08:16 AM.
Old 01-09-2011, 08:21 AM   #14
cybmole
Update - it's OK - I got it - I went all the way back to the original lit source and added \d*| to the default detection formula - I now have a good TOC with both parts and chapters within parts.

I do also find that lit to epub adds a lot of extra white space between paragraphs but I can fix that.
Old 01-09-2011, 07:01 PM   #15
ldolse
Quote:
Originally Posted by cybmole
I do also find that lit to epub adds a lot of extra white space between paragraphs but I can fix that.
A lot of lit files have large spaces between paragraphs encoded in them - use the 'remove paragraph spacing' option under look and feel to try and remove it.