#1
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
generate TOC duplicates puzzle
I am converting from EPUB to MOBI and trying to generate a TOC.
The EPUB is split by chapter, and the text CHAPTER occurs only as a chapter heading, i.e. the EPUB HTML files start with CHAPTER 1, CHAPTER 2, etc. So I set a chapter-detection regex of CHAPTER \d and forced TOC generation. In the MOBI I do then get a TOC, but each chapter entry is there 4 times. Each of the 4 links to the chapter start correctly, but I can't figure out what is creating the duplicates. I get e.g. 2 entries saying CHAPTER 1 (followed by the 1st line of text) plus 2 entries saying only CHAPTER 1. Please see the uploaded screen grab of the TOC in the calibre viewer.
Last edited by cybmole; 01-03-2011 at 03:59 AM.
#2
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
As usual, use the HTML interstage to create the chapter descriptions. It may just be that in the source, the word "chapter" occurs four times at the beginning of each chapter. Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.
Edit: Oh, and I assume you do know that
Code:
CHAPTER \d
Code:
CHAPTER \d+
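For anyone following along, the difference between the two patterns above can be sketched in Python. This assumes the engine behind `re:test` does a search-style, case-insensitive match like Python's `re.search` with `re.IGNORECASE`; the heading strings are made up for illustration.

```python
import re

# Made-up heading strings, for illustration only.
headings = ["CHAPTER 1", "CHAPTER 12", "Chapter 3", "PROLOGUE"]


def detect(pattern, text):
    """Return the matched substring, or None (search-style, case-insensitive)."""
    m = re.search(pattern, text, re.IGNORECASE)
    return m.group(0) if m else None


# What each pattern actually captures for every heading.
single = {h: detect(r"CHAPTER \d", h) for h in headings}
multi = {h: detect(r"CHAPTER \d+", h) for h in headings}

for h in headings:
    print(f"{h!r}: \\d -> {single[h]!r}, \\d+ -> {multi[h]!r}")
```

Under that assumption, `\d` still detects "CHAPTER 12"; it just stops after the first digit, so either pattern flags the same headings.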
#3
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
.. and it seemed to work. The exact wizard-generated expression is
Code:
//*[re:test(., "CHAPTER \d", "i")]
Could it be that there's an additional EPUB file being searched? The EPUB (in the calibre viewer) has nothing except START in its TOC.
Last edited by cybmole; 01-03-2011 at 05:10 AM.
#4
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
update:
I tried again using simple CHAPTER, not CHAPTER \d, with similar results. The TOC also contains a per-chapter entry for NIGHT, which seems to match this piece of code?
Code:
<title>NIGHT</title>
<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
<style type="text/css">
/*<![CDATA[*/
@page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }
/*]]>*/
PS Is there a tutorial on XPath that I should study, rather than just winging it?
PPS I know I can fix this manually by editing all CHAPTER occurrence tags in Sigil, from normal to H1 or H2, but that is tedious, and devising a regex for that is a bit beyond me. The manual route is: open the book in Sigil, find CHAPTER, use the drop-down menu to change normal to H2, rinse and repeat. calibre will create a TOC once the H2 tags are in place. So at present I'm trying to find an alternative way to derive a TOC, based on recognition of the book's chapter labelling, which is usually CHAPTER ONE, CHAPTER TWO, etc. or CHAPTER 1, CHAPTER 2, etc.
Last edited by cybmole; 01-03-2011 at 05:21 AM.
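The manual retagging described above could in principle be scripted. A minimal sketch in Python, assuming every chapter heading has exactly the markup `<p class="MsoPlainText"><span>CHAPTER 2</span></p>` (the sample line quoted in this thread). `promote_headings` is a hypothetical helper, not part of Sigil or calibre, and the body text in the sample is invented.

```python
import re

# Assumed heading markup: a MsoPlainText paragraph holding one span whose
# text is CHAPTER followed by digits (CHAPTER 2) or capitals (CHAPTER TWO).
HEADING = re.compile(
    r'<p class="MsoPlainText"><span>(CHAPTER\s+(?:\d+|[A-Z]+))</span></p>'
)


def promote_headings(xhtml: str) -> str:
    """Rewrite matching heading paragraphs as <h2> elements (hypothetical helper)."""
    return HEADING.sub(r"<h2>\1</h2>", xhtml)


# Invented sample: one heading line and one ordinary paragraph.
sample = (
    '<p class="MsoPlainText"><span>CHAPTER 2</span></p>\n'
    '<p class="MsoPlainText"><span>It was a dark night.</span></p>\n'
)
converted = promote_headings(sample)
print(converted)
```

Only paragraphs that match the assumed markup exactly are touched, so it would be worth running this on a copy of the book and checking the result in Sigil before converting.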
#5
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote (Manichean):
Also, I believe the chapter detection is done by XPath instead of regexes, so if you do use regexes, results may vary.
In the calibre XPath wizard screen there are 3 boxes to complete, and it tells you to use a regex in the value box. The line of code I want to detect is e.g.
Code:
<p class="MsoPlainText"><span>CHAPTER 2</span></p>
So I used tag = * (default), attribute blank (default), value CHAPTER \d. Trying again with tag = span... no difference.
PS The "night" entry in the TOC reads
Code:
night /*/@page{margin-bottom etc
Last edited by cybmole; 01-03-2011 at 05:47 AM.
#6
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
There's an XPath tutorial in the usual place.
Did you try using \d+ instead of \d? Looking at what you copied as a chapter heading
Code:
<p class="MsoPlainText"><span>CHAPTER 2</span></p>
Code:
//h:p/h:span[re:test(.,'CHAPTER \d+','')]
#7
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Well, your code worked fine. I put it in structure detection.
Now I just need to understand how you did it! Testing your code without the plus sign now... that also worked (so my "subset of" argument is valid). Both times I get exactly 18 TOC entries, corresponding to the 18 chapters. So it's some other difference between your code and my 1st attempt that is taking care of the duplicates?
#8
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
You used
Code:
//*[re:test(., "CHAPTER \d", "i")]
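One possible explanation for the four matches, not confirmed anywhere in this thread: in XPath, `.` is the string value of an element, i.e. all the text nested inside it, so `//*` can match every ancestor of the heading as well as the heading itself. The sketch below emulates this in Python with the standard library, since `re:test` is an EXSLT extension that `xml.etree.ElementTree` does not provide. The chapter file is a made-up sample modelled on the fragments quoted here, and `string_value` and `local` are hypothetical helpers.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up chapter file modelled on the fragments quoted in this thread.
doc = f"""<html xmlns="{XHTML_NS}">
<head><title>NIGHT</title></head>
<body>
<p class="MsoPlainText"><span>CHAPTER 2</span></p>
<p class="MsoPlainText"><span>Some first line of text.</span></p>
</body>
</html>"""

root = ET.fromstring(doc)
pattern = re.compile(r"CHAPTER \d", re.IGNORECASE)


def string_value(elem):
    """XPath-style string value: all descendant text, concatenated."""
    return "".join(elem.itertext())


def local(elem):
    """Tag name without the namespace prefix."""
    return elem.tag.split("}", 1)[-1]


# Emulates the wildcard expression: test every element in the document.
broad = [local(e) for e in root.iter() if pattern.search(string_value(e))]

# Emulates the targeted expression: only span elements directly inside p.
ns = {"h": XHTML_NS}
narrow = [
    local(e)
    for e in root.findall(".//h:p/h:span", ns)
    if re.search(r"CHAPTER \d+", string_value(e))
]

print(broad)   # elements matched by the wildcard expression
print(narrow)  # elements matched by the targeted expression
```

On this sample the wildcard expression matches `html`, `body`, `p` and `span`, while the targeted one matches only the `span`. That would be consistent with two entries that carry extra text and two that say only CHAPTER, but the real file may be structured differently.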
#9
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You could also try enabling preprocessing under structure detection for this sort of scenario. Aside from fixing hard line breaks, it also looks for common chapter headings which aren't properly marked up and wraps them in <h2> tags. The scenario you're describing here is covered. In this particular case the default chapter detection XPath would work when combined with the preprocess option.
Edit - just saw this was an EPUB source. In that case it's more complicated, as EPUB is assumed to already be 'good'. You'd need to rename the file from .epub to .zip and add it as a new format (zipped HTML) to be able to use the preprocess option. In most cases where this occurs it would be worth the effort though, since the file was more than likely originally converted from LIT to EPUB by someone else using an older version of calibre, and since they didn't mark it up correctly, there are probably also page breaks scattered randomly throughout the book. I'd convert from ZIP back to EPUB using calibre, then edit in Sigil and re-join the broken chapters before trying to convert to MOBI.
Last edited by ldolse; 01-03-2011 at 08:30 AM.
#10
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Let me post one sample chapter as zipped XHTML source, and please tell me what I'm missing here. There are no strange files in the EPUB when I explode it, so my code is somehow deriving 4 instances from this source file.
Last edited by cybmole; 01-03-2011 at 08:33 AM.
#11
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Your regex, for whatever reason, is matching four times. Each time it matches, it generates a new TOC entry (to the same place). This is standard behavior for multiple XPath matches. Why it's matching multiple times is not clear, but since you've got a working solution I'd just stay away from what didn't work.
#12
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
It began as a LIT source, but I had all the structure detection / preprocess options OFF as default (thinking they were only needed for PDF, and that I could do any needed work with Sigil).
I have now tried a similar book, going LIT to MOBI, this time with structure detection ON and force TOC on, and yes, a TOC is created. Ditto for LIT to EPUB. So the moral here is that I should go with the default preprocessing when working with .LIT sources, I guess. Lesson learnt: don't change defaults without understanding the consequences. The 4x duplicates remain a (less important) puzzle though.
PS I did not "get" the value of a TOC in novels until I found that the Kindle will jump to the next / previous chapter via a single click of its right or left arrow buttons, if there's a chapter-based TOC in the book. That makes a good TOC worth having.
Last edited by cybmole; 01-03-2011 at 08:49 AM.
#13
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
I have another challenging book: there are chapters within PARTS, and the defaults only added the PARTS to the TOC.
I tried multiple hacks and eventually got chapter numbers without parts by telling XPath to look only for \d+ values, but lost / reformatted the PARTS code somewhere along the way. When first converted (from a LIT source), both the parts and the chapter numbers had h2 tags, but detecting all h2 stubbornly refused to work. I cannot get both parts and chapters, whatever I try. The relevant code lines in my EPUB, after much to and fro between EPUB and MOBI, now look like:
Code:
<p class="calibre2"><span class="calibre4"><span class="bold">Part I</span></span></p><a class="calibre3"></a>
<p class="calibre2"><span class="calibre4"><span class="bold">9</span></span></p>
I have tried a regex of \d*|Part I in the XPath wizard, with class bold, but no joy. It finds all the chapter numbers but not the Part I, Part II, etc. If I go all the way back to LIT, what can I add to the default chapter detection so that it looks for stand-alone numbers as well as for part, section, chapter, etc.?
Last edited by cybmole; 01-09-2011 at 08:16 AM.
#14
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
update - it's OK, I got it. I went all the way back to the original LIT source and added \d*| to the default detection formula. I now have a good TOC with both parts and chapters within parts.
I do also find that LIT to EPUB adds a lot of extra white space between paragraphs, but I can fix that.
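A caveat on `\d*|`, sketched in Python and assuming the engine behind `re:test` behaves like Python's `re.search`: `\d*` also matches the empty string, so on its own the alternation accepts any text, and it is whatever else the XPath requires (an h2 tag, a class) that does the filtering. The `strict` pattern is only a hypothetical alternative, and the sample strings are invented apart from "9" and "Part I".

```python
import re


def is_heading(pattern, text):
    """Search-style, case-insensitive test (assumed to mirror re:test with "i")."""
    return re.search(pattern, text, re.IGNORECASE) is not None


# Pattern from the thread: \d* can match zero digits, i.e. the empty string.
loose = r"\d*|Part I"

# Hypothetical stricter alternative: the whole text must be a number
# or "Part" followed by a Roman numeral.
strict = r"^\s*(?:\d+|Part\s+[IVXLC]+)\s*$"

samples = ["9", "Part I", "Part II", "He walked on."]
loose_hits = [s for s in samples if is_heading(loose, s)]
strict_hits = [s for s in samples if is_heading(strict, s)]

print(loose_hits)   # strings accepted by the loose pattern
print(strict_hits)  # strings accepted by the strict pattern
```

If a conversion ever picks up stray TOC entries with the loose pattern, requiring at least one digit (`\d+`) and anchoring the match is one thing to try.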