structure detection - documentation ? - Page 2

kovidgoyal · 01-11-2011, 12:38 PM

I'm going to need the file to comment further.

cybmole · 01-11-2011, 01:00 PM

twice is not so bad, I had a case (in another thread) that was detecting everything x 4 !

xpath is a pig to configure. I spent ages earlier today telling structure detection to look for the default h1 h2 tags + part||section etc OR for stand-alone (chapter) numbers in bold, & no matter how I tried I got "invalid" from the syntax checker. the wizard will not build "OR" structures, so I was trying to adapt the default and change the OR ...class = chapter... construct to [class = bold and text is of form \d* ] but no joy.

could someone please tell me if that is do-able.

I have books where chapters are numbered 1, 2 etc. but do not have h1 or h2 tags. if those chapters are within sections or parts then structure detection (assisted by preprocess) seems unable/unwilling to build a full TOC, it just does the easy stuff and builds a TOC of parts / sections

Wolfgan · 01-11-2011, 01:18 PM

Quote:

Originally Posted by cybmole

twice is not so bad, I had a case (in another thread) that was detecting everything x 4 !

xpath is a pig to configure. I spent ages earlier today telling structure detection to look for the default h1 h2 tags + part||section etc OR for stand-alone (chapter) numbers in bold, & no matter how I tried I got "invalid" from the syntax checker. the wizard will not build "OR" structures, so I was trying to adapt the default and change the OR ...class = chapter... construct to [class = bold and text is of form \d* ] but no joy.

could someone please tell me if that is do-able.

I have books where chapters are numbered 1, 2 etc. but do not have h1 or h2 tags. if those chapters are within sections or parts then structure detection (assisted by preprocess) seems unable/unwilling to build a full TOC, it just does the easy stuff and builds a TOC of parts / sections

AFAIK XPath's power comes from how good your regular expression is. Check this thread for ideas on how to modify your regex.

Use the debug feature of the conversion process, and check the operation log window to tweak your regex.
Good luck! Wolf

cybmole · 01-11-2011, 01:36 PM

I have studied the example in that thread, plus other regex syntax stuff. I need to see a model answer for this particular challenge please. I suspect the xpath thingie in calibre may be limited in how much and/or what complexity it can accept.

I would also, for the sake of clarity, appreciate nailing down, for the three cases below, whether calibre applies a) the structure detection line b) the preprocess option

ie complete the table with Y or N as needed?

                    apply preprocess    apply structure detect
1. epub to epub     N ?                 ???
2. epub to mobi
3. mobi to epub     y ?

kovidgoyal · 01-11-2011, 01:43 PM

your xpath expression is matching both the p and the span tags. Use

Code:

//h:p

instead of //*
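
A small sketch of why //* matches twice while //h:p matches once. It uses Python's standard library rather than calibre itself, and the sample markup and the '^#' pattern are assumptions made up for illustration (the actual expression and file were posted earlier in the thread). The point is that an element's text value includes the text of its children, so both the p and the span inside it satisfy the same test.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: a heading paragraph whose text sits inside a span.
doc = ET.fromstring(
    f'<html xmlns="{XHTML}"><body>'
    '<p class="calibre1">Intro.</p>'
    '<p class="calibre8"><span class="bold">#1 Opening</span></p>'
    '<p class="calibre1">Body text.</p>'
    '</body></html>'
)

def text_of(el):
    """Concatenated text of an element and its descendants (like XPath's '.')."""
    return "".join(el.itertext())

def local_name(el):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return el.tag.split("}")[1]

pattern = re.compile(r"^#")  # assumed pattern: heading text starts with '#'

# Rough analogue of //*[re:test(., '^#')]: every element is a candidate.
any_element = [local_name(el) for el in doc.iter() if pattern.search(text_of(el))]

# Rough analogue of //h:p[re:test(., '^#')]: only <p> elements are candidates.
p_only = [local_name(el) for el in doc.iter(f"{{{XHTML}}}p") if pattern.search(text_of(el))]

print(any_element)  # the <p> and the <span> nested inside it
print(p_only)       # just the <p>
```

Restricting the candidates to p is what removes the duplicate: the span still passes the regex test, it is simply never considered.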

cybmole · 01-11-2011, 01:52 PM

update:
here's where I am stuck on syntax.
I want to pick out lines like this which are chapter starts, as well as still picking out sections and parts.

Code:

<p class="calibre8"><span class="calibre3 bold">7</span></p>

So I take the default structure detection expression & change the ending from class = chapter to class = bold. that is still valid & I can test it, but it finds far too much stuff, as expected.
so i try to amend the ending to test for both class = bold AND values are digits e.g.

Code:

//*[((name()='h1' or name()='h2') and re:test(., 'chapter|part\s+', 'i')) or (@class = 'bold' and re:test (\d*))]

i try lots of permutations of how many brackets to use & where to place them but I cannot get past the syntax checker
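
Two things in the expression above look like the likely culprits. First, re:test (\d*) has neither a node argument nor a quoted pattern, unlike the re:test(., '...', 'i') call earlier in the same expression. Second, @class = 'bold' is an exact string comparison, so it would not match class="calibre3 bold" even once the syntax is accepted. The Python sketch below (standard library only, not calibre's code) mirrors the intended logic; the XPath in the comment is one possible rewrite and has not been tested against calibre's syntax checker. The sample markup other than the chapter line is invented.

```python
import re
import xml.etree.ElementTree as ET

# Possible XPath rewrite (an untested assumption, shown for orientation only):
#   //*[((name()='h1' or name()='h2') and re:test(., 'chapter|part\s+', 'i'))
#       or (name()='p' and .//h:span[contains(@class, 'bold')]
#           and re:test(., '^\s*\d+\s*$'))]

# Hypothetical sample; namespaces are left out to keep the sketch short.
sample = (
    "<body>"
    "<h2>Part One</h2>"
    '<p class="calibre8"><span class="calibre3 bold">7</span></p>'
    '<p class="calibre1">He paid <span class="calibre3 bold">very</span> little.</p>'
    '<p class="calibre1">Room <span class="calibre3 bold">12</span> was empty.</p>'
    "</body>"
)

def is_heading(el):
    """True for h1/h2 part or chapter titles, or a <p> holding only a bold number."""
    text = "".join(el.itertext())
    if el.tag in ("h1", "h2"):
        return re.search(r"chapter|part\s+", text, re.I) is not None
    if el.tag == "p":
        # Token match on the class list; slightly stricter than XPath contains().
        has_bold_span = any(
            "bold" in span.get("class", "").split() for span in el.iter("span")
        )
        # The whole paragraph must be digits, so bold numbers mid-sentence are skipped.
        return has_bold_span and re.fullmatch(r"\s*\d+\s*", text) is not None
    return False

root = ET.fromstring(sample)
headings = ["".join(el.itertext()).strip() for el in root.iter() if is_heading(el)]
print(headings)
```

Testing the paragraph rather than the span keeps bold words and bold numbers inside ordinary sentences out of the TOC, and avoids the p/span double match discussed above.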

ldolse · 01-11-2011, 02:03 PM

Quote:

Originally Posted by cybmole

I have studied the example in that thread, plus other regex syntax stuff. I need to see a model answer for this particular challenge please. I suspect the xpath thingie in calibre may be limited in how much and/or what complexity it can accept.

I would also, for the sake of clarity, appreciate nailing down, for the three cases below, whether calibre applies a) the structure detection line b) the preprocess option

ie complete the table with Y or N as needed?

                    apply preprocess    apply structure detect
1. epub to epub     N ?                 ???
2. epub to mobi
3. mobi to epub     y ?

Preprocess happens before the xpath chapter detection. It never happens on an epub to anything conversion, as the conversion process bypasses that entire stage of the conversion pipeline for epub. All other formats can be preprocessed - IIRC mobi to epub at the moment doesn't go through the full preprocessing logic, though that could be changed if you are seeing a lot of badly formatted mobi files.

Preprocess will go through your document and look for common chapter headings using a heuristic method - it should be able to mark up simple numeric headings like the ones you're listing in your code. If your doc already uses h1, h2, h3 tags, etc. then the heuristic processor disables itself - you just need to look at your code and write the correct xpath.

If the preprocess stage finds a chapter header during its search it wraps the headings in <h2> tags. It wraps subtitles if they exist in <h3> tags.

If you have preprocess enabled and it's successfully detecting/marking up your chapters then you need to have the xpath look for <h2> tags - I often use the default XPATH and just change the regex to .*.

The xpath is processed later in the conversion processing of the document - I believe at the beginning of the output stage.

Also if you're having trouble getting the xpath right, but preprocess was successful, then Sigil will automatically create a TOC you can edit, as Sigil exclusively builds the TOC based off of headers like h1, h2, h3, h4 tags....
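
To make the interaction concrete, here is a toy sketch of the idea: a heuristic pass marks bare numeric paragraphs as h2, after which a heading-based search finds them. This is not calibre's preprocess code; the regex, the function name mark_up_numeric_headings and the sample markup are all made up for illustration, and this version replaces the paragraph with an h2 rather than reproducing whatever markup calibre actually emits.

```python
import re

# Toy stand-in for the heuristic step; NOT calibre's actual preprocess logic.
# Matches a paragraph whose only content is a number, optionally inside a span.
BARE_NUMBER_PARA = re.compile(
    r"<p[^>]*>\s*(?:<span[^>]*>)?\s*(\d+)\s*(?:</span>)?\s*</p>"
)

def mark_up_numeric_headings(html):
    """Rewrite paragraphs that contain only a number as <h2> headings."""
    return BARE_NUMBER_PARA.sub(r"<h2>\1</h2>", html)

before = (
    '<p class="calibre8"><span class="calibre3 bold">7</span></p>'
    '<p class="calibre1">Seven days later he left.</p>'
)
after = mark_up_numeric_headings(before)

# Later stage: anything looking for h2 headings now sees the chapter number,
# which is why a heading-based XPath with a permissive regex is enough.
headings = re.findall(r"<h2>(.*?)</h2>", after)
print(after)
print(headings)
```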

Wolfgan · 01-11-2011, 02:10 PM

Quote:

Originally Posted by kovidgoyal

your xpath expression is matching both the p and the span tags. Use

Code:

//h:p

instead of //*

Thanks a lot, that did the trick. I don't understand why it works as the span tag doesn't start with '#' , but the change certainly worked.
Thanks again, Wolf.

kovidgoyal · 01-11-2011, 02:14 PM

looking at the code you posted, the first character inside the span tag is indeed a #

Wolfgan · 01-11-2011, 02:26 PM

Quote:

Originally Posted by kovidgoyal

looking at the code you posted, the first character inside the span tag is indeed a #

Ohhh, now I see why it matched twice (span tag inside of the p tag). Dumb of me!
Thanks for the tip, Wolf.

cybmole · 01-11-2011, 04:55 PM

Quote:

Originally Posted by ldolse

Preprocess will go through your document and look for common chapter headings using a heuristic method - it should be able to mark up simple numeric headings like the ones you're listing in your code. If your doc already uses h1, h2, h3 tags, etc. then the heuristic processor disables itself - you just need to look at your code and write the correct xpath.

If the preprocess stage finds a chapter header during its search it wraps the headings in <h2> tags. It wraps subtitles if they exist in <h3> tags.

....

yes, that often works but not always - it's either because a book has a chapters-within-parts/sections structure or because the book is littered with span tags in the same html line that contains the chapter numbers - i am not yet sure which.

reading your explanation again, maybe the logic engine sees SOME h2 tags - say on the section headers - & disables itself before the chapter numbers are processed?

PS thanks for explaining how the preprocess & xpath steps interact.

ldolse · 01-11-2011, 05:44 PM

Quote:

Originally Posted by cybmole

yes, that often works but not always - it's either because a book has a chapters-within-parts/sections structure or because the book is littered with span tags in the same html line that contains the chapter numbers - i am not yet sure which.

reading your explanation again, maybe the logic engine sees SOME h2 tags - say on the section headers - & disables itself before the chapter numbers are processed?

PS thanks for explaining how the preprocess & xpath steps interact.

It also does a check of the overall length of the book - there needs to be a certain number of existing headers based on the length of the book before it will disable itself.

If you want, open a bug with your book that's not working - I can see if I can improve the function, but I can't guarantee anything - there is an extremely wide range of html out there, and some cases can't be easily handled in a general function.

cybmole · 01-12-2011, 02:14 AM

Quote:

Originally Posted by ldolse

It also does a check of the overall length of the book - there needs to be a certain number of existing headers based on the length of the book before it will disable itself.

If you want, open a bug with your book that's not working - I can see if I can improve the function, but I can't guarantee anything - there is an extremely wide range of html out there, and some cases can't be easily handled in a general function.

thanks, I'm getting a better understanding now, & learning how to fix up problem cases in sigil, so I won't open a bug.



