Old 01-11-2011, 12:38 PM   #16
kovidgoyal
creator of calibre
Posts: 45,393
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I'm going to need the file to comment further.
Old 01-11-2011, 01:00 PM   #17
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
twice is not so bad, I had a case (in another thread) that was detecting everything x 4 !

xpath is a pig to configure. I spent ages earlier today telling structure detection to look for the default h1 h2 tags + part||section etc. OR for stand-alone (chapter) numbers in bold, & no matter how I tried I got "invalid" from the syntax checker. The wizard will not build "OR" structures, so I was trying to adapt the default and change the OR ...class = chapter... construct to [class = bold and text is of form \d* ] but no joy.

could someone please tell me if that is do-able.

I have books where chapters are numbered 1, 2 etc. but do not have h1 or h2 tags. if those chapters are within sections or parts then structure detection (assisted by preprocess) seems unable/unwilling to build a full TOC, it just does the easy stuff and builds a TOC of parts / sections

Last edited by cybmole; 01-11-2011 at 01:12 PM.
Old 01-11-2011, 01:18 PM   #18
Wolfgan
Avid reader
Posts: 19
Karma: 10
Join Date: Feb 2009
Location: Argentina
Device: Kindle 3 wifi
Quote:
Originally Posted by cybmole View Post
twice is not so bad, I had a case (in another thread) that was detecting everything x 4 !

xpath is a pig to configure. I spent ages earlier today telling structure detection to look for the default h1 h2 tags + part||section etc. OR for stand-alone (chapter) numbers in bold, & no matter how I tried I got "invalid" from the syntax checker. The wizard will not build "OR" structures, so I was trying to adapt the default and change the OR ...class = chapter... construct to [class = bold and text is of form \d* ] but no joy.

could someone please tell me if that is do-able.

I have books where chapters are numbered 1, 2 etc. but do not have h1 or h2 tags. if those chapters are within sections or parts then structure detection (assisted by preprocess) seems unable/unwilling to build a full TOC, it just does the easy stuff and builds a TOC of parts / sections
AFAIK the power of XPath comes from how good your regular expression is. Check this thread for ideas on how to modify your regex.

Use the debug feature of the conversion process, and check the operation log window to tweak your regex.
Good luck! Wolf
Old 01-11-2011, 01:36 PM   #19
cybmole
Wizard
I have studied the example in that thread, plus other regex syntax stuff. I need to see a model answer for this particular challenge please. I suspect the xpath thingie in calibre may be limited in how much and/or how complex an expression it can accept.

I would also, for the sake of clarity, appreciate nailing down, for the three cases below, whether calibre applies a) the structure detection line b) the preprocess option

ie complete the table with Y or N as needed?

                   apply preprocess   apply structure detect
1. epub to epub    N ?                ???
2. epub to mobi
3. mobi to epub    y ?
Old 01-11-2011, 01:43 PM   #20
kovidgoyal
creator of calibre
your xpath expression is matching both the p and the span tags. Use

Code:
//h:p

instead of //*
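
To see why //* can report the same heading twice, here is a minimal sketch using only the Python standard library. It is not calibre's code: the sample markup and the helper name matching_tags are made up for illustration, and the regex test is emulated with re because the standard library has no re:test. The point is that an element's text value includes the text of its children, so a paragraph and the span inside it can both pass the same test.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: a chapter number wrapped in a span inside a paragraph.
SAMPLE = (
    f'<html xmlns="{XHTML}"><body>'
    '<p class="heading"><span class="num">7</span></p>'
    '<p class="text">Body text.</p>'
    '</body></html>'
)

def matching_tags(root, pattern, tag=None):
    """Return local tag names of elements whose full text matches pattern.

    tag=None visits every element (like //*); tag='p' visits only
    XHTML paragraphs (like //h:p).
    """
    target = f"{{{XHTML}}}{tag}" if tag else None
    hits = []
    for el in root.iter(target):
        # Joined descendant text, i.e. what '.' means inside a regex test.
        text = "".join(el.itertext())
        if re.search(pattern, text):
            hits.append(el.tag.split("}")[1])
    return hits

root = ET.fromstring(SAMPLE)
any_element = matching_tags(root, r"^\s*\d+\s*$")             # both p and span
paragraphs_only = matching_tags(root, r"^\s*\d+\s*$", tag="p")  # p only
```

Restricting the search to one element type removes the duplicate without touching the regex.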

Last edited by kovidgoyal; 01-11-2011 at 01:54 PM.
Old 01-11-2011, 01:52 PM   #21
cybmole
Wizard
update:
here's where I am stuck on syntax.
I want to pick out lines like this which are chapter starts, as well as still picking out sections and parts.
Code:
<p class="calibre8"><span class="calibre3 bold">7</span></p>
So I take the default structure detection expression & change the ending from class = chapter to class = bold. That is still valid & I can test it, but it finds far too much stuff, as expected.
So I try to amend the ending to test for both class = bold AND the value being digits, e.g.
Code:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|part\s+', 'i')) or (@class = 'bold' and re:test (\d*))]
I try lots of permutations of how many brackets to use & where to place them, but I cannot get past the syntax checker.
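
A few things in that expression look like likely culprits. The second re:test has no node argument and no quoted pattern, unlike the first one. @class = 'bold' compares the whole attribute, so it would not match class="calibre3 bold". And the bold class sits on the span, not on the p. An expression along the lines of //h:p[h:span[contains(@class, 'bold')] and re:test(., '^\s*\d+\s*$')] might be closer, but that is an untested guess, not a confirmed answer. The sketch below only emulates the intended logic with the Python standard library; the function names and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML}

def is_heading(el):
    """h1/h2 whose text contains 'chapter' or 'part ', as in the first half."""
    name = el.tag.split("}")[-1]
    text = "".join(el.itertext())
    return name in ("h1", "h2") and re.search(r"chapter|part\s+", text, re.I) is not None

def is_bold_number(el):
    """A <p> with a child span carrying the 'bold' class, whose text is only digits."""
    if el.tag != f"{{{XHTML}}}p":
        return False
    # Token check on the class list; stricter than a substring test.
    has_bold_span = any(
        "bold" in span.get("class", "").split()
        for span in el.findall("h:span", NS)
    )
    text = "".join(el.itertext())
    return has_bold_span and re.fullmatch(r"\s*\d+\s*", text) is not None

def chapter_starts(xhtml):
    """Return the text of every element that either test accepts, in document order."""
    root = ET.fromstring(xhtml)
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if is_heading(el) or is_bold_number(el)
    ]

# Hypothetical sample built around the line quoted above.
SAMPLE = (
    f'<html xmlns="{XHTML}"><body>'
    '<h2>Part One</h2>'
    '<p class="calibre8"><span class="calibre3 bold">7</span></p>'
    '<p class="calibre8"><span class="calibre3 bold">Not a chapter</span></p>'
    '<p class="calibre8">1999</p>'
    '</body></html>'
)
found = chapter_starts(SAMPLE)
```

Anchoring the digit test to the whole text is what keeps bold phrases and stray numbers out of the result.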

Last edited by cybmole; 01-11-2011 at 01:55 PM.
Old 01-11-2011, 02:03 PM   #22
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by cybmole View Post
I have studied the example in that thread, plus other regex syntax stuff. I need to see a model answer for this particular challenge please. I suspect the xpath thingie in calibre may be limited in how much and/or how complex an expression it can accept.

I would also, for the sake of clarity, appreciate nailing down, for the three cases below, whether calibre applies a) the structure detection line b) the preprocess option

ie complete the table with Y or N as needed?

                   apply preprocess   apply structure detect
1. epub to epub    N ?                ???
2. epub to mobi
3. mobi to epub    y ?
Preprocess happens before the xpath chapter detection. It never happens on an epub to anything conversion, as the conversion process bypasses that entire stage of the conversion pipeline for epub. All other formats can be preprocessed - IIRC mobi to epub at the moment doesn't go through the full preprocessing logic, though that could be changed if you are seeing a lot of badly formatted mobi files.

Preprocess will go through your document and look for common chapter headings using a heuristic method - it should be able to mark up simple numeric headings like the ones you're listing in your code. If your doc already uses h1, h2, h3 tags, etc., then the heuristic processor disables itself - you just need to look at your code and write the correct xpath.

If the preprocess stage finds a chapter header during its search it wraps the headings in <h2> tags. It wraps subtitles if they exist in <h3> tags.

If you have preprocess enabled and it's successfully detecting/marking up your chapters then you need to have the xpath look for <h2> tags - I often use the default XPATH and just change the regex to .*.

The xpath is processed later in the conversion processing of the document - I believe at the beginning of the output stage.

Also if you're having trouble getting the xpath right, but preprocess was successful, then Sigil will automatically create a TOC you can edit, as Sigil exclusively builds the TOC based off of headers like h1, h2, h3, h4 tags....
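
As a rough picture of how the two stages fit together, here is a toy sketch in Python. It is not calibre's preprocess code: the regexes, the function names and the "already has headings" check are simplified assumptions made up for illustration. The first stage wraps likely chapter numbers in h2 tags, and the later stage just collects every h2, much like pointing the xpath at h2 with the regex set to .*.

```python
import re

# Hypothetical heuristic: a paragraph whose only text is a number or "Chapter N",
# optionally wrapped in a single span.
HEADING_RE = re.compile(
    r"<p[^>]*>\s*(?:<span[^>]*>)?\s*((?:chapter\s+)?\d+)\s*(?:</span>)?\s*</p>",
    re.IGNORECASE,
)

def preprocess(html):
    """Wrap likely chapter headings in <h2>, unless headings already exist."""
    if re.search(r"<h[1-3][\s>]", html, re.IGNORECASE):
        return html  # simplified stand-in for the heuristics disabling themselves
    return HEADING_RE.sub(r"<h2>\1</h2>", html)

def detect_chapters(html):
    """Later stage: return the text of every <h2>, whatever it contains."""
    return re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.IGNORECASE | re.DOTALL)

raw = '<p class="calibre8"><span class="calibre3 bold">7</span></p><p>Some text.</p>'
marked = preprocess(raw)
chapters = detect_chapters(marked)

# A document that already has a heading passes through unchanged.
already_tagged = '<h1>Title</h1><p>7</p>'
untouched = preprocess(already_tagged)
```

Once the first stage has done the marking, the second stage needs no knowledge of how the original book styled its chapter numbers.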

Last edited by ldolse; 01-11-2011 at 02:14 PM.
Old 01-11-2011, 02:10 PM   #23
Wolfgan
Avid reader
Quote:
Originally Posted by kovidgoyal View Post
your xpath expression is matching both the p and the span tags. Use

Code:
//h:p

instead of //*
Thanks a lot, that did the trick. I don't understand why it works as the span tag doesn't start with '#' , but the change certainly worked.
Thanks again, Wolf.
Old 01-11-2011, 02:14 PM   #24
kovidgoyal
creator of calibre
Looking at the code you posted, the first character inside the span tag is indeed a #
Old 01-11-2011, 02:26 PM   #25
Wolfgan
Avid reader
Quote:
Originally Posted by kovidgoyal View Post
Looking at the code you posted, the first character inside the span tag is indeed a #
Ohhh, now I see why it matched twice (span tag inside of the p tag). Dumb of me!
Thanks for the tip, Wolf.
Old 01-11-2011, 04:55 PM   #26
cybmole
Wizard
Quote:
Originally Posted by ldolse View Post
Preprocess will go through your document and look for common chapter headings using a heuristic method - it should be able to mark up simple numeric headings like the ones you're listing in your code. If your doc already uses h1, h2, h3 tags, etc., then the heuristic processor disables itself - you just need to look at your code and write the correct xpath.

If the preprocess stage finds a chapter header during its search it wraps the headings in <h2> tags. It wraps subtitles if they exist in <h3> tags.

....
yes, that often works but not always - it's either because a book has a chapters-within-parts/sections structure or because the book is littered with span tags in the same html line that contains the chapter numbers - I am not yet sure which.

reading your explanation again, maybe the logic engine sees SOME h2 tags -say on the section headers, & disables itself before the chapter numbers are processed ?

PS thanks for explaining how the preprocess & xpath steps interact.

Last edited by cybmole; 01-11-2011 at 04:58 PM.
Old 01-11-2011, 05:44 PM   #27
ldolse
Wizard
Quote:
Originally Posted by cybmole View Post
yes, that often works but not always - it's either because a book has a chapters-within-parts/sections structure or because the book is littered with span tags in the same html line that contains the chapter numbers - I am not yet sure which.

reading your explanation again, maybe the logic engine sees SOME h2 tags -say on the section headers, & disables itself before the chapter numbers are processed ?

PS thanks for explaining how the preprocess & xpath steps interact.
It also checks the overall length of the book - there needs to be a certain number of existing headers, relative to the length of the book, before it will disable itself.

If you want, open a bug with your book that's not working - I can see if I can improve the function, but I can't guarantee anything - there is an extremely wide range of html out there, and some cases can't be easily handled in a general function.
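
To make the length check concrete, here is a toy version in Python. The function name and the words_per_heading threshold are invented for illustration and are not the values calibre uses. It only shows the idea that a long book with a handful of part headers would still count as short on headings, so the heuristics would keep running.

```python
import re

def heuristics_should_run(html, words_per_heading=3000):
    """Return True when the document has too few headings for its length.

    words_per_heading is a made-up threshold for illustration only.
    """
    headings = len(re.findall(r"<h[1-6][\s>]", html, re.IGNORECASE))
    # Strip tags, then count whitespace-separated words.
    words = len(re.sub(r"<[^>]+>", " ", html).split())
    expected = words // words_per_heading
    # Require at least one heading, however short the document is.
    return headings < max(expected, 1)

short_doc = "<h2>One</h2><p>text</p>"
long_book = "<h2>Part One</h2><p>" + "word " * 10000 + "</p>"

run_on_short = heuristics_should_run(short_doc)  # one heading is enough here
run_on_long = heuristics_should_run(long_book)   # one heading is too few here
```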

Last edited by ldolse; 01-11-2011 at 11:21 PM.
Old 01-12-2011, 02:14 AM   #28
cybmole
Wizard
Quote:
Originally Posted by ldolse View Post
It also checks the overall length of the book - there needs to be a certain number of existing headers, relative to the length of the book, before it will disable itself.

If you want, open a bug with your book that's not working - I can see if I can improve the function, but I can't guarantee anything - there is an extremely wide range of html out there, and some cases can't be easily handled in a general function.
thanks, I'm getting a better understanding now, & learning how to fix up problem cases in sigil, so I won't open a bug.