Old 04-29-2011, 03:54 PM   #1
tleon
Member
Posts: 16
Karma: 10
Join Date: Apr 2011
Device: Kindle
Interesting behavior of Structure Detection PDF to MOBI

Hi,

I want to convert a PDF to MOBI with TOC detection using the structure detection XPath feature.
I tried two approaches with different and baffling results.

Try 1:
I convert the PDF to MOBI using the following XPath expression for structure detection:
//*[((name()='span' or name()='h2') and re:test(.,'PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+', 'i'))]

The resulting MOBI has a TOC that contains all chapters starting with the word chapter. But the PRELUDE and ACKNOWLEDGMENTS sections are not detected.
By the way, the 'span' is in there because PRELUDE and ACKNOWLEDGMENTS are wrapped in such a tag.
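
One way to check, outside of calibre, which elements an expression like this would select is sketched below. It is only an emulation under stated assumptions: it uses the Python standard library, which has no `re:test`, so the tag test and the case-insensitive regex are applied by hand to each element's full text. The sample markup and the helper name `find_toc_candidates` are made up for illustration and are not calibre output or calibre code.

```python
import re
import xml.etree.ElementTree as ET

# Same alternation as in the XPath expression above; the 'i' flag becomes re.IGNORECASE.
PATTERN = re.compile(r"PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+", re.IGNORECASE)

# Hypothetical sample of intermediate markup (no XML namespace, for brevity).
SAMPLE = """<html><body>
<p><span>PRELUDE</span></p>
<h2>Chapter 1</h2>
<p>Some body text.</p>
<h2>Chapter 2</h2>
<p><span>ACKNOWLEDGMENTS</span></p>
</body></html>"""


def find_toc_candidates(xhtml):
    """Return (tag, text) for every span/h2 whose text matches PATTERN."""
    root = ET.fromstring(xhtml)
    hits = []
    for el in root.iter():
        if el.tag not in ("span", "h2"):
            continue
        # re:test(., ...) tests the element's string value, i.e. all nested text.
        text = "".join(el.itertext())
        if PATTERN.search(text):
            hits.append((el.tag, text.strip()))
    return hits


if __name__ == "__main__":
    for tag, text in find_toc_candidates(SAMPLE):
        print(tag, "->", text)
```

If an element shows up in a check like this but is missing from the TOC, the markup that structure detection actually saw probably differs from the file being tested.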


Try 2:
If I do the conversion from PDF to MOBI with an empty XPath expression, no TOC gets generated. As expected ;O)

If I now convert the generated MOBI to MOBI with the same XPath expression as shown above, the TOC does contain the missing section headings as well.

Why does this not work in the first step directly? Any ideas?

The other thing I do not understand is the following.
If I do a MOBI to MOBI conversion based on the MOBI generated in Try 1, the TOC stays unchanged: PRELUDE and ACKNOWLEDGMENTS are not added.
Doesn't Calibre build the TOC from scratch? Or does it decide to leave the TOC unchanged if there is one already?


During the process of "debugging" I stumbled over the explanation of the four conversion stages in the tutorial.
Does the structure detection take the result of the input, parsed or processed stage as input?
Which of the generated html files do I have to analyze to get my XPath expression right?

And at which stage do the other processing steps like heuristics, table of contents, search & replace, ... kick in?

Regards
Thomas
Old 04-29-2011, 04:22 PM   #2
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
The process is: Input -> OEB -> Output. So every time you convert, you are going from the input format to XHTML to the output format. Different formats support different things. For instance, an h1 tag might be changed to a bold text paragraph because that's the closest thing to an h1 the output format supports. The intermediate OEB is always going to be different when the same book is converted from different input formats. That is why the XPath matches for one input format but not the other.
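
To make that concrete, here is a minimal sketch in standard-library Python. The two XHTML snippets are invented stand-ins for the intermediate markup the same heading might end up as after two different input formats; neither is real calibre output, and `matched_headings` is a hypothetical helper, not part of calibre. The same tag-plus-regex test finds the heading in one snippet and misses it in the other.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical intermediate XHTML for the same heading, as two different
# input plugins might produce it. Neither snippet is real calibre output.
FROM_INPUT_A = "<body><p><b>PRELUDE</b></p><h2>Chapter 1</h2></body>"
FROM_INPUT_B = "<body><p><span>PRELUDE</span></p><h2>Chapter 1</h2></body>"

HEADING = re.compile(r"PRELUDE|chapter", re.IGNORECASE)


def matched_headings(xhtml, tags=("span", "h2")):
    """Return the text of every element in `tags` whose text matches HEADING."""
    root = ET.fromstring(xhtml)
    return [
        "".join(el.itertext())
        for el in root.iter()
        if el.tag in tags and HEADING.search("".join(el.itertext()))
    ]


if __name__ == "__main__":
    print("input A:", matched_headings(FROM_INPUT_A))
    print("input B:", matched_headings(FROM_INPUT_B))
```

In snippet A the word PRELUDE sits in a `b` element, which the span/h2 test never looks at, so only the chapter heading is found.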

Off the top of my head, I don't remember which stage of the debug output each of the things you listed runs in. If no one has answered by the time I get home from work, I will look and post the answer.
Old 04-29-2011, 07:46 PM   #3
tleon
Thanks, I would really appreciate that.

Thomas
Old 04-29-2011, 07:49 PM   #4
user_none
I knew it was somewhere. In the README.txt in the debug directory:



This debug directory contains snapshots of the e-book as it passes through the
various stages of conversion. The stages are:

1. input - This is the result of running the input plugin on the source
file. Use this directory to debug the input plugin.

2. parsed - This is the result of preprocessing and parsing the output of
the input plugin. Note that for some input plugins this will be identical to
the input sub-directory. Use this directory to debug structure detection,
etc.

3. structure - This corresponds to the stage in the pipeline when structure
detection has run, but before the CSS is flattened. Use this directory to
debug the CSS flattening, font size conversion, etc.

4. processed - This corresponds to the e-book as it is passed to the output
plugin. Use this directory to debug the output plugin.
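
To see which files there are to inspect at each stage, a small helper along these lines may be useful. It is a sketch only: the four sub-directory names are taken from the README text above, while the function name `list_stage_files` and the set of file extensions it looks for are assumptions made for illustration.

```python
from pathlib import Path

# Stage sub-directory names as listed in the README above.
STAGES = ("input", "parsed", "structure", "processed")

# Assumed extensions for the markup files worth inspecting.
MARKUP_SUFFIXES = (".html", ".htm", ".xhtml")


def list_stage_files(debug_dir):
    """Map each stage name to the sorted HTML/XHTML files found under it."""
    result = {}
    for stage in STAGES:
        stage_dir = Path(debug_dir) / stage
        if not stage_dir.is_dir():
            result[stage] = []
            continue
        result[stage] = sorted(
            p.relative_to(stage_dir).as_posix()
            for p in stage_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in MARKUP_SUFFIXES
        )
    return result


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for stage, files in list_stage_files(target).items():
        print(f"{stage}: {len(files)} file(s)")
        for name in files:
            print("   ", name)
```

Going by the README, the files listed under `parsed` are the ones to test an XPath expression against.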
Old 04-29-2011, 09:43 PM   #5
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Generally speaking, that readme is not quite correct: a lot of preprocessing happens in the input stage, and Heuristics has normally already been run on that output. However, for PDF and MOBI there is also an earlier debug file in the input directory that shows the actual output of the input plugin.

Heuristics does have a small list of words, and the word 'Prelude' isn't among them, though it could still get caught on one of the other heuristics patterns, like all uppercase letters, etc. The fact that the second conversion is getting it seems to indicate it's getting picked up at some point.
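
As a rough illustration of how a keyword list and an all-uppercase rule can combine, here is a hypothetical sketch in Python. It is not calibre's heuristics code: the keyword list, the length limit and the function name `looks_like_heading` are all invented for the example, and 'prelude' is left out of the list on purpose to mirror the situation described above.

```python
import re

# HYPOTHETICAL keyword list; calibre's real list is not shown in this thread.
# 'prelude' is deliberately absent.
KEYWORDS = ("chapter", "prologue", "epilogue")


def looks_like_heading(line):
    """Guess whether a line is a heading: keyword first, or all uppercase."""
    text = line.strip()
    if not text or len(text) > 60:  # assumed limit: headings are short
        return False
    first_word = re.split(r"\W+", text.lower(), maxsplit=1)[0]
    if first_word in KEYWORDS:
        return True
    # Fallback pattern: every letter in the line is uppercase.
    letters = [c for c in text if c.isalpha()]
    return len(letters) >= 3 and all(c.isupper() for c in letters)


if __name__ == "__main__":
    for sample in ("Chapter 1", "PRELUDE", "Prelude", "Some body text."):
        print(repr(sample), looks_like_heading(sample))
```

Under these assumed rules, "PRELUDE" passes only because of the uppercase fallback, while "Prelude" in mixed case would be missed.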

You could also try simplifying the xpath to just use 'h2' - just click the magic wand next to the xpath and type h2 in the first box.
Old 05-01-2011, 07:20 AM   #6
tleon
Hi,

I found the readme (the text is similar to the tutorial), but it is not quite clear to me.
Does it mean that structure detection takes the files in the input folder as input, and that the files in the parsed folder are the output of structure detection? Like this:

input folder files -- in --> structure detection -- out --> parsed folder files

If this is the case, the next question is: does structure detection work on the debug-raw file or on the other one, whose file name matches the converted book?

Concerning using only h2: that is something I already tried.
If I convert from PDF to MOBI looking only for h2, the second step (MOBI -> MOBI, prelude included) still does not match PRELUDE. So something seems to tell structure detection to leave an already existing TOC unchanged.

Regards
Thomas
Old 05-01-2011, 09:39 AM   #7
ldolse
If you open a bug with the PDF attached, I'll be happy to look at it and tweak heuristics, but aside from adding 'prelude' to the existing list of hard-coded words in heuristics, there's not really much to be done without a test case.
Old 05-04-2011, 03:36 PM   #8
tleon
Where should I send the input file?
Posting it here is not a good idea, I guess.
Old 05-04-2011, 05:29 PM   #9
Manichean
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Use the Calibre bug tracker.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.