Interesting behavior of Structure Detection PDF to MOBI

tleon · 04-29-2011, 03:54 PM

Hi,

I want to convert a PDF to MOBI with TOC detection using the structure detection XPath feature.
I tried two approaches with different and baffling results.

Try 1:
I convert the PDF to MOBI using the following XPath expression for structure detection:
//*[((name()='span' or name()='h2') and re:test(.,'PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+', 'i'))]

The resulting MOBI has a TOC that contains all chapters starting with the word "chapter", but the PRELUDE and ACKNOWLEDGMENTS sections are not detected.
By the way, the reason for the 'span' is that PRELUDE and ACKNOWLEDGMENTS are enclosed in such a tag.
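
[Editor's sketch] To check what an expression like the one above can match, it may help to emulate it outside calibre. The snippet below is only an approximation using the Python standard library: it walks every 'span' and 'h2' element and applies the same case-insensitive regular expression to the element's text, which is roughly what re:test(., ...) does. The sample markup and the function name matching_headings are invented for illustration; the real intermediate HTML will look different and may use namespaced tags.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample markup, NOT taken from a real conversion.
SAMPLE = """<body>
  <p><span>PRELUDE</span></p>
  <h2>Chapter 1</h2>
  <p>Some body text.</p>
  <p><span>ACKNOWLEDGMENTS</span></p>
</body>"""

# Same alternatives as in the XPath expression; 'i' flag -> re.IGNORECASE.
PATTERN = re.compile(r"PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+", re.IGNORECASE)


def matching_headings(markup):
    """Return (tag, text) for each span/h2 whose text matches PATTERN."""
    root = ET.fromstring(markup)
    hits = []
    for el in root.iter():  # document order
        if el.tag in ("span", "h2"):
            text = "".join(el.itertext())  # string value of the element
            if PATTERN.search(text):
                hits.append((el.tag, text.strip()))
    return hits


print(matching_headings(SAMPLE))
# [('span', 'PRELUDE'), ('h2', 'Chapter 1'), ('span', 'ACKNOWLEDGMENTS')]
```

If a heading sits in a different tag in the file that structure detection actually sees (for example a 'b' instead of a 'span'), this check returns nothing for it, which would be consistent with the behavior described in Try 1.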

Try 2:
If I convert the PDF to MOBI with an empty XPath expression, no TOC gets generated. As expected ;O)

If I then convert the generated MOBI to MOBI with the same XPath expression as shown above, the TOC contains the missing section headings as well.

Why does this not work in the first step directly? Any ideas?

The other thing I do not understand is the following.
If I do a MOBI to MOBI conversion based on the MOBI generated in Try 1, the TOC stays unchanged, so PRELUDE and ACKNOWLEDGMENTS are not added.
Doesn't Calibre rebuild the TOC from scratch? Or does it decide to leave the TOC unchanged if there is one already?

During the process of "debugging" I came across the explanation of the four conversion stages in the tutorial.
Does the structure detection take the result of the input, parsed, or processed stage as input?
Which of the generated HTML files do I have to analyze to get my XPath expression right?

And at which stage do the other processing steps like heuristics, table of contents, search & replace, ... kick in?

Regards
Thomas

user_none · 04-29-2011, 04:22 PM

The process is: Input -> OEB -> Output. So every time you convert, you are going from the input format to XHTML to the output format. Different formats support different things. For instance, an h1 tag might be changed to a bold text paragraph because that's the closest the output format supports to an h1. The intermediate OEB is always going to be different when converting the same book between formats. That is why the XPath matches for one input format but not the other.

Off the top of my head, I don't remember which stage of the debug output each of the things you listed runs in. If no one has answered by the time I get home from work, I will look and post the answer.
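
[Editor's sketch] The point about the intermediate markup differing per input format can be illustrated with two invented snippets. Both FROM_PDF and FROM_MOBI below are hypothetical and are not what calibre's input plugins actually emit; they only assume that the same heading ends up in a different tag depending on the source format. The helper name headings is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate markup for the SAME book from two input formats.
FROM_PDF = "<body><p><b>PRELUDE</b></p><h2>Chapter 1</h2></body>"
FROM_MOBI = "<body><p><span>PRELUDE</span></p><h2>Chapter 1</h2></body>"


def headings(markup, tags=("span", "h2")):
    """Collect the text of every element whose tag name is in `tags`."""
    root = ET.fromstring(markup)
    return ["".join(el.itertext()) for el in root.iter() if el.tag in tags]


print(headings(FROM_PDF))   # ['Chapter 1']
print(headings(FROM_MOBI))  # ['PRELUDE', 'Chapter 1']
```

The tag filter is identical in both calls; only the markup differs, so an expression tuned against one intermediate form can miss headings in the other.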

tleon · 04-29-2011, 07:46 PM

Thanks, I would really appreciate that.

Thomas

user_none · 04-29-2011, 07:49 PM

I knew it was somewhere. In the README.txt in the debug directory:

This debug directory contains snapshots of the e-book as it passes through the
various stages of conversion. The stages are:

1. input - This is the result of running the input plugin on the source
file. Use this directory to debug the input plugin.

2. parsed - This is the result of preprocessing and parsing the output of
the input plugin. Note that for some input plugins this will be identical to
the input sub-directory. Use this directory to debug structure detection,
etc.

3. structure - This corresponds to the stage in the pipeline when structure
detection has run, but before the CSS is flattened. Use this directory to
debug the CSS flattening, font size conversion, etc.

4. processed - This corresponds to the e-book as it is passed to the output
plugin. Use this directory to debug the output plugin.
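
[Editor's sketch] To see which HTML files exist at each of the four stages named in the README, a small listing script can help. The stage directory names come from the README text above; the directory name "debug", the function name list_stage_files, and the set of file extensions are assumptions made for this example.

```python
from pathlib import Path

# Stage sub-directory names as listed in the README quoted above.
STAGES = ("input", "parsed", "structure", "processed")
HTML_SUFFIXES = {".html", ".htm", ".xhtml"}  # assumed extensions


def list_stage_files(debug_dir):
    """Map each stage name to the sorted HTML files found below it."""
    result = {}
    for stage in STAGES:
        stage_dir = Path(debug_dir) / stage
        files = []
        if stage_dir.is_dir():
            files = sorted(
                p.relative_to(stage_dir).as_posix()
                for p in stage_dir.rglob("*")
                if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
            )
        result[stage] = files  # empty list if the stage folder is missing
    return result


if __name__ == "__main__":
    # "debug" is a placeholder path; point it at your own debug directory.
    for stage, files in list_stage_files("debug").items():
        print(stage, files)
```

Going by the README, the files under "parsed" would be the ones to test an XPath expression against, since that directory is described as the one to use for debugging structure detection.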

ldolse · 04-29-2011, 09:43 PM

Generally speaking, that readme is not quite correct: a lot of preprocessing happens at the input stage, and Heuristics has normally already been executed on that output. However, for PDF and MOBI there is also an earlier debug file in the input directory that shows the actual output of the input plugin.

Heuristics does have a small list of words, and the word 'Prelude' isn't among them, though it could still get caught on one of the other heuristics patterns, like all uppercase letters, etc. The fact that the second conversion is getting it seems to indicate it's getting picked up at some point.

You could also try simplifying the xpath to just use 'h2' - just click the magic wand next to the xpath and type h2 in the first box.

tleon · 05-01-2011, 07:20 AM

Hi,

I found the readme (the text is similar to the tutorial) but it is not quite clear to me.
Does it mean that structure detection takes the files in the input folder as input, and that the files in the parsed folder are the output of structure detection? Like this:

input folder files -- in --> structure detection -- out --> parsed folder files

If this is the case, the next question is: does structure detection work on the debug-raw file or on the other one, whose file name matches the converted book?

As for using only h2, I already tried that.
If I convert from PDF to MOBI looking only for h2, the second step (MOBI -> MOBI, with PRELUDE included in the expression) still does not match PRELUDE. So something seems to tell the structure detection to leave an already existing TOC unchanged.

Regards
Thomas

ldolse · 05-01-2011, 09:39 AM

Quote:

Originally Posted by tleon

Hi,

I found the readme (the text is similar to the tutorial) but it is not quite clear to me.
Does it mean that structure detection takes the files in the input folder as input, and that the files in the parsed folder are the output of structure detection? Like this:

input folder files -- in --> structure detection -- out --> parsed folder files

If this is the case, the next question is: does structure detection work on the debug-raw file or on the other one, whose file name matches the converted book?

As for using only h2, I already tried that.
If I convert from PDF to MOBI looking only for h2, the second step (MOBI -> MOBI, with PRELUDE included in the expression) still does not match PRELUDE. So something seems to tell the structure detection to leave an already existing TOC unchanged.

Regards
Thomas

If you open a bug with the PDF attached, I'll be happy to look at it and tweak heuristics. But aside from adding 'prelude' to the existing list of hard-coded words in heuristics, there's not really much to be done without a test case.

tleon · 05-04-2011, 03:36 PM

Where should I send the input file?
Posting it here is not a good idea, I guess.

Manichean · 05-04-2011, 05:29 PM

Use the Calibre bug tracker.
