MobileRead Forums - Interesting behavior of Structure Detection PDF to MOBI

tleon · 04-29-2011, 04:54 PM

Hi,

I want to convert a PDF to MOBI and have the TOC detected via the XPath expression in the structure detection settings.
I tried two approaches and got different, baffling results.

Try 1:
I convert the PDF to MOBI using the following XPath expression for structure detection:
//*[((name()='span' or name()='h2') and re:test(.,'PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+', 'i'))]

The resulting MOBI has a TOC that contains all chapters whose headings start with the word "chapter", but the PRELUDE and ACKNOWLEDGMENTS sections are not detected.
By the way, the reason for the 'span' is that the PRELUDE and ACKNOWLEDGMENTS headings are enclosed in span tags.
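
For checking the expression outside of a conversion, here is a minimal sketch of what it is meant to select, using only the Python standard library. It approximates re:test with Python's re module and is not how Calibre itself evaluates the XPath. The sample markup and the helper name find_headings are made up for illustration, and the real intermediate HTML may look different.

```python
import re
import xml.etree.ElementTree as ET

# Same alternatives as in the XPath expression; 'i' flag -> IGNORECASE.
PATTERN = re.compile(r"PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+", re.IGNORECASE)


def find_headings(xhtml):
    """Return (tag, text) for every span or h2 whose text matches PATTERN.

    Hypothetical helper: it mimics
    //*[(name()='span' or name()='h2') and re:test(., PATTERN, 'i')]
    by walking the tree in document order.
    """
    root = ET.fromstring(xhtml)
    hits = []
    for el in root.iter():
        # Drop a namespace prefix such as {http://www.w3.org/1999/xhtml}
        tag = el.tag.rsplit("}", 1)[-1]
        if tag in ("span", "h2"):
            # "." in XPath is the element's full text, including child elements
            text = "".join(el.itertext())
            if PATTERN.search(text):
                hits.append((tag, text.strip()))
    return hits


# Made-up sample, only to show the shape of the headings described above.
SAMPLE = """<html><body>
<p><span>PRELUDE</span></p>
<h2>Chapter 1</h2>
<p>Some body text.</p>
<p><span>ACKNOWLEDGMENTS</span></p>
</body></html>"""

if __name__ == "__main__":
    for tag, text in find_headings(SAMPLE):
        print(tag, text)
```

If the span headings show up here but not in the converted book, the markup seen at conversion time presumably differs from the markup being tested.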

Try 2:
If I convert the PDF to MOBI with an empty XPath expression, no TOC is generated. As expected ;O)

If I then convert the generated MOBI to MOBI with the same XPath expression as shown above, the TOC contains the missing section headings as well.

Why does this not work in the first step directly? Any ideas?

The other thing I do not understand is the following.
If I do a MOBI to MOBI conversion based on the MOBI generated in Try 1, the TOC stays unchanged, so PRELUDE and ACKNOWLEDGMENTS are still not added.
Doesn't Calibre build the TOC from scratch? Or does it leave the TOC unchanged if one already exists?

While "debugging" I stumbled over the explanation of the four conversion stages in the tutorial.
Does structure detection take the result of the input, parsed or processed stage as its input?
Which of the generated HTML files do I have to analyze to get my XPath expression right?

And at which stage do the other processing steps (heuristics, table of contents, search & replace, ...) kick in?

Regards
Thomas
