04-29-2011, 03:54 PM | #1 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2011
Device: Kindle
|
Interesting behavior of Structure Detection PDF to MOBI
Hi,
I want to convert a PDF to MOBI with TOC detection, using the XPath feature of structure detection. I tried two approaches, with different and baffling results.
Try 1: I convert the PDF to MOBI using the following XPath expression for structure detection:
//*[((name()='span' or name()='h2') and re:test(.,'PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+', 'i'))]
The resulting MOBI has a TOC that contains all chapters starting with the word "chapter", but the PRELUDE and ACKNOWLEDGMENTS sections are not detected. By the way, the reason for the 'span' is that PRELUDE and ACKNOWLEDGMENTS are wrapped in such a tag.
Try 2: If I convert the PDF to MOBI with an empty XPath expression, no TOC gets generated. As expected ;O) If I then convert the generated MOBI to MOBI with the same XPath expression as shown above, the TOC does contain the missing section headings as well. Why does this not work directly in the first step? Any ideas?
The other thing I do not understand is the following. If I do a MOBI to MOBI conversion based on the MOBI generated in Try 1, the TOC stays unchanged, so PRELUDE and ACKNOWLEDGMENTS are not added. Doesn't Calibre build the TOC from scratch? Or does it decide to leave the TOC unchanged if there already is one?
While "debugging" I stumbled over the explanation of the four conversion stages in the tutorial. Does structure detection take the result of the input, parsed or processed stage as its input? Which of the generated HTML files do I have to analyze to get my XPath expression right? And at which stage do the other processing steps like heuristics, table of contents, search & replace, ... kick in?
Regards Thomas |
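For readers who want to check what an expression like the one above would select, here is a minimal sketch that approximates it with the Python standard library only. It is not calibre's own matching code: the `re:test(., ..., 'i')` call is emulated with `re.search` on each element's text, and the sample markup, `matching_headings` and `local_name` are hypothetical names made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Pattern copied from the post; the 'i' flag of re:test maps to re.IGNORECASE.
HEADING_RE = re.compile(r"PRELUDE|chapter|ACKNOWLEDGMENTS|Chaos( \.)+", re.IGNORECASE)
TAGS = {"span", "h2"}

# Hypothetical snapshot markup, invented for this example.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><span>PRELUDE</span></p>
<h2>Chapter 1</h2>
<p>Plain text mentioning nothing relevant.</p>
<p><b>ACKNOWLEDGMENTS</b></p>
</body></html>"""


def local_name(tag):
    """Drop the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def matching_headings(xhtml):
    """Return (tag, text) for every span/h2 whose full text matches the pattern."""
    root = ET.fromstring(xhtml)
    hits = []
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip anything that is not a regular element
        name = local_name(el.tag)
        if name in TAGS:
            text = "".join(el.itertext())  # like '.' in the XPath: the whole string value
            if HEADING_RE.search(text):
                hits.append((name, text.strip()))
    return hits


for tag, text in matching_headings(SAMPLE):
    print(tag, text)
```

In this made-up sample the bold ACKNOWLEDGMENTS paragraph is skipped because it is neither a span nor an h2. Running the same check against the real debug HTML should show which tags actually carry the headings. Note also that 'chapter' matches anywhere in the text, so any span containing that word in running prose would be picked up too.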
04-29-2011, 04:22 PM | #2 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
The process is: Input -> OEB -> Output. So every time you convert, you are going from the input format to XHTML to the output format. Different formats support different things. For instance, an h1 tag might be changed to a bold text paragraph because that's the closest thing to an h1 that the output format supports. The intermediate OEB is always going to differ when the same book is converted from different formats. That is why the XPath matches for one input format but not the other.
Off the top of my head, I don't remember at which stage of the debug output the different things you listed run. If no one has answered by the time I get home from work, I will look and post the answer. |
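To illustrate the point about the intermediate XHTML differing per input format, here is a small sketch using only the Python standard library. Both snippets are invented markup, not real calibre output, and `h2_titles` is a hypothetical helper; the only claim is that a tag-based lookup finds a heading in one representation and misses it in the other.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical intermediate markup for the same heading, as it might look
# after two different input plugins (assumption, made up for illustration).
FROM_PDF = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p><b>PRELUDE</b></p></body></html>"
)
FROM_MOBI = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h2>PRELUDE</h2></body></html>"
)


def h2_titles(xhtml):
    """Return the text of every h2 element in an XHTML string."""
    root = ET.fromstring(xhtml)
    return ["".join(h.itertext()).strip() for h in root.iter(XHTML_NS + "h2")]


print("from PDF: ", h2_titles(FROM_PDF))
print("from MOBI:", h2_titles(FROM_MOBI))
```

The text is identical in both cases; only the wrapping tag differs, which is enough for an expression keyed on tag names to succeed on one conversion path and fail on the other.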
04-29-2011, 07:46 PM | #3 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2011
Device: Kindle
|
Thanks I would really appreciate that.
Thomas |
04-29-2011, 07:49 PM | #4 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
I knew it was somewhere. In the README.txt in the debug directory:
This debug directory contains snapshots of the e-book as it passes through the various stages of conversion. The stages are:
1. input - This is the result of running the input plugin on the source file. Use this directory to debug the input plugin.
2. parsed - This is the result of preprocessing and parsing the output of the input plugin. Note that for some input plugins this will be identical to the input sub-directory. Use this directory to debug structure detection, etc.
3. structure - This corresponds to the stage in the pipeline when structure detection has run, but before the CSS is flattened. Use this directory to debug the CSS flattening, font size conversion, etc.
4. processed - This corresponds to the e-book as it is passed to the output plugin. Use this directory to debug the output plugin. |
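A debug directory like the one described in this README can be produced with the --debug-pipeline option of ebook-convert. The sketch below lists the HTML files found in each stage sub-directory so it is easier to see which snapshot to open. The four stage names come from the README; the file suffixes and the `list_stage_files` helper are assumptions made for illustration, not part of calibre.

```python
from pathlib import Path

# Stage sub-directory names as given in the README quoted above.
STAGES = ("input", "parsed", "structure", "processed")


def list_stage_files(debug_dir, suffixes=(".html", ".xhtml", ".htm")):
    """Map each existing stage directory to its HTML files (relative paths).

    The suffix list is an assumption; adjust it to what the directory contains.
    """
    debug_dir = Path(debug_dir)
    found = {}
    for stage in STAGES:
        stage_dir = debug_dir / stage
        if not stage_dir.is_dir():
            continue  # not every conversion necessarily writes every stage
        found[stage] = sorted(
            p.relative_to(stage_dir).as_posix()
            for p in stage_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in suffixes
        )
    return found


if __name__ == "__main__":
    import sys

    # Usage: python list_stages.py /path/to/debug-directory
    for stage, files in list_stage_files(sys.argv[1] if len(sys.argv) > 1 else ".").items():
        print(stage)
        for name in files:
            print("   ", name)
```

Going by the README, the files under parsed are the ones to inspect when an XPath expression for structure detection does not match as expected.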
04-29-2011, 09:43 PM | #5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Generally speaking, that readme is not quite correct: a lot of preprocessing happens at the input stage, and Heuristics has normally already been run on that output. However, for PDF and MOBI there is also an earlier debug file in the input directory that shows the actual output of the input plugin.
Heuristics does have a small list of words, and the word 'Prelude' isn't among them, though it could still get caught on one of the other heuristics patterns, like all uppercase letters, etc. The fact that the second conversion is getting it seems to indicate it's getting picked up at some point. You could also try simplifying the xpath to just use 'h2' - just click the magic wand next to the xpath and type h2 in the first box. |
05-01-2011, 07:20 AM | #6 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2011
Device: Kindle
|
Hi,
I found the readme (the text is similar to the tutorial), but it is not quite clear to me. Does it mean that structure detection takes the files in the input folder as its input, and that the files in the parsed folder are the output of structure detection? Like this:
input folder files -- in --> structure detection -- out --> parsed folder files
If this is the case, the next question is: does structure detection work on the debug-raw file, or on the other one whose file name matches the converted book?
As for using only h2, that is something I already tried. If I convert from PDF to MOBI looking only for h2, the second step (mobi->mobi, with PRELUDE included) still does not match PRELUDE. So something seems to tell structure detection to leave an already existing TOC unchanged.
Regards Thomas |
05-01-2011, 09:39 AM | #7 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
|
|
05-04-2011, 03:36 PM | #8 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2011
Device: Kindle
|
Where should I send the input file to?
Posting it here is not a good idea, I guess. |
05-04-2011, 05:29 PM | #9 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Use the Calibre bug tracker.
|
|