structure detection - documentation ?

cybmole · 01-05-2011, 06:47 AM

I am told in other threads that structure detection option is ignored for epub source. Is that true for any other source formats or epub only ?

what's the recommended way to force structure detection on epub - is it convert to zip then back again ?

is there / should there be any detailed documentation for the above + of what (processing operations) structure detection actually does. it seems like a tick-it-&-see black box thingie at present ?

Manichean · 01-05-2011, 07:43 AM

What structure detection are you referring to? I've used header/footer removal, which is part of the structure detection conversion settings, successfully on ePub, so I don't think it's ignored. Preprocessing might be, though.
As for documentation, as usual, refer to the manual.

cybmole · 01-05-2011, 08:44 AM

I was in a discussion about TOC & chapter detection. I wanted to automate having chapters flagged with h1 or h2 tags so that they would be added to the TOC.
someone (not you) said that structure detection is not applied if source format = epub because the source is assumed to be "already good". it could be that they meant only the "preprocess to improve structure detection" box
so a rephrase of my Q is: for what source formats is the preprocess... tick box ignored ?

I re-read the manual but it does not define exactly what takes place when the box labelled preprocessing is ticked. the manual gives an overview of what it's intended to do, but I was wanting a programmer's definition of what logic is applied & how.

Manichean · 01-05-2011, 09:00 AM

Ah, yes, I remember that discussion. Try asking for (or waiting for) whoever programmed the preprocessing engine to explain.
As for a workaround, I believe that adding the ePub as a ZIP file and reconverting ought to work.

cybmole · 01-05-2011, 09:14 AM

Quote:

Originally Posted by Manichean

Ah, yes, I remember that discussion. Try asking for (or waiting for) whoever programmed the preprocessing engine to explain.
As for a workaround, I believe that adding the ePub as a ZIP file and reconverting ought to work.

based on a 1 book test - converting from epub to zip, followed by converting from zip to epub, mangles the chapter running order and also inserts some weird prev / next html pages when viewed in sigil.

my input book was perfectly ordered, but my output from the test began at chapter 19 !

maybe copying the epub, renaming it as zip then adding it to the library is safe, but converting from epub to zip seems very flawed, unless I am misunderstanding what the conversion is meant to achieve.

Manichean · 01-05-2011, 09:25 AM

I was thinking about exporting the ePub, renaming it as a ZIP, then reimporting it into the book and converting to ePub. You might want to store the ePub somewhere safe externally, as it is going to be overwritten.

As for the chapter order, there's a FAQ entry relevant to file ordering in multi-HTML books. I should have a small script still floating around for creating the index file discussed in that FAQ entry that I hacked together once for the purpose of converting a HTML reference book, but it only cares about what the first file is and dumps the rest of the files in as it finds them. Still, should that be useful, I can post it here.

cybmole · 01-05-2011, 09:37 AM

seems to me that calibre should not allow conversion options that will lose or destroy chapter ordering - or should at least warn against them

is there any e-book related need for calibre to offer an epub to zip conversion, or should its epub to zip conversion option be disabled, along with any other combinations that do not preserve chapter ordering ?

yes, I'd taken a backup of the epub but others may be caught out.

Manichean · 01-05-2011, 09:54 AM

It is called YAFIYGI, as opposed to WYSIWYG. Deal with it.

ldolse · 01-05-2011, 10:21 AM

Calibre won't insert previous/next navigation - that would have to be built into the original source.

The only structure detection option that doesn't work on epub is the 'preprocess html to possibly improve structure detection'.

This is due to the way the conversion pipeline works and that Calibre treats OEB as a sort of 'reference' format. All filetypes are converted from their native format to OEB internally, and then re-converted from OEB to whatever the desired output format is. Preprocessing occurs BEFORE conversion to OEB.

Epub IS OEB, just in a zipped container. Therefore Calibre doesn't bother converting from epub to OEB, it just unzips it and goes from there. Because of this epub bypasses the preprocess stage of the conversion pipeline.

If you want the preprocess option to work on an epub just rename the epub to .zip instead of .epub and add the zip back to the same book record as zipped html. HTML goes through the full conversion process, so it's eligible for preprocessing.

As far as preprocessing messing up your book formatting - I highly doubt it could have changed the order of the book's contents - I don't see how this could even be possible. There's also no way it would insert next/previous.

That said, preprocess does look for potential chapter breaks in a fairly aggressive manner, and this can infrequently cause the book to be split in undesired places. It is also checking for line unwrap options and may change paragraphs, along with a variety of other things. While it's usually quite harmless and will generally improve a poorly formatted book, it can't be guaranteed to work for every single book. If you read the help/documentation it clearly states that preprocessing could make your book worse. This is the reason the default is disabled.

The other structure detection options that apply to epub and all other formats could also introduce new split points in the doc that aren't necessarily desired. If it's doing something like that then you can tweak the 'insert page breaks before' option, or tweak the chapter detection xpath.

DoctorOhh · 01-05-2011, 11:19 AM

Quote:

Originally Posted by ldolse

As far as preprocessing messing up your book formatting - I highly doubt it could have changed the order of the book's contents - I don't see how this could even be possible. There's also no way it would insert next/previous.

What I believe he said he did to mess up the order of his book was to do an ePub to zip conversion with unknown settings checked. Then he took the resultant zip file and converted it to epub with unknown settings checked.

Why? I haven't a clue.

cybmole · 01-05-2011, 11:27 AM

Quote:

Originally Posted by dwanthny

What I believe he said he did to mess up the order of his book was to do an ePub to zip conversion with unknown settings checked. Then he took the resultant zip file and converted it to epub with unknown settings checked.

Why? I haven't a clue.

why - because someone here suggested that double conversion, as a workaround for the preprocess.. tick box being ignored with epub sources.

EDIT: the thread is here
https://www.mobileread.com/forums/sho...d.php?t=114420
but reading it again, I see that I misunderstood the suggestion - which was actually to copy, rename & reimport the epub as zip. NOT to convert it to ZIP in calibre.
END EDIT

I've not done any more testing but I suspect the epub to zip CONVERSION generates extra .xhtml pages with next - prev links.

if I go epub to zip, then zip to epub, then open the end result in sigil I see lots of .xhtml sheets that were not there before, and each sheet looks like a frame header with next and prev buttons on it. that would not be so bad if the order did not get mangled at the same time.

i think that the messed up order has nothing to do with pre-processing options and is most likely happening as a side effect in the epub-zip stage of conversion.

try it & see for yourselves, all books should behave similarly.

kovidgoyal · 01-05-2011, 11:37 AM

The zip output plugin creates HTML that is suitable for a website. Do not do an epub to zip conversion; rename your epub to .zip and convert that if you want to use preprocessing options.
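The copy-and-rename step described above can be sketched in a few lines of Python. This is only an illustration of the file handling, not anything calibre does internally: the helper name `copy_epub_as_zip` is made up for this example, and adding the resulting .zip to the book record and converting it still happens in calibre afterwards. The sketch copies rather than renames, so the original epub stays untouched.

```python
import shutil
from pathlib import Path


def copy_epub_as_zip(epub_path):
    """Copy book.epub to book.zip next to it and return the new path.

    Hypothetical helper for illustration only. The bytes are copied
    unchanged; only the file extension differs.
    """
    src = Path(epub_path)
    if src.suffix.lower() != ".epub":
        raise ValueError(f"expected an .epub file, got {src.name}")
    dst = src.with_suffix(".zip")
    if dst.exists():
        # Refuse to overwrite an existing zip with the same name.
        raise FileExistsError(dst)
    shutil.copyfile(src, dst)
    return dst
```

Calling `copy_epub_as_zip("mybook.epub")` would leave `mybook.epub` in place and create `mybook.zip` beside it, ready to be added to the book record.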

Wolfgan · 01-11-2011, 10:42 AM

In converting a lit to mobi for my kindle3, the standard chapter detection XPath expression didn't detect anything (the source file is rather flat in that sense).

As the source file chapters seem to be just a # symbol starting a paragraph (ie #And then the beast run thru the forest... ), I created the following XPath expression:

Code:

//*[re:test(., '(?s)^#\w+','i')]

It works well on the detection side; what's puzzling me is that it detects each chapter twice, one after the other (as if the parser ran twice).
Any clue? Thanks in advance, Wolf.

kovidgoyal · 01-11-2011, 10:44 AM

That means that the string is present in your input document twice.

Wolfgan · 01-11-2011, 12:33 PM

Quote:

Originally Posted by kovidgoyal

That means that the string is present in your input document twice.

Thanks kovid, but I checked that no text duplications exist in the html debug files to trigger the expression twice. What's weird is that /processed & /structure files show double toc entries, the first one empty.

/parsed

Code:

<p class="MsoPlainText"><span style="font-size:12.0pt;font-family:&quot;Trebuchet MS&quot;;">#Textbody...</span></p>

/structure

Code:

<hr/><p class="MsoPlainText" id="calibre_toc_2"><hr/><span style="font-size:12.0pt;font-family:&quot;Trebuchet MS&quot;;" id="calibre_toc_3">#Textbody...</span></p>

/processed

Code:

<p class="MsoPlainText1" id="calibre_toc_2"><hr class="calibre4"/><span id="calibre_toc_3" class="calibre3">#Textbody...</span></p>

Still without clue, I hope this helps. Thanks, Wolf.
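One possible reading of the debug output above: in the /structure file both the `<p>` and the `<span>` inside it received a `calibre_toc_` id, which suggests the expression matched two nested elements rather than two copies of the text. `//*` tests every element, and the string value of a `<p>` includes the text of the `<span>` it contains, so both begin with `#`. The sketch below imitates that test with Python's standard library instead of calibre's own XPath engine; the sample markup and the `matching_tags` helper are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Simplified stand-in for the /parsed markup shown above (assumed sample).
SNIPPET = (
    "<body>"
    "<p>Title page</p>"
    '<p class="MsoPlainText"><span>#Textbody...</span></p>'
    '<p class="MsoPlainText"><span>Ordinary paragraph</span></p>'
    "</body>"
)

# Same pattern as the XPath expression: '(?s)^#\w+' with the 'i' flag.
CHAPTER_RE = re.compile(r"(?s)^#\w+", re.IGNORECASE)


def matching_tags(xml_text, only_tag=None):
    """Return the tags of elements whose full text content matches."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        if only_tag is not None and elem.tag != only_tag:
            continue
        # Like XPath's string value: the element's text plus all descendants' text.
        string_value = "".join(elem.itertext())
        if CHAPTER_RE.search(string_value):
            hits.append(elem.tag)
    return hits


print(matching_tags(SNIPPET))                # ['p', 'span']
print(matching_tags(SNIPPET, only_tag="p"))  # ['p']
```

If that reading is right, limiting the expression to one element type, for example something like `//h:p[re:test(., '^#\w+', 'i')]`, might give a single entry per chapter. That expression is untested against this book.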

kovidgoyal creator of calibre Posts: 43,850 Karma: 22666666 Join Date: Oct 2006 Location: Mumbai, India Device: Various	That means that the string is present in your input document twice.