01-05-2011, 06:47 AM | #1 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
structure detection - documentation ?
I have been told in other threads that the structure detection option is ignored when the source is epub. Is that true for any other source formats, or for epub only?
What's the recommended way to force structure detection on an epub: is it to convert to zip and then back again? Is there (or should there be) any detailed documentation for the above, and for what processing operations structure detection actually performs? It seems like a tick-it-and-see black box thingie at present. |
01-05-2011, 07:43 AM | #2 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
What structure detection are you referring to? I've used header/footer removal, which is part of the structure detection conversion settings, successfully on ePub, so I don't think it's ignored. Preprocessing might be, though.
As for documentation, as usual, refer to the manual. |
01-05-2011, 08:44 AM | #3 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
It was in a discussion about TOC and chapter detection. I wanted to automate having chapters flagged with h1 or h2 tags so that they would be added to the TOC.
Someone (not you) said that structure detection is not applied if the source format is epub, because the source is assumed to be "already good". It could be that they meant only the "preprocess to improve structure detection" box, so a rephrase of my question is: for which source formats is the preprocess tick box ignored? I re-read the manual, but it does not define exactly what takes place when the box labelled preprocessing is ticked. The manual gives an overview of what it's intended to do, but I was wanting a programmer's definition of what logic is applied and how. Last edited by cybmole; 01-05-2011 at 08:47 AM. |
01-05-2011, 09:00 AM | #4 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Ah, yes, I remember that discussion. Try asking for (or waiting for) whoever programmed the preprocessing engine to explain.
As for a workaround, I believe that adding the ePub as a ZIP file and reconverting ought to work. |
01-05-2011, 09:14 AM | #5 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
My input book was perfectly ordered, but my output from the test began at chapter 19! Maybe copying the epub, renaming it as zip, then adding it to the library is safe, but converting from epub to zip seems very flawed, unless I am misunderstanding what the conversion is meant to achieve. |
|
01-05-2011, 09:25 AM | #6 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
I was thinking about exporting the ePub, renaming it as a ZIP, then reimporting it into the book and converting to ePub. You might want to store the ePub somewhere safe externally, as it is going to be overwritten.
As for the chapter order, there's a FAQ entry relevant to file ordering in multi-HTML books. I should have a small script still floating around for creating the index file discussed in that FAQ entry that I hacked together once for the purpose of converting a HTML reference book, but it only cares about what the first file is and dumps the rest of the files in as it finds them. Still, should that be useful, I can post it here. |
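A rough sketch of what such an index-building script might look like, in Python. This is not the script mentioned above, just an illustration assuming the index file is simply an HTML page that links to the book's files in the desired reading order. The function name `build_index`, the default name `index.html`, and the alphabetical ordering of the remaining files are all assumptions made for the example.

```python
import html
from pathlib import Path
from urllib.parse import quote

HTML_SUFFIXES = {".html", ".htm", ".xhtml"}

def build_index(book_dir, first_file, index_name="index.html"):
    """Write an index page linking first_file first, then every other HTML file.

    Hypothetical helper: the remaining files are listed alphabetically,
    which may or may not be the reading order you want.
    """
    book_dir = Path(book_dir)
    pages = sorted(
        p.name
        for p in book_dir.iterdir()
        if p.suffix.lower() in HTML_SUFFIXES and p.name != index_name
    )
    if first_file not in pages:
        raise FileNotFoundError(first_file)
    # Only the first file is treated specially; the rest follow as found.
    ordered = [first_file] + [name for name in pages if name != first_file]
    links = "\n".join(
        f'<p><a href="{quote(name)}">{html.escape(name)}</a></p>'
        for name in ordered
    )
    (book_dir / index_name).write_text(
        f"<html><body>\n{links}\n</body></html>\n", encoding="utf-8"
    )
    return ordered
```

Calling `build_index("unpacked_book", "cover.html")` on a folder of HTML files would write `index.html` into that folder and return the order used, so the list can be checked before converting.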
01-05-2011, 09:37 AM | #7 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
It seems to me that calibre should not allow conversion options that will lose or destroy chapter ordering, or should at least warn against them.
Is there any e-book related need for calibre to offer an epub to zip conversion, or should its epub to zip conversion option be disabled, along with any other combinations that do not preserve chapter ordering? Yes, I'd taken a backup of the epub, but others may be caught out. |
01-05-2011, 09:54 AM | #8 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
It is called YAFIYGI, as opposed to WYSIWYG. Deal with it.
|
01-05-2011, 10:21 AM | #9 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Calibre won't insert previous/next navigation - that would have to be built into the original source.
The only structure detection option that doesn't work on epub is 'preprocess html to possibly improve structure detection'. This is due to the way the conversion pipeline works, and to the fact that Calibre treats OEB as a sort of 'reference' format. All filetypes are converted from their native format to OEB internally, and then re-converted from OEB to whatever the desired output format is. Preprocessing occurs BEFORE conversion to OEB. Epub IS OEB, just in a zipped container. Therefore Calibre doesn't bother converting from epub to OEB; it just unzips it and goes from there. Because of this, epub bypasses the preprocess stage of the conversion pipeline.
If you want the preprocess option to work on an epub, just rename the file from .epub to .zip and add the zip back to the same book record as zipped html. HTML goes through the full conversion process, so it's eligible for preprocessing.
As far as preprocessing messing up your book formatting goes, I highly doubt it could have changed the order of the book's contents; I don't see how this could even be possible. There's also no way it would insert next/previous. That said, preprocess does look for potential chapter breaks in a fairly aggressive manner, and this can infrequently cause the book to be split in undesired places. It also checks for line unwrap options and may change paragraphs, along with a variety of other things. While it's usually quite harmless and will generally improve a poorly formatted book, it can't be guaranteed to work for every single book. If you read the help/documentation, it clearly states that preprocessing could make your book worse. This is the reason it is disabled by default.
The other structure detection options, which apply to epub and all other formats, could also introduce new split points in the doc that aren't necessarily desired. If it's doing something like that, then you can tweak the 'insert page breaks before' option, or tweak the chapter detection xpath. |
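For anyone who would rather script the rename step than do it by hand, here is a minimal Python sketch. It copies the epub under a .zip name instead of renaming it, so the original stays untouched; the function name `copy_epub_as_zip` is made up for this example, and adding the resulting zip to the book record still has to be done in calibre itself.

```python
import shutil
from pathlib import Path

def copy_epub_as_zip(epub_path):
    """Copy book.epub to book.zip in the same folder and return the new path.

    An epub is already a zip container, so only the extension changes;
    the file contents are copied byte for byte.
    """
    src = Path(epub_path)
    if src.suffix.lower() != ".epub":
        raise ValueError(f"expected an .epub file, got {src.name}")
    dst = src.with_suffix(".zip")
    shutil.copyfile(src, dst)
    return dst
```

For example, `copy_epub_as_zip("mybook.epub")` would leave `mybook.epub` in place and create `mybook.zip` beside it, ready to be added to the library as zipped html.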
01-05-2011, 11:19 AM | #10 | |
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
Why? I haven't a clue. |
|
01-05-2011, 11:27 AM | #11 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
EDIT: the thread is here https://www.mobileread.com/forums/sho...d.php?t=114420 but reading it again, I see that I misunderstood the suggestion, which was actually to copy, rename & reimport the epub as zip, NOT to convert it to ZIP in calibre. END EDIT I've not done any more testing, but I suspect the epub to zip CONVERSION generates extra .xhtml pages with next/prev links. If I go epub to zip, then zip to epub, then open the end result in Sigil, I see lots of .xhtml sheets that were not there before, and each sheet looks like a frame header with next and prev buttons on it. That would not be so bad if the order did not get mangled at the same time. I think that the messed-up order has nothing to do with the preprocessing options and is most likely happening as a side effect of the epub-to-zip stage of conversion. Try it & see for yourselves; all books should behave similarly. Last edited by cybmole; 01-05-2011 at 11:36 AM. |
|
01-05-2011, 11:37 AM | #12 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The zip output plugin creates HTML that is suitable for a website. Do not do an epub to zip conversion; if you want to use the preprocessing options, rename your epub to .zip and convert that.
|
01-11-2011, 10:42 AM | #13 |
Avid reader
Posts: 19
Karma: 10
Join Date: Feb 2009
Location: Argentina
Device: Kindle 3 wifi
|
Double chapter detection
In converting a lit to mobi for my Kindle 3, the standard chapter detection XPath expression didn't detect anything (the source file is rather flat in that sense).
As the source file chapters seem to be marked just by a # symbol starting a paragraph (i.e. #And then the beast run thru the forest... ), I created the following XPath expression: Code:
//*[re:test(., '(?s)^#\w+','i')]
Any clue? Thanks in advance, Wolf. |
01-11-2011, 10:44 AM | #14 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That means that the string is present in your input document twice.
|
01-11-2011, 12:33 PM | #15 | |
Avid reader
Posts: 19
Karma: 10
Join Date: Feb 2009
Location: Argentina
Device: Kindle 3 wifi
|
Quote:
/parsed Code:
<p class="MsoPlainText"><span style="font-size:12.0pt;font-family:"Trebuchet MS";">#Textbody...</span></p> Code:
<hr/><p class="MsoPlainText" id="calibre_toc_2"><hr/><span style="font-size:12.0pt;font-family:"Trebuchet MS";" id="calibre_toc_3">#Textbody...</span></p> Code:
<p class="MsoPlainText1" id="calibre_toc_2"><hr class="calibre4"/><span id="calibre_toc_3" class="calibre3">#Textbody...</span></p> |
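One possible reading of the markup above, offered as a sketch rather than a confirmed diagnosis: `//*` tests every element, and in XPath `.` is the element's string value (its own text plus the text of its descendants), so a `<p>` and the `<span>` nested inside it both start with `#Textbody` and both match. That would fit the two ids, calibre_toc_2 and calibre_toc_3, landing on the nested elements. The Python below imitates that test with the standard library only. The sample markup and the helper names are made up for illustration, and it does not use calibre's own XPath engine.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern and flags as the XPath expression in the earlier post.
CHAPTER_RE = re.compile(r"(?s)^#\w+", re.IGNORECASE)

# Simplified, hypothetical markup modelled on the snippets above.
SAMPLE = (
    "<body>"
    '<p class="MsoPlainText"><span>Some ordinary text.</span></p>'
    '<p class="MsoPlainText"><span>#Textbody...</span></p>'
    "</body>"
)

def string_value(element):
    """Concatenated text of an element and all of its descendants."""
    return "".join(element.itertext())

def matching_tags(root, only=None):
    """Tags of elements whose text matches, optionally limited to some tags."""
    return [
        el.tag
        for el in root.iter()
        if (only is None or el.tag in only) and CHAPTER_RE.search(string_value(el))
    ]

root = ET.fromstring(SAMPLE)
print(matching_tags(root))              # ['p', 'span']
print(matching_tags(root, only={"p"}))  # ['p']
```

If this is what is happening, restricting the expression to one element type, for example `//h:p[re:test(., '(?s)^#\w+', 'i')]`, might avoid the duplicate. That expression is an untested suggestion.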
|
|