09-14-2010, 01:41 AM | #1 |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Detect chapters without using tag or class.
Disclaimer: I am in no way a regex or XPATH expert, so I could be completely wrong. Please don't completely flame me out.
Ok, so this got kinda long. SUMMARY:
There are lots of threads on the forum from people asking for help getting their chapters detected properly. For me, at least, a lot of the trouble comes from trying to use calibre to convert poorly formatted books in one format into properly formatted books in another. In an ideal situation, the chapter titles in the source will each be on their own line, with proper opening and closing tags (preferably h1), and will include the word 'chapter'. When that happens, it's easy to get calibre to detect them and generate a proper Table of Contents for the target format. In reality, though, I often find myself converting books in which the chapter titles don't contain 'chapter' and often don't have any special or unique formatting tags. Basically, there are a lot of poorly formatted books out there. I won't even mention trying to convert from PDF.
Here's an actual recent example from a book I converted:
Source format: LRF
The chapter titles simply consist of the chapter number, spelled out in all caps, e.g. ONE, TWO, THREE, etc. (this isn't a huge problem, but it's annoying). The tag isn't a header tag, it's <span>, and it isn't closed immediately after the chapter title, like this:
Code:
preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.
But, unless I'm mistaken, I can't do that, because I need to use XPATH to match tags and classes for the chapters. I don't know the best solution for that. I guess I'm advocating for the addition of an option to only match a regex for chapters (in the same way that the header removal works). It's easy to write a regex to match most chapters. Much easier, for me at least, than trying to use the current chapter detection options.
Also, an option to test the chapter detection before conversion would be great. Often I try to use the test feature of the header removal to write my expression for chapters, but then it doesn't work when I use it, and I don't know that until the conversion is done. I'm sure there is some way to write an amazing expression for the example I gave, but it's not as simple as matching with a regex (using the header removal test feature, this is easy) |
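To make the "just match a regex" idea concrete, here is a minimal Python sketch that finds chapter markers of the kind shown in the example. The class names ts1 and ts2 come from the post; the pattern itself, the function name find_chapters, and the second sample chapter are assumptions for illustration, not anything calibre provides.

```python
import re

# All-caps spelled-out numbers (ONE, TWENTY-ONE, ...) wrapped in the span
# markup from the example. The pattern is an assumption, not calibre's own.
CHAPTER_RE = re.compile(
    r'<span class="ts1">([A-Z]+(?:[ -][A-Z]+)*)<span class="ts2"><br/>'
)


def find_chapters(html):
    """Return (offset, title) pairs for every chapter marker found."""
    return [(m.start(), m.group(1)) for m in CHAPTER_RE.finditer(html)]


sample = (
    'preceding text from Introduction<br/>'
    '<span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.'
    '<br/><span class="ts1">TWENTY-ONE<span class="ts2"><br/>More text.'
)

for offset, title in find_chapters(sample):
    print(offset, title)
```

A pattern like this works on the raw markup as a string, which is why it is easy to test in a regex tester but cannot be pasted into a field that expects an XPath expression.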
09-14-2010, 01:47 AM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.
|
09-14-2010, 01:51 AM | #3 |
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
I'm curious. What, if anything, should you enter in the preprocess area?
Last edited by DoctorOhh; 09-14-2010 at 01:54 AM. |
09-14-2010, 02:12 AM | #4 |
Not who you think I am...
Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
|
Where is the preprocess option? I've searched several times now, and I'm having one of those blind spot moments.
|
09-14-2010, 02:14 AM | #5 | |
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
Preferences - Conversion - Common Options - Structure detection
Or
Convert books icon - Structure Detection (I have to scroll down) |
|
09-14-2010, 02:56 AM | #6 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Quote:
Is there something I'm supposed to do here other than tick the little box? I'm all for learning to work with the existing setup, so if there's some way to do it, let me know. I just haven't had much luck with the current options unless the source file is already semi-well formatted. I inevitably give up and use calibre to convert to rtf, then format in Word, and then use calibre to convert that to epub (or whatever format). But in a lot of cases the overall formatting is fine, it's just the chapters that are the problem. I shouldn't have to go out to rtf and Word and back just to get chapters correct, when the rest of the formatting is fine. |
|
09-14-2010, 04:12 AM | #7 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
You could use Sigil to edit the ePub directly; that, at least, would spare you having to go to Word. Not much of an improvement, I'll admit.
As for your initial problem... you gave the example
Code:
preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.
Code:
<span class="ts1">[A-Z]+<span class="ts2"><br/>
EDIT: Oh, I'm sorry, I misread your post. I'll have to put the thinking cap back on.
Last edited by Manichean; 09-14-2010 at 04:15 AM. |
09-14-2010, 04:25 AM | #8 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
I'm not quite sure how to go about using the whole book as source in the re.test() function, but shouldn't
Code:
re.test(<whole book as source>, <span class="ts1">[A-Z]+<span class="ts2"><br/>, )
Another thing to try would be
Code:
/h:span[@class="ts1"]re.test(.,[A-Z]+,)/h:span[@class="ts2"] |
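For reference, the calibre manual's XPath tutorial writes the regex helper as re:test(...) with a colon and quoted arguments, placed inside a [...] predicate rather than between path steps. Something along the lines of //h:span[@class="ts1" and re:test(text(), "^[A-Z]+$")] may be closer to valid, but treat that exact expression as an untested assumption. The underlying idea (select spans by class, then test only the span's own leading text against a regex) can be sketched in plain Python with the standard library. The sample markup below assumes the unclosed spans have been repaired into well-formed XHTML, and chapter_spans is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}
TITLE_RE = re.compile(r"^[A-Z]+(?:[ -][A-Z]+)*$")

# Assumed repaired form of the example markup: both spans are closed.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    'preceding text from Introduction<br/>'
    '<span class="ts1">ONE<span class="ts2"><br/>'
    'Beginning text of chapter 1.</span></span>'
    '<span class="ts1">not a chapter</span>'
    '</body></html>'
)


def chapter_spans(root):
    """Return the ts1 spans whose own leading text is an all-caps title."""
    hits = []
    for span in root.iterfind(".//h:span[@class='ts1']", NS):
        # span.text is only the text before the first child element, so the
        # chapter body nested inside the span does not affect the match.
        own_text = (span.text or "").strip()
        if TITLE_RE.match(own_text):
            hits.append(span)
    return hits


print([s.text for s in chapter_spans(doc)])
```

Testing only the leading text matters here because the ts2 span is nested inside the ts1 span, so the full text content of the ts1 span would include the chapter body as well as the title.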
09-14-2010, 05:07 AM | #9 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Quote:
Hrm.. I just tried that second one, but calibre said it was an invalid XPATH expression. I'm not sure but I think maybe you can't use /h:span twice.. ? |
|
09-14-2010, 06:09 AM | #10 |
Not who you think I am...
Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
|
|
09-14-2010, 07:36 AM | #11 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
I don't know, I've never used XPath. I just skimmed over the tutorial in the manual and tried to guess what could work, after I realized that a regexp alone wouldn't work...
|
09-14-2010, 09:22 AM | #12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I just noticed your source format is LRF. That's not hooked into the preprocess option at the moment, partly because I don't really have any lrf files in my library, and partly because I wasn't aware that it was a format that had problems like this.
Right now preprocessing works for html, Lit, txt, and rtf inputs. One option would be to specify a directory for debug output, grab the first-pass 'parsed' html output, and re-convert that. I wouldn't recommend enabling it globally in the preferences section; do it on a book-by-book basis. Overall it's pretty conservative and won't do much to a well-marked-up file. The only really destructive thing it will do across all files is remove all non-breaking spaces.
As far as what preprocessing does, I don't quite remember what's in .7.18; there are a bunch of changes going into the next release. I think .7.18 has basic chapter detection and line unwrapping. .7.18 worked pretty well on txt, rtf, and some types of lit files, but I've tested with a larger range of crappy files now, so the new code is doing better. Right now it attempts to:
I've tested this across a couple dozen garbage lit files and a bunch of html, txt, and rtf files. I'm getting fairly good results at this point, but the line unwrapping could use some more work. It works best when all the hard line breaks are pretty much in the same place; if the line lengths are variable, line unwrapping might not work. I need to add a user-configurable unwrap_factor, like the one for pdf, to resolve that problem. It also has problems similar to pdf, where lines aren't always unwrapped in order to avoid false positives. I'll be looking into cases where there is spacing between paragraphs or indents to make this a bit smarter.
Anyway, the idea isn't to be perfect, it's just to make it so that as few hand edits as possible are required after conversion.
Last edited by ldolse; 09-14-2010 at 09:24 AM. |
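To illustrate why unwrapping depends on where the hard line breaks fall, here is a minimal Python sketch of a length-based heuristic. It is not calibre's preprocessing code: the function name unwrap, the use of the median line length, and the exact meaning given to unwrap_factor are all assumptions made for this example.

```python
import statistics


def unwrap(lines, unwrap_factor=0.5):
    """Join hard-wrapped lines into paragraphs.

    A line is treated as wrapped (and joined to the next one) when it is at
    least unwrap_factor * median line length long and does not end with
    sentence-ending punctuation. Blank lines always end a paragraph.
    """
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return []
    threshold = unwrap_factor * statistics.median(lengths)

    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(text)
        wrapped = len(text) >= threshold and not text.endswith((".", "!", "?", '"'))
        if not wrapped:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


lines = [
    "It was a dark and stormy night and the rain",
    "fell in torrents except at occasional",
    "intervals.",
    "ONE",
    "The second paragraph starts here and runs",
    "on to a second line.",
]
for paragraph in unwrap(lines):
    print(paragraph)
```

When the wrapped lines all have roughly the same length, one threshold separates them cleanly from short lines such as a chapter title. When lengths vary widely, no single threshold does, which is where a user-adjustable factor would help.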
09-14-2010, 09:31 AM | #13 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
BTW, I just found a bunch of lrf files on my system I didn't know about, and I'm looking into them now. At first glance it looks like some modifications to the regexes for chapter detection will be needed. I don't think I accounted for the headings being nested in so many div/span tags; I haven't seen that with the other formats.
I'm curious, though: for the LRF samples I have, the sources all look good and the files are already nicely split per chapter. Is this not the case with yours? The only problem I've seen is that in one case a TOC wasn't automatically generated during conversion, but the look and feel was ok even in this case.
Last edited by ldolse; 09-14-2010 at 10:24 AM. |
|
09-14-2010, 10:53 AM | #14 | ||
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
Quote:
Last edited by DoctorOhh; 09-14-2010 at 10:56 AM. |
||
09-14-2010, 10:57 AM | #15 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Quote:
For this particular issue, if the body text is ok, then having more options for chapter detection would solve a ton of conversion problems. Having it be dependent on the source having proper tags adds an extra layer of trouble with improperly formatted sources. |
|
Tags |
chapter, regex |
|