Detect chapters without using tag or class.

tonyx3 · 09-14-2010, 01:41 AM

Disclaimer: I am in no way a regex or XPATH expert, so I could be completely wrong. Please don't completely flame me out.

Ok, so this got kinda long.
SUMMARY:

Is there a way to match chapters using only a regex?
If not, can we please have one?
What about testing chapter detection before conversion?

There are lots of threads on the forum with people asking for help getting their chapters to detect properly.

For me, at least, a lot of the trouble I have comes from trying to use calibre to convert poorly formatted books from one format into properly formatted books in another format.

For example, in an ideal situation, the chapter titles in the source format will all be in their own line, with proper opening and closing tags (preferably h1), and will include the word 'chapter.' If that happens, it's easy to get calibre to detect them and generate a proper Table of Contents for my target format.

In reality, though, I often find myself trying to convert books in which the chapter titles don't have 'chapter' in them, and often don't have any special or unique format tags. Basically, there are a lot of poorly formatted books out there. I won't even mention trying to convert from PDF.

Here's an actual recent example from a book I converted:

Source format: LRF
The chapter titles simply consist of the chapter number, spelled out in all caps, e.g. ONE, TWO, THREE, etc. (this isn't a huge problem, but it's annoying).

The tag isn't a header tag, it's <span>, and it isn't closed out immediately after the chapter title, like this:

Code:

preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.

Now, that would be annoying, but workable if I could just use a regex to match the <span class="ts1"> just before the chapter title and the <span class="ts2"> just after it. The chapter titles are the only thing formatted that way.

But, unless I'm mistaken, I can't do that, because I need to use XPATH to match tags and classes for the chapters.

I don't know the best solution for that.

I guess I'm advocating for the addition of an option to match chapters with only a regex (in the same way that the header removal works). It's easy to write a regex to match most chapters. Much easier, for me at least, than trying to use the current chapter detection options.

Also an option to test the chapter detection before conversion would be great.

Often I try to use the test feature of the header removal to write my expression for chapters, but then it doesn't work when I use it, and I don't know that until the conversion is done.

I'm sure there is some way to write an amazing expression for the example I gave, but it's not as simple as matching with a regex (using the header removal test feature, this is easy).

ldolse · 09-14-2010, 01:47 AM

Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.

DoctorOhh · 09-14-2010, 01:51 AM

Quote:

Originally Posted by ldolse

Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.

I'm curious. What, if anything, should you enter in the preprocess area?

capidamonte · 09-14-2010, 02:12 AM

Where is the preprocess option? I've searched several times now, and I'm having one of those blind spot moments.

DoctorOhh · 09-14-2010, 02:14 AM

Quote:

Originally Posted by capidamonte

Where is the preprocess option? I've searched several times now, and I'm having one of those blind spot moments.

Where it is is easy, what it does I'll let others address.

Preferences - Conversion - Common Options - Structure detection

Or Convert books icon - Structure Detection (I have to scroll down)

tonyx3 · 09-14-2010, 02:56 AM

Quote:

Originally Posted by ldolse

Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.

I tried enabling it. It didn't make any difference to the output.

Is there something I'm supposed to do here other than tick the little box?

I'm all for learning to work with the existing setup, so if there's some way to do it, let me know. I just haven't had much luck with the current options unless the source file is already semi-well formatted.

I inevitably give up and use calibre to convert to rtf, then format in Word, and then use calibre to convert that to epub (or whatever format).

But in a lot of cases the overall formatting is fine, it's just the chapters that are the problem.

I shouldn't have to go out to rtf and Word and back just to get chapters correct, when the rest of the formatting is fine.

Manichean · 09-14-2010, 04:12 AM

You could use Sigil to edit the ePub directly; that, at least, would spare you having to go to Word. Not much of an improvement, I'll admit.

As for your initial problem... you gave the example

Code:

preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.

for what you want to match. Generally, it should be possible to match tags in a regexp, though, of course, the regexp won't "understand" those tags other than as a string. In your example, the expression

Code:

<span class="ts1">[A-Z]+<span class="ts2"><br/>

should match the chapter headings.

EDIT: Oh, I'm sorry, I misread your post. I'll have to put the thinking cap back on.
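
For what it's worth, the expression does work as a plain string match outside calibre. A minimal sketch with Python's standard re module, run against the sample line from the first post; the capture group and the variable names are additions for illustration, and this says nothing about what calibre's chapter detection box will accept:

```python
import re

# Sample line from the first post (LRF content as it appears in HTML).
sample = ('preceding text from Introduction<br/><span class="ts1">ONE'
          '<span class="ts2"><br/>Beginning text of chapter 1.')

# The pattern from this post, with a capture group added around the title.
chapter_re = re.compile(r'<span class="ts1">([A-Z]+)<span class="ts2"><br/>')

# findall returns the captured group for every match in the string.
titles = chapter_re.findall(sample)
print(titles)
```

So the regex itself is fine as a string pattern; the open question in the thread is where calibre would let you apply it.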

Manichean · 09-14-2010, 04:25 AM

I'm not quite sure how to go about using the whole book as source in the re.test() function, but shouldn't

Code:

re.test(<whole book as source>, <span class="ts1">[A-Z]+<span class="ts2"><br/>, )

just use the regexp for chapter matching?
Another thing to try would be

Code:

/h:span[@class="ts1"]re.test(.,[A-Z]+,)/h:span[@class="ts2"]

Disclaimer: I didn't try the above myself. It might be horribly wrong...

tonyx3 · 09-14-2010, 05:07 AM

Quote:

Originally Posted by Manichean

I'm not quite sure how to go about using the whole book as source in the re.test() function, but shouldn't

Code:

re.test(<whole book as source>, <span class="ts1">[A-Z]+<span class="ts2"><br/>, )

just use the regexp for chapter matching?
Another thing to try would be

Code:

/h:span[@class="ts1"]re.test(.,[A-Z]+,)/h:span[@class="ts2"]

Disclaimer: I didn't try the above myself. It might be horribly wrong...

Hrm.. I just tried that second one, but calibre said it was an invalid XPATH expression. I'm not sure but I think maybe you can't use /h:span twice.. ?

capidamonte · 09-14-2010, 06:09 AM

Quote:

Originally Posted by dwanthny

Where it is is easy, what it does I'll let others address.

Preferences - Conversion - Common Options - Structure detection

Or Convert books icon - Structure Detection (I have to scroll down)

Thanks! That is a non-obvious place.

Manichean · 09-14-2010, 07:36 AM

Quote:

Originally Posted by tonyx3

Hrm.. I just tried that second one, but calibre said it was an invalid XPATH expression. I'm not sure but I think maybe you can't use /h:span twice.. ?

I don't know, I've never used XPath. I just skimmed over the tutorial in the manual and tried to guess what could work, after I realized that a regexp alone wouldn't work...

ldolse · 09-14-2010, 09:22 AM

I just noticed your source format is LRF. That's not hooked into the preprocess option at the moment, partly because I don't really have any LRF files in my library, and partly because I wasn't aware that it was a format that had problems like this.

Right now preprocessing works for html, Lit, txt, and rtf inputs. One option would be to specify a directory for debug output and grab the first pass 'parsed' html output and re-convert that.
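
If you go the debug-output route, one way to make the headings detectable is to add a real heading tag to the intermediate HTML before re-converting. A rough sketch, assuming the headings look exactly like the ts1/ts2 sample earlier in the thread; the function name, the h2 tag and the "chapter" class are made up for illustration and are not anything calibre generates:

```python
import re

# Matches the opening ts1 span plus an all-caps title, but only when the
# ts2 span follows immediately (lookahead, so ts2 itself is left untouched).
TITLE_RE = re.compile(r'<span class="ts1">([A-Z]+)(?=<span class="ts2">)')

def mark_chapters(html):
    """Move each title out of the span and into a new <h2 class="chapter">."""
    return TITLE_RE.sub(r'<h2 class="chapter">\1</h2><span class="ts1">', html)

sample = ('preceding text from Introduction<br/><span class="ts1">ONE'
          '<span class="ts2"><br/>Beginning text of chapter 1.')
marked = mark_chapters(sample)
print(marked)
```

After that, a tag-based expression such as //h:h2[@class="chapter"] in the chapter detection box should have something to match. The original spans are left in place, so the rest of the formatting is not disturbed.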

I wouldn't recommend enabling it globally in the preferences section; do it on a book-by-book basis. Overall it's pretty conservative and won't do much to a well marked up file. The only really destructive thing it will do across all files is remove all non-breaking spaces.

As far as what preprocessing does, I don't quite remember what's in 0.7.18; there are a bunch of changes going into the next release. I think 0.7.18 has basic chapter detection and line unwrapping. It worked pretty well on txt, rtf, and some types of lit files, but I've tested with a larger range of crappy files now, so the new code is doing better.

Right now it attempts to:

Convert non-breaking space indents to CSS indents
Remove the remaining non-breaking spaces (the most destructive thing it does right now)
Check the file to see if there are blank lines inserted between every paragraph and delete them if that's the case (second most destructive thing, need to improve this to preserve soft breaks if they exist)
Add markup to lit files which are actually glorified text in <pre> tags and a lit wrapper
Try up to four different regexes for chapter/chapter title detection, trying the ones with fewest false positives first, and mark the matches in h2/h3 tags
Unwrap hard line breaks based on the median line length and punctuation
Remove/unwrap soft hyphens, unwrap other hyphens
Search for places where h1 or h2 headers immediately follow each other from one line to the next, which will cause Calibre to split on those points, and change the second header to h3. This prevents chapter headings and titles/images from being separated. (This particular step is also applied to mobi files.)
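
The "try the safest regex first" step in the list above could look roughly like this. The three patterns, the min_hits cutoff and the function name are invented for illustration; they are not the regexes calibre actually ships:

```python
import re

# Hypothetical patterns, ordered from fewest expected false positives to most.
PATTERNS = [
    # "Chapter 1", "Part Two", "Book III: ..." at the start of a line.
    re.compile(r'^[ \t]*(?:chapter|part|book)[ \t]+\w+.*$', re.I | re.M),
    # A line holding nothing but a number, e.g. "12" or "12."
    re.compile(r'^[ \t]*\d+\.?[ \t]*$', re.M),
    # A short line in all caps, e.g. "ONE" or "THE END".
    re.compile(r'^[ \t]*[A-Z][A-Z ]{2,}[ \t]*$', re.M),
]

def detect_chapters(text, min_hits=2):
    """Return headings found by the first pattern with at least min_hits matches."""
    for pattern in PATTERNS:
        hits = [m.group(0).strip() for m in pattern.finditer(text)]
        if len(hits) >= min_hits:
            return hits  # stop here; looser patterns are never tried
    return []

print(detect_chapters("ONE\nSome text here.\nTWO\nMore text.\n"))
```

Stopping at the first pattern that finds enough headings is what keeps a loose pattern like the all-caps one from firing on books that already have proper "Chapter N" lines.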

I've tested this across a couple dozen garbage lit files and a bunch of html, txt, and rtf files. Getting fairly good results at this point, but the line unwrapping could use some more work. It works best when all the hard line breaks are pretty much in the same place, but if the lengths are variable then line unwrapping might not work. I need to add a user configurable unwrap_factor like pdf to resolve that problem. It has other problems similar to pdf where lines aren't always unwrapped to avoid false positives - will be looking into cases where there is spacing between paragraphs or indents to make this a bit smarter.
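
To make the median-length idea concrete, here is a simplified sketch of unwrapping hard line breaks. It is not the preprocessing code itself: the function name is hypothetical, the unwrap_factor default of 0.8 is a guess, and the punctuation test is much cruder than anything a real implementation would need.

```python
import statistics

def unwrap_lines(lines, unwrap_factor=0.8):
    """Join hard-wrapped lines into paragraphs.

    A line is treated as wrapped (joined to the next one) only if it is at
    least unwrap_factor * median length AND does not end in sentence-ending
    punctuation. Anything else closes the current paragraph.
    """
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return []
    threshold = statistics.median(lengths) * unwrap_factor

    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line is always a paragraph boundary.
            if current:
                paragraphs.append(' '.join(current))
                current = []
            continue
        current.append(text)
        ends_sentence = text.endswith(('.', '!', '?', '"'))
        if len(text) < threshold or ends_sentence:
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs

wrapped = [
    "It was a dark and stormy night and the",
    "rain fell in torrents.",
    "Next paragraph starts here and it also",
    "wraps onto a second line.",
]
for paragraph in unwrap_lines(wrapped):
    print(paragraph)
```

Note that a full-length line that happens to end in a period is left unjoined, which is the "not always unwrapped to avoid false positives" trade-off described above. It also shows why variable line lengths hurt: the median stops being a useful yardstick.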

Anyway the idea isn't to be perfect, it's just to make it so that as few hand edits as possible are required after conversion.

ldolse · 09-14-2010, 09:31 AM

Quote:

Originally Posted by dwanthny

I'm curious. What, if anything, should you enter in the preprocess area?

Nothing to enter in the area, just enable the checkbox. Right now nothing is user configurable.

BTW, just found a bunch of lrf files on my system I didn't know about, looking into them now. At first glance it looks like some modifications to the regexes for chapter detection will be needed. I don't think I accounted for the headings to be nested in so many div/span tags, haven't seen that with the other formats.

I'm curious though: for the LRF samples I have, the sources all look good, and the files are already nicely split per chapter. Is this not the case with yours? The only problem I've seen is that in one case a TOC wasn't automatically generated during conversion, but look and feel was ok even in this case.

DoctorOhh · 09-14-2010, 10:53 AM

Quote:

Originally Posted by ldolse

Nothing to enter in the area, just enable the checkbox. Right now nothing is user configurable.

Maybe I'm looking in the wrong spot, but the one under Structure Detection has an editable area.

Quote:

Originally Posted by ldolse

I'm curious though: for the LRF samples I have, the sources all look good, and the files are already nicely split per chapter. Is this not the case with yours? The only problem I've seen is that in one case a TOC wasn't automatically generated during conversion, but look and feel was ok even in this case.

Just like with every other format, if garbage was the source, garbage is what they ended up with.

tonyx3 · 09-14-2010, 10:57 AM

Quote:

Originally Posted by dwanthny

Just like with every other format, if garbage was the source, garbage is what they ended up with.

Exactly.

For this particular issue, if the body text is ok, then having more options for chapter detection would solve a ton of conversion problems.

Having it be dependent on the source having proper tags adds an extra layer of trouble with improperly formatted sources.


