MobileRead Forums - View Single Post - Detect chapters without using tag or class.

ldolse · 09-14-2010, 09:22 AM

I just noticed your source format is LRF. That's not hooked into the preprocess option at the moment, partly because I don't really have any LRF files in my library, and partly because I wasn't aware it was a format that had problems like this.

Right now preprocessing works for html, lit, txt, and rtf inputs. One workaround would be to specify a directory for debug output, grab the first-pass 'parsed' html output from it, and re-convert that.

I wouldn't recommend enabling it globally in the preferences section; do it on a book-by-book basis. Overall it's pretty conservative and won't do much to a well-marked-up file. The only really destructive thing it does across all files is remove all non-breaking spaces.

As far as what preprocessing does, I don't quite remember what's in 0.7.18, and there are a bunch of changes going into the next release. I think 0.7.18 has basic chapter detection and line unwrapping. 0.7.18 worked pretty well on txt, rtf, and some types of lit files, but I've tested with a larger range of crappy files now, so the new code is doing better.

Right now it attempts to:

Convert non-breaking space indents to css indents
Remove any remaining non-breaking spaces (the most destructive thing it does right now)
Check whether blank lines have been inserted between every paragraph, and delete them if that's the case (second most destructive thing; I need to improve this to preserve soft breaks if they exist)
Add markup to lit files which are actually glorified text in <pre> tags and a lit wrapper
Try up to four different regexes for chapter/chapter title detection, starting with the ones with the fewest false positives, and mark the matches with h2/h3 tags
Unwrap hard line breaks based on the median line length and punctuation
Remove/unwrap soft hyphens, and unwrap other hyphens
Search for places where h1 or h2 headers immediately follow each other from one line to the next, which would cause Calibre to split at those points, and change the second header to h3. This prevents chapter headings and titles/images from being separated. (This particular step is also applied to mobi files.)
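
To make two of those steps a bit more concrete, here is a rough Python sketch of the chapter detection cascade and the adjacent-header fix. It is only an illustration of the idea, not the code that's in Calibre: the patterns, the function names (mark_chapters, demote_adjacent_headings) and the min_hits parameter are all made up for the example, and it only marks chapters with h2.

```python
import re

# Hypothetical patterns, ordered from strictest (fewest false positives)
# to loosest. Illustrative only; these are not Calibre's actual regexes.
CHAPTER_PATTERNS = [
    re.compile(r"^\s*(chapter|part|book)\s+(\d+|[ivxlc]+)\b.*$", re.IGNORECASE),
    re.compile(r"^\s*(prologue|epilogue|introduction)\s*$", re.IGNORECASE),
    re.compile(r"^\s*\d{1,3}\s*$"),
]


def mark_chapters(lines, min_hits=2):
    """Wrap chapter lines in <h2>, using the first pattern that matches
    at least min_hits lines. Looser patterns are only tried if the
    stricter ones find too little."""
    for pattern in CHAPTER_PATTERNS:
        hits = [i for i, line in enumerate(lines) if pattern.match(line)]
        if len(hits) >= min_hits:
            marked = list(lines)
            for i in hits:
                marked[i] = "<h2>" + lines[i].strip() + "</h2>"
            return marked
    return list(lines)  # nothing matched often enough; leave the text alone


HEADING = re.compile(r"^<h([12])>(.*)</h\1>$")


def demote_adjacent_headings(lines):
    """If an h1/h2 line directly follows another h1/h2 line, rewrite the
    second one as h3 so the two are not split apart."""
    out = []
    for line in lines:
        m = HEADING.match(line)
        if m and out and HEADING.match(out[-1]):
            out.append("<h3>" + m.group(2) + "</h3>")
        else:
            out.append(line)
    return out
```

Trying the strict patterns first means a book with proper "Chapter N" lines never reaches the bare-number pattern, which is the one most likely to misfire on page numbers and the like.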

I've tested this across a couple dozen garbage lit files and a bunch of html, txt, and rtf files. I'm getting fairly good results at this point, but the line unwrapping could use some more work. It works best when all the hard line breaks are pretty much in the same place; if the line lengths are variable, unwrapping might not work. I need to add a user-configurable unwrap_factor, like pdf has, to resolve that problem. It also has a problem similar to pdf, where lines aren't always unwrapped in order to avoid false positives. I'll be looking into cases where there is spacing between paragraphs or indents to make this a bit smarter.
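
For anyone curious what unwrapping on median line length and punctuation looks like, here is a minimal Python sketch. It is an assumption-heavy illustration, not Calibre's implementation: the function name unwrap_lines, the TERMINATORS tuple, the 0.8 default and the way unwrap_factor scales the median are all invented for the example.

```python
from statistics import median

# Punctuation that suggests a line ends a sentence or paragraph
# (hypothetical choice for this sketch).
TERMINATORS = (".", "!", "?", ":", '"', "\u201d")


def unwrap_lines(text, unwrap_factor=0.8):
    """Join hard-wrapped lines. A line is treated as wrapped when it is
    at least unwrap_factor * median line length long and does not end
    with closing punctuation."""
    lines = [line.strip() for line in text.split("\n")]
    lengths = [len(line) for line in lines if line]
    if not lengths:
        return text
    threshold = median(lengths) * unwrap_factor

    paragraphs = []
    prev = ""  # the previous source line, before any joining
    for line in lines:
        wrapped = (
            bool(line)
            and bool(prev)
            and len(prev) >= threshold
            and not prev.endswith(TERMINATORS)
        )
        if wrapped:
            paragraphs[-1] += " " + line  # continue the current paragraph
        else:
            paragraphs.append(line)  # start a new paragraph (or keep a blank line)
        prev = line
    return "\n".join(paragraphs)
```

A lower unwrap_factor joins more aggressively, which helps when the break positions vary a lot, at the cost of more false positives. A long line that happens to end in a full stop is still left alone, which is the "not always unwrapped" trade-off.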

Anyway, the idea isn't to be perfect; it's just to make it so that as few hand edits as possible are required after conversion.

Last edited by ldolse; 09-14-2010 at 09:24 AM.