MobileRead Forums - View Single Post - Cleaning books--Book Designer or other?

dstampe · 12-23-2007, 06:06 PM

So the consensus seems to be to avoid using BD for this process if possible. Bummer.

I've seen the macros, and developed similar ones that handle a lot of the reformatting when there are ways to identify paragraph ends (double line breaks or indents). Haven't had as much luck creating "plausible" paragraphs where there are none, or identifying chapters labelled with numbers or Roman numerals.

The idea I had for detecting plausible paragraph ends relies on finding line breaks that are preceded by sentence ends [.!?] and followed by sentence starts [A-Z]. This would also have to handle the case of quote marks (apostrophes for British books) wrapping the sentence start/end. I really don't know how well this would work; I suspect it will be correct in 90% of cases. The 10% of errors will be acceptable, unless the error splits dialog between quotes. Problem is that tagged substitution is not as easy to use to flag blocks as it would be with a real parser.
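
A minimal Python sketch of the line-break heuristic described above, not a tested cleaning tool. The names `mark_paragraphs`, `PARA_BREAK`, `CLOSERS` and `OPENERS` are made up for illustration, the set of quote characters is an assumption, and the input is assumed to contain only single line breaks (no blank lines already marking paragraphs).

```python
import re

# Quote characters that may wrap a sentence end or start (an assumed set:
# straight quotes plus curly double and single quotes).
CLOSERS = "\"'\u201d\u2019"
OPENERS = "\"'\u201c\u2018"

# A line break preceded by [.!?] (plus an optional closing quote) and
# followed by an optional opening quote and a capital letter.
PARA_BREAK = re.compile(
    rf"([.!?][{CLOSERS}]?)[ \t]*\n[ \t]*(?=[{OPENERS}]?[A-Z])"
)

SENTINEL = "\x00"  # temporary flag; assumed not to occur in the text


def mark_paragraphs(text: str) -> str:
    """Turn plausible paragraph-ending line breaks into blank lines
    and join every other line break with a single space."""
    # Pass 1: flag the breaks that look like paragraph ends.
    flagged = PARA_BREAK.sub(lambda m: m.group(1) + SENTINEL, text)
    # Pass 2: whatever line breaks remain are treated as soft wraps.
    joined = re.sub(r"[ \t]*\n[ \t]*", " ", flagged)
    # Pass 3: expand the flags into blank-line paragraph separators.
    return joined.replace(SENTINEL, "\n\n")
```

The two passes are the "tagged substitution" approach: the first flags candidates, the second cleans up the rest. The sketch will still misfire where a line happens to wrap after an abbreviation such as "Mr." followed by a capitalised name, and it can split a multi-sentence quotation that wraps at a sentence end, which is the dialog case mentioned above.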

Then there's the issue of recognizing chapter headings. It would be best to have a search criterion that matches all possible "hits" at once, but I'm not sure regular expressions can handle numbers, text, and Roman numerals in the same search string.
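
For what it's worth, a single regular expression can cover digits, number words and Roman numerals by using alternation. The Python sketch below is one possible pattern, not a definitive one. The names `CHAPTER`, `ROMAN`, `NUMBER_WORDS` and `find_chapter_labels` are hypothetical, the word list stops at twenty, and a heading is assumed to sit alone on its line, optionally prefixed by the word "Chapter".

```python
import re

# Spelled-out numbers, matched case-insensitively (assumed to stop at twenty).
NUMBER_WORDS = (
    r"(?i:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    r"thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty)"
)

# Upper-case Roman numerals; the lookahead rejects the empty match.
ROMAN = (
    r"(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})"
    r"(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)

# One pattern for all three label styles, anchored to a whole line.
CHAPTER = re.compile(
    rf"^[ \t]*(?:(?i:chapter)[ \t]+)?"
    rf"(?P<label>\d+|{ROMAN}|{NUMBER_WORDS})"
    rf"[ \t]*[.:]?[ \t]*$",
    re.MULTILINE,
)


def find_chapter_labels(text: str) -> list[str]:
    """Return the label of every line that looks like a chapter heading."""
    return [m.group("label") for m in CHAPTER.finditer(text)]
```

Because the "Chapter" prefix is optional, a line holding nothing but a number (a stray page number, say) also counts as a hit, so the matches are best treated as candidates to review rather than as confirmed headings.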

Any thoughts on these?
