Quote:
Originally Posted by nekokami
I've been using end-of-line punctuation, rather than beginning-of-line characters, to distinguish paragraph marks. This can lead to false positives if punctuation just happens to fall at the end of a line, but my results have been fairly good so far.
One problem with relying on quotation marks specifically is that in English language texts, often a quote that runs for more than one paragraph does not have closing quotation marks for the earlier paragraph(s), but only for the final paragraph of the quote. The use of single quotes as both apostrophes and as dialogue markers (more common in British than American English) can be a problem, especially in the case of plural possessives, as you describe above. 
|
The algorithm vaguely described above does fairly well with paragraph-spanning quotation marks, single quotation marks, and apostrophes within a single document.
Of course, there's no real way to put that into a single regex... it probably requires a script of at least a dozen lines.
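To give a feel for the size of such a script, here is a minimal sketch of the simpler end-of-line punctuation heuristic from the quoted post (not the quotation-mark algorithm itself). The function name, the exact punctuation set, and the handling of trailing closing quotes are all assumptions made for illustration:

```python
import re

# Assumption: a line that ends in sentence punctuation, optionally followed
# by closing quotes or brackets, marks the end of a paragraph.
PARA_END = re.compile(r'[.!?:]["\'\u201d\u2019)\]]*$')

def join_paragraphs(lines):
    """Merge hard-wrapped lines into paragraphs (hypothetical helper).

    A paragraph ends at a blank line or at a line ending in punctuation.
    """
    paragraphs, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if PARA_END.search(line):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Text left over at end of input becomes the last paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs
```

As the quoted post warns, this splits wrongly whenever a sentence just happens to end at a line break inside a paragraph, so it is only a starting point.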
Quote:
Originally Posted by Sparrow
Also, verses need to be edited manually.
|
I think verses should be detectable too... even if not helpfully preceded (on each line) by additional whitespace. Basically you are looking for irregular lines... shorter than average, perhaps all ending in punctuation (but not always sentence-ending punctuation)... possibly several starting with capitals despite there being no sentence-ending punctuation on the preceding line.
I've not actually attacked this problem yet... but when I do, I'll post my ideas in detail.
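Purely as a rough illustration of the irregular-line idea, and not a tested solution, the cues could be combined something like this. The function name, the 0.75 length ratio, and the minimum run of three lines are guesses chosen for the example:

```python
from statistics import mean

def find_verse_runs(lines, short_ratio=0.75, min_run=3):
    """Return (start, end) index pairs, end exclusive, for runs of likely verse.

    Hypothetical heuristic: a line is a verse candidate when it is clearly
    shorter than the average line and either ends in punctuation or starts
    with a capital although the previous line did not end a sentence.
    """
    lengths = [len(l.strip()) for l in lines if l.strip()]
    if not lengths:
        return []
    threshold = short_ratio * mean(lengths)

    def looks_like_verse(i):
        line = lines[i].strip()
        if not line or len(line) >= threshold:
            return False
        prev = lines[i - 1].strip() if i > 0 else ""
        capital_mid_sentence = (
            line[0].isupper() and bool(prev) and prev[-1] not in ".!?"
        )
        ends_in_punctuation = line[-1] in ",;:.!?"
        return capital_mid_sentence or ends_in_punctuation

    runs, start = [], None
    # Iterate one step past the end so a run reaching the last line is closed.
    for i in range(len(lines) + 1):
        if i < len(lines) and looks_like_verse(i):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs
```

Requiring several candidate lines in a row is what keeps the short last line of an ordinary paragraph from being flagged. Blank lines break a run, so stanzas separated by blank lines come back as separate runs.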
I think it should be possible for the majority of straightforward books to autodetect chapter titles and verses/quoted portions... with considerable accuracy.
- Ahi