Text Analysis & Paragraph Detection

ahi · 09-14-2009, 11:21 AM

I would like to post some thoughts, musings, et cetera on text analysis and paragraph detection. While I am giving my own thoughts mostly in relation to the work I am actually doing on pacify, this discussion need not in any way focus on that specific program/use/approach.

---

Paragraph Detection

Detecting line-broken paragraphs actually seems straightforward--assuming the file is at least semi-consistently prepared.

The most straightforward way to detect whether or not paragraphs are line-broken is to simply count what percentage of non-empty lines begin with a character that is not an opening quote, a dash/en-dash/em-dash, an opening parenthesis, or a capital letter.

If the percentage is more than 50 (for a book of any length, much more than 50), chances are very good that it contains line-broken paragraphs.
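A minimal Python sketch of that percentage check. The function name `looks_line_broken`, the `SENTENCE_OPENERS` set, and the choice to strip leading whitespace before looking at the first character are assumptions made for illustration; none of them comes from pacify itself.

```python
# Characters (besides capitals) that may plausibly start a sentence.
# This particular set is an assumption for the sketch.
SENTENCE_OPENERS = set("\"'\u201c\u2018(-\u2013\u2014")


def looks_line_broken(text, threshold=50.0):
    """Return (percentage, verdict) for the 'odd line start' heuristic."""
    lines = [ln.lstrip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]  # keep non-empty lines only
    if not lines:
        return 0.0, False
    # A line is "odd" if it starts with neither a capital nor an opener.
    odd = sum(
        1 for ln in lines
        if not (ln[0].isupper() or ln[0] in SENTENCE_OPENERS)
    )
    pct = 100.0 * odd / len(lines)
    return pct, pct > threshold
```

The 50 percent threshold is the figure from the post; a real book would normally land far above it, so the exact cut-off should rarely matter.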

If this is the case, the best way to fix up the paragraphs, so that each paragraph sits on its own line, is as follows:

1) Run through the entire document, counting how many times certain sequences of whitespace characters occur:

- The most frequent whitespace sequence should be ' ' (i.e.: a single space). Word breaks, if you will.

- The second most frequent whitespace sequence should be whatever whitespace sequence is used to separate intra-paragraph lines. (like a single newline character)

- The third most frequent whitespace sequence should be the paragraph break indicating whitespace sequence. (like two newline characters)

2) Replace all instances of the second most frequent whitespace sequence with a single space.

This will result, in most cases, in a file that has each paragraph on its own line. It may however also incorrectly single-line non-paragraph text. This is usually of minimal consequence, more likely to impact title page text than anything else.
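The two steps above can be sketched in Python roughly as follows. The name `unbreak_paragraphs` is hypothetical, and the sketch simply trusts the frequency ranking: it does not verify that the most frequent run really is a single space.

```python
import re
from collections import Counter

WS_RUN = re.compile(r"\s+")  # any maximal run of whitespace


def unbreak_paragraphs(text):
    """Replace the second most frequent whitespace run with one space."""
    counts = Counter(WS_RUN.findall(text))
    ranked = [seq for seq, _ in counts.most_common()]
    if len(ranked) < 2:
        return text  # nothing that looks like an intra-paragraph break
    line_break = ranked[1]
    # Only runs identical to the chosen sequence are replaced;
    # paragraph breaks and indented lines are left alone.
    return WS_RUN.sub(
        lambda m: " " if m.group(0) == line_break else m.group(0),
        text,
    )
```

On very short inputs the ranking can tie or come out in an unexpected order, so this is only meaningful on book-length text.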

If poems and quotes are indented with leading spaces or a leading tab, they will not be erroneously processed along with paragraphs, as their whitespace sequences will be different from that of intra-paragraph linebreaks.

Also, some files are not 100% consistent in what whitespace sequence separates intra-paragraph lines. Usually the problem is an additional space character either at the beginning or at the end of the line... sometimes.

This can be easily addressed by using whitespace weights instead of the sequences themselves. Instead of counting whitespace sequences in the above described process, I only count weights... spaces are worth 0.24, tabs 2.00, linebreaks 8.00. With such a system \s\r and \r\s and even \s\r\s are worth 8 (if rounded).
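A sketch of the weighting idea in Python, using the weights quoted above. The function names, the normalisation of "\r\n" to "\n", and treating any other whitespace character as weight 0 are assumptions of this sketch.

```python
import re
from collections import Counter

WS_RUN = re.compile(r"\s+")
# Weights from the post; "\r" and "\n" both count as a linebreak.
WEIGHTS = {" ": 0.24, "\t": 2.00, "\n": 8.00, "\r": 8.00}


def whitespace_weight(seq):
    """Rounded weight of one whitespace run."""
    return round(sum(WEIGHTS.get(ch, 0.0) for ch in seq))


def unbreak_by_weight(text):
    """Like the sequence-counting method, but rank rounded weights."""
    text = text.replace("\r\n", "\n")  # assumption: normalise first
    counts = Counter(whitespace_weight(s) for s in WS_RUN.findall(text))
    ranked = [w for w, _ in counts.most_common()]
    if len(ranked) < 2:
        return text
    target = ranked[1]  # weight class of intra-paragraph breaks
    return WS_RUN.sub(
        lambda m: " " if whitespace_weight(m.group(0)) == target
        else m.group(0),
        text,
    )
```

Note that with these weights a lone space rounds to 0, so ordinary word breaks fall into their own class, while a linebreak with a stray space on either side still rounds to 8.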

The final improvement to this intra-paragraph linebreak fixing method would be to ensure that the whitespace sequence substitution only takes place within paragraphs.

This could be achieved by checking to make sure either

A) The current line is directly preceded and/or followed by one or more lines that are, excepting the final line before an empty line, of average length. The first of these lines should begin with a valid sentence-starting character (capital letter, opening quote/parenthesis, dash/en-dash/em-dash, et cetera)... and the last should end with a valid sentence/paragraph-ending character (period, colon, exclamation mark, question mark, closing quote, closing parenthesis, et cetera).

Not all of the above should be absolutely required for a given line to be considered part of a paragraph. If most are met, a single exception is not necessarily a deal-breaker. E.g.: if all is well, except that the paragraph ends with a comma... it should probably still be treated like a paragraph.

B) If the current line has neither directly preceding nor directly following lines, check 5-10 lines forward and back to ascertain that those lines contain paragraphs. If they do, the current line is fairly certain to be a paragraph shorter than the line-break line length.
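Check A, including the "one exception is tolerable" rule, might be sketched like this in Python. Everything named here (`looks_like_paragraph`, the 25% length tolerance, the `OPENERS` and `ENDERS` sets) is an assumption for illustration. Check B is not covered.

```python
OPENERS = set("\"'\u201c\u2018(-\u2013\u2014")
ENDERS = set(".:!?\"'\u201d\u2019)")


def looks_like_paragraph(block, avg_len, tolerance=0.25, allowed_misses=1):
    """block: consecutive non-empty lines found between two empty lines."""
    if not block:
        return False
    first = block[0].lstrip()
    if not first:
        return False
    body = block[:-1]  # the final line is allowed to be short
    length_ok = all(
        abs(len(ln) - avg_len) <= tolerance * avg_len for ln in body
    )
    start_ok = first[0].isupper() or first[0] in OPENERS
    end_ok = block[-1].rstrip()[-1:] in ENDERS
    # Tolerate a limited number of failed criteria.
    misses = [length_ok, start_ok, end_ok].count(False)
    return misses <= allowed_misses
```

With the default of one allowed miss, a block that is fine except for ending on a comma still counts as a paragraph, which matches the example given above.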

---

Though the above may be long and somewhat meandering... using the above ideas, it should be fairly straightforward to implement a paragraph detecting/fixing algorithm.

I encourage and welcome similarly hairy (but well thought out) descriptions for other text analysis/fixing tasks.

- Ahi

ahi · 09-14-2009, 11:34 AM

Quotation Mark Fixing

The way I've been fixing quotation marks is by parsing through the document, character by character, and keeping track of whether the current state of the document is quotation-opened or quotation-closed.

Doing so, however, led to fairly frequent errors due to (legitimately) unclosed quotation marks. As a result, I started overriding the decision of whether to put an opening quotation mark or a closing one based on which side of the quotation mark had alphanumeric characters (as opposed to whitespace or punctuation). This fixed most false positives.

In English, however, there is also the use of apostrophes in words. Therefore single quotation marks that have alphanumeric characters on both sides (e.g.: Steve's, it's, ain't) are considered apostrophes and not quotation marks. Also, any single quotation mark that follows an 's' is considered suspect of being an apostrophe (e.g.: Jesus' name, Boris' house)... suspicion being turned to certainty if the paragraph is yet to have an opening single quote and/or has no subsequent closing single quote or following-line opening single quote (as said line's first character).
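A rough Python sketch of the approach so far: a character-by-character pass that tracks the open/closed state for double quotes, lets the neighbouring characters override that state, and treats a single quote with alphanumerics on both sides as an apostrophe. This is not pacify's code; `curl_quotes` is a hypothetical name, and the extra suspicion test for quotes after a trailing 's' is left out.

```python
def curl_quotes(text):
    out = []
    inside_double = False  # tracked state: is a double quote open?
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if ch == '"':
            if nxt.isalnum() and not prev.isalnum():
                opening = True            # neighbours override state
            elif prev.isalnum() and not nxt.isalnum():
                opening = False
            else:
                opening = not inside_double  # fall back on the state
            out.append("\u201c" if opening else "\u201d")
            inside_double = opening
        elif ch == "'":
            if nxt.isalnum() and not prev.isalnum():
                out.append("\u2018")      # opening single quote
            else:
                # Apostrophe or closing single quote: same glyph.
                out.append("\u2019")
        else:
            out.append(ch)
    return "".join(out)
```

Because the apostrophe and the closing single quote are the same character (U+2019), the possessive cases come out right typographically even without the extra test; it would matter only if the single-quote state were tracked as well.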

The last bit of complication would be words like >> 'Tis <<. This is probably best handled by an exception list... which, while not exhaustive, should work reasonably well for the vast majority of documents. Or, alternatively, the user could be alerted about lone-ranger single quotation marks (as they do, in some PG documents, occur by error... or, rather, sometimes a second single quotation mark fails to occur by error but is discernible by context).

- Ahi

nekokami · 09-14-2009, 11:42 AM

I've been using end-of-line punctuation, rather than beginning-of-line characters, to distinguish paragraph marks. This can lead to false positives if punctuation just happens to fall at the end of a line, but my results have been fairly good so far.
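For comparison, one way the end-of-line idea could look in Python. This is a guess at the general shape, not the actual algorithm; `join_by_line_endings` and the `PARAGRAPH_ENDERS` tuple are assumptions.

```python
# Characters that, at the end of a line, are taken to close a paragraph.
PARAGRAPH_ENDERS = tuple(".!?:\"'\u201d\u2019)")


def join_by_line_endings(text):
    """Join lines until one ends in paragraph-ending punctuation."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # An empty line always closes whatever is pending.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if stripped.endswith(PARAGRAPH_ENDERS):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)
```

The false positives show up whenever a sentence happens to end exactly at a line end: "He stopped." followed by "Then he ran on" is split into two paragraphs even if both belong to one.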

One problem with relying on quotation marks specifically is that in English language texts, often a quote that runs for more than one paragraph does not have closing quotation marks for the earlier paragraph(s), but only for the final paragraph of the quote. The use of single quotes as both apostrophes and as dialogue markers (more common in British than American English) can be a problem, especially in the case of plural possessives, as you describe above.

Sparrow · 09-14-2009, 11:45 AM

Quote:

Originally Posted by ahi

The most straightforward way to detect whether or not paragraphs are line-broken is to simply count what percentage of non-empty lines begin with a character that is not an opening quote, a dash/en-dash/em-dash, an opening parenthesis, or a capital letter.

Like Neko, I look for the character that precedes the line break. In MSWord the line break is generally the ^p character.
I replace all space+^p occurrences with ^p; repeating until all such spaces are removed.

Then for any letter/number, or non-full stop punctuation (except quotes) I replace the following ^p with a space.

Hyphens get replaced individually, since some may need to be retained.

This is quick and dirty - it will retain full-stop+^p when they should be full-stop+space - but the process is normally just prep for proof-reading. (Or I can just opt to live with those inaccuracies.)
Also, verses need to be edited manually.
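Outside Word, the same steps might be approximated in Python as below, with "\n" standing in for ^p. The function name is hypothetical, and leaving a break alone when it is followed by an empty line is an added assumption, not part of the steps described.

```python
import re

# A break is kept after a full stop, a quote, or a hyphen
# (hyphens are meant to be reviewed one by one).
KEEP_BREAK_AFTER = ".\"'\u201c\u201d\u2018\u2019-"
JOIN = re.compile(r"([^\s" + re.escape(KEEP_BREAK_AFTER) + r"])\n(?!\n)")


def quick_and_dirty_unbreak(text):
    # Step 1: drop spaces that sit directly before a line break.
    text = re.sub(r" +\n", "\n", text)
    # Step 2: after any other character, turn the break into a space.
    return JOIN.sub(r"\1 ", text)
```

As noted, a full stop that happens to fall at a line end inside a paragraph keeps its break, so the output still needs proof-reading.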

ahi · 09-14-2009, 12:02 PM

Quote:

Originally Posted by nekokami

I've been using end-of-line punctuation, rather than beginning-of-line characters, to distinguish paragraph marks. This can lead to false positives if punctuation just happens to fall at the end of a line, but my results have been fairly good so far.

One problem with relying on quotation marks specifically is that in English language texts, often a quote that runs for more than one paragraph does not have closing quotation marks for the earlier paragraph(s), but only for the final paragraph of the quote. The use of single quotes as both apostrophes and as dialogue markers (more common in British than American English) can be a problem, especially in the case of plural possessives, as you describe above.

The algorithm vaguely described above does fairly well with paragraph-spanning quotation marks, single quotation marks, and apostrophes within a single document.

Of course, there's no real way to put that into a single regex... it probably requires a script of at least a dozen lines.

Quote:

Originally Posted by Sparrow

Also, verses need to be edited manually.

I think verses should be detectable too... even if not helpfully preceded (on each line) with additional whitespace. Basically you are looking for irregular lines... less than average length, perhaps all ending on punctuation (but not always on sentence-ending punctuation)... possibly several starting with capitals despite there being no sentence-ending punctuation on the preceding line.
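Purely as a starting point, those signals could be combined along these lines in Python. The thresholds (lines under 75% of the average length, at least 80% of lines short) and the name `looks_like_verse` are untested guesses.

```python
def looks_like_verse(block, avg_len, short_ratio=0.75, min_short=0.8):
    """block: consecutive non-empty lines; avg_len: typical prose line."""
    if len(block) < 2:
        return False
    # Signal 1: most lines are clearly shorter than a prose line.
    short = sum(1 for ln in block if len(ln.strip()) < short_ratio * avg_len)
    mostly_short = short >= min_short * len(block)
    # Signal 2: a line starts with a capital although the line before
    # it did not end a sentence.
    caps_mid_sentence = sum(
        1 for prev, cur in zip(block, block[1:])
        if cur.strip()[:1].isupper()
        and prev.rstrip()[-1:] not in (".", "!", "?")
    )
    return mostly_short and caps_mid_sentence >= 1
```

A prose paragraph fails the first signal because only its last line is short; a stanza usually passes both.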

I've not actually attacked this problem yet... but when I do, I'll post my ideas in detail.

I think it should be possible for the majority of straightforward books to autodetect chapter titles and verses/quoted portions... with considerable accuracy.

- Ahi

Jellby · 09-14-2009, 12:24 PM

Quote:

Originally Posted by ahi

The last bit of complication would be words like >> 'Tis <<.

And things like Cockney accent, you better be careful, or the dog might bite before you see 'im comin'

ahi · 09-14-2009, 12:27 PM

Quote:

Originally Posted by Jellby

And things like Cockney accent, you better be careful, or the dog might bite before you see 'im comin'

He he... how likely is a text full of such delightful colloquialism to opt to use single quotation marks?

- Ahi

Ps.: Though if it did, the exclusion list approach might be an alright way of handling it. If we know the text is English, >> 'im << is never (unless "Im" is a proper noun... but being uncapitalized, it isn't) the beginning of a quote... and >> comin' << is likewise never (at least correctly) the end of one.
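The exclusion-list idea might look something like this in Python. The two word sets are tiny, made-up examples, and `is_apostrophe` is a hypothetical helper; the comparison is case-insensitive, so it does not make the proper-noun distinction mentioned above.

```python
import re

# Illustrative lists only; a usable version would need many more entries.
LEADING_ELISIONS = {"tis", "twas", "im", "em"}       # 'Tis, 'im
TRAILING_ELISIONS = {"comin", "goin", "nothin"}      # comin'

WORD = re.compile(r"[A-Za-z]+")
TAIL = re.compile(r"[A-Za-z]+$")


def is_apostrophe(text, i):
    """True if the single quote at text[i] is an apostrophe."""
    prev = text[i - 1] if i > 0 else " "
    nxt = text[i + 1] if i + 1 < len(text) else " "
    if not prev.isalnum():
        # Word-initial position: apostrophe only for listed words.
        m = WORD.match(text, i + 1)
        return bool(m) and m.group(0).lower() in LEADING_ELISIONS
    if not nxt.isalnum():
        # Word-final position: check the word that just ended.
        m = TAIL.search(text[:i])
        return bool(m) and m.group(0).lower() in TRAILING_ELISIONS
    return True  # letters on both sides: it's, ain't
```

Anything this returns False for would then go through the normal opening/closing decision.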

nekokami · 09-14-2009, 12:33 PM

So do you have a regexp to share for the first algorithm? Some Word macros that people could download to implement these would be nice.

I believe the next version of Calibre is supposed to include my algorithm, but yours might have been a better choice....

ahi · 09-14-2009, 12:43 PM

Quote:

Originally Posted by nekokami

So do you have a regexp to share for the first algorithm? Some Word macros that people could download to implement these would be nice.

I believe the next version of Calibre is supposed to include my algorithm, but yours might have been a better choice....

No. I don't use regexes. I find they are too fragile and their full impact is difficult to accurately foresee. (i.e.: They fail too easily and sometimes change things they weren't intended to by the regex "writer".)

Pacify more or less uses the algorithm described for fixing quotation marks... you are welcome to download it and play with it and/or check out the source. But it is not ready for primetime as yet.

A Word macro might be possible... assuming they use VBScript or JScript... but I haven't done that sort of thing in a while. Perhaps you'd like to take a crack at it, based on the description and the code in pacify.py?

- Ahi

nekokami · 09-14-2009, 12:49 PM

I'd rather see pacify incorporated into Calibre and/or Sigil.

ahi · 09-14-2009, 12:52 PM

Quote:

Originally Posted by nekokami

I'd rather see pacify incorporated into Calibre and/or Sigil.

Hehe.

Perhaps once it's reasonably stable, I'll offer it to all and sundry for seamless incorporation into their backends.

I'd be curious to know whether you find it works as well or better than your own approach. I think it should (even the version featured on the first post of the thread I linked to)... but I'd be grateful to know with certainty, if you are up to doing a few checks.

- Ahi

nekokami · 09-14-2009, 12:54 PM

I'll see what I can do over the next couple of days.

Jellby · 09-14-2009, 01:34 PM

Quote:

Originally Posted by ahi

He he... how likely is a text full of such delightful colloquialism to opt to use single quotation marks?

Meet Mr. Galsworthy's The Forsyte Chronicles, or some of the Wodehouse works

ahi · 09-14-2009, 01:41 PM

Quote:

Originally Posted by Jellby

Meet Mr. Galsworthy's The Forsyte Chronicles, or some of the Wodehouse works

Well... in that case, it shouldn't be too hard to build up a reasonably strong exclusion list.

Although looking at Project Gutenberg's list of Wodehouse's stuff... it doesn't seem as frightening as your post first made me think.

- Ahi

Jellby · 09-14-2009, 01:52 PM

Quote:

Originally Posted by ahi

Although looking at Project Gutenberg's list of Wodehouse's stuff... it doesn't seem as frightening as your post first made me think.

No, it's not so bad. It's just stuff that happens.
