MobileRead Forums - View Single Post

ahi · 09-14-2009, 12:21 PM

I would like to post some thoughts, musings, et cetera on text analysis and paragraph detection. While I am giving my own thoughts mostly in relation to the work I am actually doing on pacify, this discussion need not in any way focus on that specific program/use/approach.

---

Paragraph Detection

Detecting line-broken paragraphs actually seems straightforward--assuming the file is at least semi-consistently prepared.

The most straightforward way to detect whether or not paragraphs are line-broken is to simply count what percentage of non-empty lines begin with a character that is not an opening quote, a dash/en-dash/em-dash, an opening parenthesis, or a capital letter.

If the percentage is more than 50 (for a book of any real length, it will usually be much more than 50), chances are very good that the file contains line-broken paragraphs.
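A minimal Python sketch of this check might look like the following. The function names, the exact set of opening characters, and the choice to skip leading whitespace before looking at the first character are all assumptions made for illustration, not a description of how pacify does it.

```python
# Characters (besides capital letters) that can plausibly open a sentence:
# straight and curly opening quotes, guillemet, opening parenthesis,
# hyphen, en-dash, em-dash. This set is an assumption; adjust as needed.
SENTENCE_OPENERS = set("\"'(-\u2013\u2014\u201c\u2018\u00ab")


def starts_like_sentence(line: str) -> bool:
    """True if the first visible character could begin a sentence."""
    first = line.lstrip()[:1]
    return first.isupper() or first in SENTENCE_OPENERS


def continuation_percentage(text: str) -> float:
    """Percentage of non-empty lines that do NOT start like a sentence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    odd = sum(1 for ln in lines if not starts_like_sentence(ln))
    return 100.0 * odd / len(lines)


def looks_line_broken(text: str, threshold: float = 50.0) -> bool:
    """Apply the 'more than 50 percent' rule of thumb from the post."""
    return continuation_percentage(text) > threshold
```

On a file that already has one paragraph per line the percentage should sit near 0, while hard-wrapped prose tends to land well above the threshold.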

If this is the case, the best way to fix up the paragraphs, so that each paragraph has its own line, is to:

1) Run through the entire document, counting how many times certain sequences of whitespace characters occur:

- The most frequent whitespace sequence should be ' ' (i.e.: a single space). Word breaks, if you will.

- The second most frequent whitespace sequence should be whatever sequence is used to separate intra-paragraph lines (e.g. a single newline character).

- The third most frequent whitespace sequence should be the one that indicates a paragraph break (e.g. two newline characters).

2) Replace all instances of the second most frequent whitespace sequence with a single space.

This will result, in most cases, in a file that has each paragraph on its own line. It may, however, also incorrectly join non-paragraph text onto a single line. This is usually of minimal consequence, and is more likely to affect title page text than anything else.
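The two steps above can be sketched in Python roughly as follows. This is an illustration only: `rank_whitespace_runs` and `join_broken_lines` are hypothetical names, and the code simply trusts that the second most frequent whitespace run really is the intra-paragraph linebreak.

```python
import re
from collections import Counter

# A "whitespace sequence" here is any maximal run of whitespace characters.
WHITESPACE_RUN = re.compile(r"\s+")


def rank_whitespace_runs(text: str):
    """Step 1: count every whitespace run, most frequent first."""
    return Counter(WHITESPACE_RUN.findall(text)).most_common()


def join_broken_lines(text: str) -> str:
    """Step 2: replace the second most frequent run with a single space."""
    ranked = rank_whitespace_runs(text)
    if len(ranked) < 2:
        return text  # nothing that could be an intra-paragraph break
    line_break = ranked[1][0]
    # Only runs that match the chosen sequence exactly are replaced,
    # so paragraph breaks and indented lines are left alone.
    return WHITESPACE_RUN.sub(
        lambda m: " " if m.group() == line_break else m.group(), text
    )
```

It is probably worth printing the first few entries of `rank_whitespace_runs(text)` before trusting the result, since a file that breaks the frequency assumption (for example, one with very short paragraphs) would get the wrong sequence replaced.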

If poems and quotes are indented with leading spaces or a leading tab, they will not be erroneously processed along with paragraphs, as their whitespace sequences will be different from that of intra-paragraph linebreaks.

Also, some files are not 100% consistent in what whitespace sequence separates intra-paragraph lines. Usually the problem is an additional space character either at the beginning or at the end of the line... sometimes.

This can be easily addressed by using whitespace weights instead of the sequences themselves. Instead of counting whitespace sequences in the process described above, I only count weights: spaces are worth 0.24, tabs 2.00, linebreaks 8.00. With such a system \s\r and \r\s (8.24 each) and even \s\r\s (8.48) are all worth 8 when rounded, where \s stands for a space and \r for a linebreak.
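Here is a rough Python sketch of the weighted variant. The weights are the ones given above; everything else is an assumption made for illustration: the function names are hypothetical, `\r\n`, `\r` and `\n` are each treated as one linebreak, and whitespace other than spaces, tabs and linebreaks is ignored.

```python
import re
from collections import Counter

WEIGHTS = {" ": 0.24, "\t": 2.00, "\n": 8.00}
# Only the characters that have a weight are considered part of a run.
WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def run_weight(run: str) -> int:
    """Rounded weight of one whitespace run."""
    # Assumption: every linebreak style counts as a single linebreak.
    run = run.replace("\r\n", "\n").replace("\r", "\n")
    return round(sum(WEIGHTS[ch] for ch in run))


def join_broken_lines_by_weight(text: str) -> str:
    """Same two steps as before, but counting weights instead of sequences."""
    counts = Counter(run_weight(r) for r in WHITESPACE_RUN.findall(text))
    ranked = counts.most_common()
    if len(ranked) < 2:
        return text
    line_break_weight = ranked[1][0]
    return WHITESPACE_RUN.sub(
        lambda m: " " if run_weight(m.group()) == line_break_weight else m.group(),
        text,
    )
```

With these weights a single space rounds to 0, a lone linebreak (with or without a stray space on either side) rounds to 8, and a blank-line paragraph break rounds to 16, so the three classes stay distinct even when the file is sloppy about trailing or leading spaces.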

The final improvement to this intra-paragraph linebreak fixing method would be to ensure that the whitespace sequence substitution only takes place within paragraphs.

This could be achieved by checking to make sure either

A) The current line is directly preceded and/or followed by one or more lines that are, excepting the final line before an empty line, of average length. The first of these lines begins with a valid sentence-starting character (capital letter, opening quote/parenthesis, dash/en-dash/em-dash, et cetera), and the last of them ends with a valid sentence/paragraph-ending character (period, colon, exclamation mark, question mark, closing quote, closing parenthesis, et cetera).

Not all of the above should be absolutely required for a given line to be considered part of a paragraph; if most are met, a single exception is not necessarily a deal-breaker. E.g.: if all is well except that the paragraph ends with a comma, it should probably still be treated like a paragraph.

B) If the current line has neither directly preceding nor directly following lines, check 5-10 lines forward and back to ascertain that those lines contain paragraphs. If they do, the current line is fairly certain to be a paragraph shorter than the line-break line length.
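One possible way to sketch checks A and B in Python is shown below. Quite a lot here is assumption rather than anything stated above: the names are hypothetical, "average length" is approximated by the median length of the non-final lines of each block, a line counts as average if it is within 25% of that, "most are met" is read as at least two of the three checks, and B uses a window of 8 lines with a simple majority vote of the nearby multi-line blocks.

```python
from statistics import median

# Assumed character sets (capital letters are handled separately).
OPENERS = set("\"'(-\u2013\u2014\u201c\u2018\u00ab")
ENDERS = set(".:!?\"')\u201d\u2019\u00bb")


def split_blocks(lines):
    """Group consecutive non-empty lines into (start_index, lines) blocks."""
    blocks, current, start = [], [], 0
    for i, ln in enumerate(lines):
        if ln.strip():
            if not current:
                start = i
            current.append(ln)
        elif current:
            blocks.append((start, current))
            current = []
    if current:
        blocks.append((start, current))
    return blocks


def passes_check_a(block, typical_len, tolerance=0.25):
    """Check A: opener, ender and line lengths; one miss is tolerated."""
    first, last = block[0].strip(), block[-1].strip()
    checks = [
        first[:1].isupper() or first[:1] in OPENERS,
        last[-1:] in ENDERS,
        all(abs(len(ln.rstrip()) - typical_len) <= tolerance * typical_len
            for ln in block[:-1]),
    ]
    return sum(checks) >= 2


def classify_blocks(text, window=8, tolerance=0.25):
    """Return one True/False per block: does it look like a paragraph?"""
    blocks = split_blocks(text.splitlines())
    body = [len(ln.rstrip()) for _, b in blocks for ln in b[:-1]]
    typical_len = median(body) if body else 0
    verdicts = [passes_check_a(b, typical_len, tolerance) if len(b) > 1 else None
                for _, b in blocks]
    final = []
    for (start, _), verdict in zip(blocks, verdicts):
        if verdict is None:
            # Check B: a lone line borrows the verdict of its neighbourhood.
            nearby = [v for (s, b), v in zip(blocks, verdicts)
                      if v is not None
                      and s <= start + window
                      and s + len(b) - 1 >= start - window]
            verdict = sum(nearby) * 2 > len(nearby)
        final.append(verdict)
    return final
```

The whitespace substitution would then only be applied inside blocks marked True. Note that check B as written looks only at the neighbourhood, so a lone heading sitting between paragraphs would also be marked True; that is harmless here, since a single line has no internal linebreaks to replace.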

---

Though the above may be long and somewhat meandering... using the above ideas, it should be fairly straightforward to implement a paragraph detecting/fixing algorithm.

I encourage and welcome similarly hairy (but well thought out) descriptions for other text analysis/fixing tasks.

- Ahi
