09-14-2009, 02:38 PM   #60
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by ekaser
You're really looking at a statistical analysis (effectively) to figure out what the paragraph formatting of a text file is. Another thing you could include in this process is the maximum line length. You could just compute an average line length, and if it's over a certain amount, you probably have a paragraph-break file. Another, perhaps better, way is to keep a running count of the number of lines of each length N, from 0 to, say, 255. Any line longer than 255 gets lumped in with the lines of length 255. For line-break files, the bulk of the file will be lines shorter than some limit (80 to 128 max, I'd guess), with very few over that. A paragraph-break file will have far more lines longer than that limit, with probably quite a few in the 255 bucket.
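
A rough Python sketch of the bucket-count idea, for illustration only. The function names and the 20% threshold are assumptions made up for this example; the 255 cap and the 128 limit are the guesses from the paragraph above, not tuned values.

```python
def line_length_histogram(text, cap=255):
    """Count lines by length; any line longer than cap lands in the cap bucket."""
    buckets = [0] * (cap + 1)
    for line in text.splitlines():
        buckets[min(len(line), cap)] += 1
    return buckets


def looks_paragraph_break(text, limit=128, threshold=0.2):
    """Guess whether each line holds a whole paragraph.

    limit is the longest line expected in a hard-wrapped file (assumed 128).
    threshold is the share of non-blank lines over limit that tips the guess
    (assumed 0.2, purely illustrative).
    """
    buckets = line_length_histogram(text)
    nonblank = sum(buckets[1:])  # bucket 0 holds the blank lines
    if nonblank == 0:
        return False
    long_lines = sum(buckets[limit + 1:])
    return long_lines / nonblank > threshold
```

Blank lines are left out of the ratio so that a file with blank lines between paragraphs doesn't dilute the share of long lines.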

Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (i.e., paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances would give a pretty good indication of whether it's a line-break or paragraph-break file.
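
The newline-quote ratio could be sketched in Python along these lines. The function name, the set of quote characters, and the CRLF normalization are assumptions for this example, and how to read the resulting number would still need checking against real files.

```python
import re

# Quote characters that may open a line of dialogue (assumed set: straight
# double and single quotes plus their curly opening forms).
QUOTES = "\"'\u201c\u2018"


def quote_paragraph_ratio(text):
    """Return newline-newline-quote count divided by newline-quote count.

    Returns None when no line starts with a quote, since the ratio is
    undefined in that case.
    """
    text = text.replace("\r\n", "\n")  # treat Windows line endings as one newline
    quote_class = "[" + re.escape(QUOTES) + "]"
    single = len(re.findall(r"\n" + quote_class, text))
    double = len(re.findall(r"\n\n" + quote_class, text))
    return double / single if single else None
```

A ratio near 1 means nearly every quote-led line follows a blank line, which points to paragraphs separated by blank lines; a ratio near 0 means quote-led lines follow ordinary lines directly.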
Good points. Thanks for that. Based on my own tests, it is already fairly accurate... I think I might need to stop relying solely on the whitespace sequences to determine whether the file has intra-paragraph line breaks, and instead use the whitespace sequences for the paragraph fixing only once I've conclusively determined that it does.

- Ahi