Quote:
Originally Posted by ekaser
You're really looking at a statistical analysis (effectively) to figure out what the paragraph formatting of a text file is. Another thing you could include in this process is the maximum line length. You could just figure an "average" line length, and if it's over a certain amount, you probably have a paragraph-break file. Another, perhaps better, way is to keep a running count of the number of lines of length N, from 0 to ... say... oh, 255. Any line of length >255 gets lumped in with lines of length 255. For line-break files, the bulk of the file will be lines shorter than some number (80 to 128 max, I'd guess) with very few over that. A paragraph-break file will have far more lines with lengths greater than that limit, with probably quite a few in the 255 bucket.
Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (i.e., paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances would give a pretty good indication of whether it's a line-break or paragraph-break file.
|
Good points. Thanks for that. Based on my own tests, it is already fairly accurate... but I think I might need to stop relying solely on the whitespace sequences to determine whether the file has intra-paragraph line breaks, and instead use them only for fixing the paragraphs once I've conclusively determined that it does.
- Ahi
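
For anyone following along, here is a rough Python sketch of the two heuristics ekaser describes in the quoted post: the clamped line-length histogram and the newline-newline-quote ratio. The function names, the 128-character cutoff, and the choice of a plain double quote as the quote character are my own assumptions for illustration, not anything from the posts above. It also assumes line endings have already been normalized to "\n". Where to put the decision thresholds is left open.

```python
from collections import Counter

MAX_BUCKET = 255  # lengths above this are lumped into the last bucket
LONG_LINE = 128   # assumed upper bound for a hard-wrapped line


def length_histogram(text):
    """Count lines by length, clamping every length to MAX_BUCKET."""
    return Counter(min(len(line), MAX_BUCKET) for line in text.split("\n"))


def long_line_fraction(text):
    """Fraction of non-empty lines that are longer than LONG_LINE."""
    hist = length_histogram(text)
    non_empty = sum(n for length, n in hist.items() if length > 0)
    if non_empty == 0:
        return 0.0
    long_lines = sum(n for length, n in hist.items() if length > LONG_LINE)
    return long_lines / non_empty


def blank_before_quote_ratio(text, quote='"'):
    """Ratio of newline-newline-quote to newline-quote occurrences.

    Returns None when the text has no newline-quote pair at all.
    """
    single = text.count("\n" + quote)
    double = text.count("\n\n" + quote)
    return double / single if single else None
```

A long-line fraction near 0 suggests hard-wrapped lines, while a large fraction (with many lines in the 255 bucket) suggests one line per paragraph. A quote ratio near 1 means quoted paragraphs are nearly always preceded by a blank line.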