Quote:
Originally Posted by ekaser
None that I saw, but then I only tried the four files, so that's a fairly limited set of data points. 
|
True.
This stuff:
Quote:
[1]: 1548.84871249
[1]: 137.851965984
[1]: 1.05804710955
[1]: 0.702609408684
[1]: 0.32237372869
|
is based on a percentage calculation (abused beyond recognition) done on the text length (file size, sans formatting) and the whitespace pattern frequency. The first value belongs to the most frequent whitespace sequence, the second to the second most frequent, the third to the third most frequent, and so on.
The first one, of course, is "single space" (word breaks). If there are intra-paragraph linebreaks, the second one is the linebreak whitespace sequence count and the third one is the paragraph break whitespace sequence count. If there are none, the second one is the paragraph break count.
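For anyone who wants to play along, here is a rough Python sketch of the ranking step only. It is not my actual calculation: the function name whitespace_profile and the "count per 10,000 characters" scaling are assumptions made up for illustration, so it will not reproduce the numbers quoted above.

```python
import re
from collections import Counter


def whitespace_profile(text, top=5):
    """Rank whitespace sequences by how often they occur in text.

    Returns (sequence, count, score) tuples, most frequent first.
    The score is the count per 10,000 characters of text; that scaling
    is an assumption for illustration, not the calculation from the post.
    """
    runs = re.findall(r"\s+", text)   # every maximal run of whitespace
    counts = Counter(runs)            # how often each distinct run occurs
    length = len(text) or 1           # avoid dividing by zero on empty input
    return [
        (seq, n, 10000.0 * n / length)
        for seq, n in counts.most_common(top)
    ]


if __name__ == "__main__":
    sample = "a b c\nd e f\ng h i\n\nj k l\nm n o\np q r\n\n"
    for seq, n, score in whitespace_profile(sample):
        print(repr(seq), n, round(score, 2))
```

On a hard-wrapped sample like the one above, the single space comes first, the lone linebreak second and the double linebreak (paragraph break) third, which is the ordering described for files with intra-paragraph linebreaks.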
Usually, if the second value is above 50 and the third value is above 3, it indicates the presence of intra-paragraph linebreaks... but hard values are not the right way to go.
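Spelled out as code, the hard-value rule would look something like the sketch below. The function name looks_hard_wrapped and its parameter names are hypothetical; only the cutoffs 50 and 3 come from the rule of thumb above.

```python
def looks_hard_wrapped(scores, second_min=50.0, third_min=3.0):
    """Apply the hard-value rule of thumb to a list of scores.

    scores is expected to be ordered from the most frequent whitespace
    sequence to the least frequent. The defaults 50 and 3 are the
    rule-of-thumb cutoffs; the function itself is only an illustration.
    """
    if len(scores) < 3:
        return False  # no third value, so the rule cannot fire
    return scores[1] > second_min and scores[2] > third_min


if __name__ == "__main__":
    # A hypothetical file just under the cutoff, and one just over it.
    print(looks_hard_wrapped([1500.0, 120.0, 2.9]))
    print(looks_hard_wrapped([1500.0, 120.0, 3.1]))
```

The two calls differ only in a third value of 2.9 versus 3.1 and get opposite answers, which is exactly the brittleness that makes fixed cutoffs a poor fit.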
I need to figure out the calculation (perhaps ratios of the whitespace sequence counts?) that yields reliable results in "all" cases, as I suspect there might easily be files out there with intra-paragraph linebreaks where the third value only comes to 2.9. What makes this hard, though, is that the second and third values vary pretty wildly... so I'm not too sure comparisons/ratios between those two are the way to go.
- Ahi