MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ekaser · 09-14-2009, 02:35 PM

Quote:

Originally Posted by ahi

This stuff is based on a percentage calculation (abused beyond recognition) done on the text length (file size, sans formatting) and the whitespace pattern frequency... the first one is the most frequent whitespace sequence, the second one the second most frequent, the third one the third most frequent...

You're effectively doing a statistical analysis to figure out what the paragraph formatting of a text file is. Another thing you could include in this process is the line length. You could just figure an "average" line length, and if it's over a certain amount, you probably have a paragraph-break file. Another, perhaps better, way is to keep a running count of the number of lines of each length N, from 0 to ... say... oh, 255. Any line of length >255 gets lumped in with lines of length 255. For line-break files, the bulk of the lines will be shorter than some limit (80 to 128 max, I'd guess), with very few over that. A paragraph-break file will have far more lines longer than that limit, with probably quite a few in the 255 bucket.
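Here's a rough sketch of that bucket count in Python, just to make the idea concrete. The function names, the 255 cap, and the 128 limit are placeholders taken from the guesses above, not anything from pacify.py itself.

```python
def line_length_histogram(text, cap=255):
    """Count lines by length; anything longer than `cap` lands in the last bucket."""
    counts = [0] * (cap + 1)
    for line in text.splitlines():
        counts[min(len(line), cap)] += 1
    return counts


def fraction_over(counts, limit=128):
    """Fraction of all lines whose (capped) length is greater than `limit`."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[limit + 1:]) / total


if __name__ == "__main__":
    # Three short wrapped lines plus one very long "paragraph" line.
    sample = "short line\n" * 3 + "x" * 300 + "\n"
    hist = line_length_histogram(sample)
    print(hist[10], hist[255], fraction_over(hist))
```

A fraction near zero would suggest a line-break file; where exactly to put the cutoff is something you'd have to tune against real files.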

Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (i.e., paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances would give a pretty good indication of whether it's a line-break or paragraph-break file.
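The quote ratio could be counted something like this. It's only a sketch: the function name is made up, and I'm assuming "quote" means a straight double quote or a left curly double quote, and that line endings can be normalized to '\n' first.

```python
import re

# Assumed set of opening-quote characters: straight " and left curly quote.
QUOTE_CLASS = '["\u201c]'


def newline_quote_ratio(text):
    """Ratio of newline-newline-quote to newline-quote occurrences.

    Returns None when the text has no newline-quote pairs at all.
    """
    # Normalize Windows and old Mac line endings so '\n' is the only newline.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    single = len(re.findall("\n" + QUOTE_CLASS, text))
    double = len(re.findall("\n\n" + QUOTE_CLASS, text))
    if single == 0:
        return None
    return double / single


if __name__ == "__main__":
    sample = '"Hi," he said.\n\n"Hello."\n"Bye."\n'
    print(newline_quote_ratio(sample))
```

Every newline-newline-quote is also counted as a newline-quote, so the result always falls between 0 and 1.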

If you use at least 2 or 3 different "statistical rulers" and they all agree (or 2 out of 3), then that's about the best indicator you're going to get.
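The "2 out of 3" check is easy enough to sketch too. This assumes each ruler returns a label such as "line-break" or "paragraph-break", or None when it can't decide; the labels and the function name are just illustrative.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the label that a strict majority of the cast votes agree on, else None."""
    counted = Counter(v for v in votes if v is not None)
    if not counted:
        return None
    label, n = counted.most_common(1)[0]
    # Require more than half of the votes actually cast.
    return label if n * 2 > sum(counted.values()) else None


if __name__ == "__main__":
    print(majority_verdict(["line-break", "line-break", "paragraph-break"]))
```

If the rulers split evenly, you get None back, which is probably the honest answer for a file that's formatted inconsistently.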