MobileRead Forums - View Single Post

ahi · 06-03-2009, 01:12 PM

Assuming that the paragraph break information in Gideon's file can in fact be retrieved accurately, throwing together a Python script that works as follows might help:

In sequentially parsing the text file, build a list that consists of (1) strings of consecutive non-whitespace characters, and (2) numbers giving the weight of each run of consecutive whitespace. As explained in my "text processing ideas" post, the weighting ought to assign a value of 1 for each space character and 1000 for each linebreak (which can be chr(10), chr(13), or the two together [which should still be counted as a single linebreak]). (Tabs, I suppose, could be counted as having a weight of 4 or 8 or even larger.)

1 linebreak + 5 spaces = 1005
2 linebreaks + 0 spaces = 2000
1 space + 2 linebreaks + 5 spaces = 2006
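To make the weighting concrete, here is a rough sketch of the parsing step. The names (tokenize, whitespace_weight) and the tab weight of 4 are my own picks for illustration, and it only treats space, tab, chr(10) and chr(13) as whitespace, so take it as one possible way to do it rather than a finished script:

```python
import re

SPACE_WEIGHT = 1
TAB_WEIGHT = 4          # assumption: could just as well be 8 or larger
LINEBREAK_WEIGHT = 1000

# A token is either a run of whitespace or a run of non-whitespace characters.
_TOKEN = re.compile(r"[ \t\r\n]+|[^ \t\r\n]+")
# "\r\n" comes first so that the pair is counted as a single linebreak.
_LINEBREAK = re.compile(r"\r\n|\r|\n")


def whitespace_weight(run):
    """Return the weight of one run of consecutive whitespace."""
    linebreaks = len(_LINEBREAK.findall(run))
    return (linebreaks * LINEBREAK_WEIGHT
            + run.count(" ") * SPACE_WEIGHT
            + run.count("\t") * TAB_WEIGHT)


def tokenize(text):
    """Return a list mixing word strings and integer whitespace weights."""
    tokens = []
    for match in _TOKEN.finditer(text):
        run = match.group()
        if run[0] in " \t\r\n":
            tokens.append(whitespace_weight(run))
        else:
            tokens.append(run)
    return tokens


print(tokenize("one two\n     three"))
```

Strings in the resulting list are words and integers are whitespace weights, so the two kinds of entry are easy to tell apart later.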

Once this is done, you basically have a list that you could process sequentially to recreate the input file (save for the exact whitespace characters, which are now represented only by their weights).

At that point, take all the whitespace weights, put them in a separate list, and get the mode of that list. It will be the weight of whitespace used to separate words.

Remove from the list of whitespace weights all instances of the mode weight, and take the mode of what remains. It will be the weight of whitespace used to separate lines.

Then remove from the list of whitespace weights all instances of this second mode weight as well, and take the mode of what remains once more. It will be the weight of whitespace used to separate paragraphs.
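The three "take the mode, remove it, take the mode again" passes might look something like the following. The function name successive_modes is made up for this sketch, and it assumes the weights have already been collected into a plain list of integers:

```python
from collections import Counter


def successive_modes(weights, count=3):
    """Return up to `count` modes, removing each mode before finding the next."""
    remaining = Counter(weights)
    modes = []
    for _ in range(count):
        if not remaining:
            break  # fewer distinct weights than requested
        mode, _frequency = remaining.most_common(1)[0]
        modes.append(mode)
        del remaining[mode]  # drop all instances of this mode weight
    return modes


# Made-up sample: mostly single spaces, some plain linebreaks,
# and a few double linebreaks followed by a 4-space indent.
sample = [1] * 10 + [1000] * 5 + [2004] * 2 + [2000]
word_weight, line_weight, paragraph_weight = successive_modes(sample)
print(word_weight, line_weight, paragraph_weight)
```

If two weights occur equally often, which one comes out first is more or less arbitrary, so a real script would probably want to sanity-check the result.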

Output the list, replacing the whitespace weights with the appropriate characters (space for word spacing, space for line spacing, linebreak for paragraph spacing).
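A sketch of that output step, assuming the list holds words as strings and whitespace weights as integers. The name rebuild is hypothetical, and so is the fallback: what to do with a weight that matches none of the three is not settled above, so here it simply becomes a single space:

```python
def rebuild(tokens, word_weight, line_weight, paragraph_weight):
    """Turn a list of words and whitespace weights back into text."""
    replacements = {
        word_weight: " ",        # word spacing -> space
        line_weight: " ",        # line spacing -> space (joins wrapped lines)
        paragraph_weight: "\n",  # paragraph spacing -> linebreak
    }
    pieces = []
    for token in tokens:
        if isinstance(token, int):
            # Assumption: unrecognized weights fall back to a single space.
            pieces.append(replacements.get(token, " "))
        else:
            pieces.append(token)
    return "".join(pieces)


print(rebuild(["a", 1, "b", 1000, "c", 2004, "d"], 1, 1000, 2004))
```

Treating every unrecognized weight as a space is the crudest possible choice; comparing it against the paragraph weight instead would likely preserve more paragraph breaks.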

Of course, such an approach won't work if the text file is as my fellow forum members suggest it to be. But if they are guessing or mistaken, I might throw this script together this evening and post it...

- Ahi
