Old 08-29-2009, 11:41 AM   #4
ahi
I will figure out the cause of the error, but I should perhaps note, nrapallo (in case it is unclear to you or to others), that the -p option corrects erroneous paragraph breaks, not systematic ones.

The program, regardless of whether the -p option is used, fixes systematic paragraph breaking like:

Code:
Here I am!  I travelled yesterday for four hours in a train.  It's a

funny sensation, isn't it?  I never rode in one before.



College is the biggest, most bewildering place--I get lost whenever I

leave my room.  I will write you a description later when I'm feeling

less muddled; also I will tell you about my lessons.  Classes don't

begin until Monday morning, and this is Saturday night.  But I wanted

to write a letter first just to get acquainted.
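
As a rough illustration of that systematic fix (a sketch only, not the actual pacify.py code), the idea can be expressed as: count the blank lines after each text line, assume the most common count is ordinary line wrapping, and treat any longer run as a real paragraph break. The function name and the "most common gap" heuristic are assumptions made for this example.

```python
from collections import Counter


def fix_systematic_breaks(text):
    """Rejoin lines in text where every line is followed by blank lines."""
    # Pair each non-blank line with the number of blank lines after it.
    chunks = []
    for line in text.splitlines():
        if line.strip():
            chunks.append([line.strip(), 0])
        elif chunks:
            chunks[-1][1] += 1
    if not chunks:
        return ""

    # Assumption: the most common gap is plain line wrapping, not a
    # paragraph break. The gap after the last line is ignored here.
    gaps = [gap for _, gap in chunks[:-1]]
    wrap_gap = Counter(gaps).most_common(1)[0][0] if gaps else 0

    paragraphs, current = [], []
    for line, gap in chunks:
        current.append(line)
        if gap > wrap_gap:  # a longer run of blanks ends the paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Note that under this heuristic a stray extra blank line in the middle of a paragraph is read as a paragraph break, which is exactly the kind of error that needs separate handling.
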
What -p would fix is a case where the same lines looked like this:

Code:
Here I am!  I travelled yesterday for four hours in a train.  It's a

funny sensation, isn't it?  I never rode in one before.



College is the biggest, most bewildering place--I get lost whenever I

leave my room.  I will write you a description later when I'm feeling


less muddled; also I will tell you about my lessons.  Classes don't

begin until Monday morning, and this is Saturday night.  But I wanted

to write a letter first just to get acquainted.
The -p option would detect that the "paragraph" ending with "... I'm feeling" and the paragraph after it, starting with "less muddled; also ...", are almost certainly supposed to be a single paragraph.
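
One plausible way to detect such a case (a hedged sketch, not necessarily how pacify.py does it) is to look for a paragraph that does not end in sentence-final punctuation and is followed by one that starts with a lowercase letter. The function name, the punctuation set, and the shape of the report are all assumptions for illustration.

```python
def merge_erroneous_breaks(paragraphs):
    """Merge paragraphs that look like one paragraph split by mistake.

    Returns (merged_paragraphs, report), where report lists the text on
    either side of each break that was removed.
    """
    # Assumed set of characters that can legitimately end a paragraph.
    terminal = (".", "!", "?", '"', "'", ":")
    merged, report = [], []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        looks_split = (
            merged
            and not merged[-1].endswith(terminal)
            and para[0].islower()
        )
        if looks_split:
            # Record what was joined so the user can review the change.
            report.append((merged[-1][-20:], para[:20]))
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged, report
```

Returning a report alongside the result mirrors the behaviour described in this post, where the user is told what was changed and can spot a wrong correction.
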

While the -p option is good to use (once it works reliably) on all files "just in case" (and since it reports to the user what it changes, you'll know if it corrects something in error), a file that has no such paragraph errors could be nicely processed with:

pacify.py -i input.txt -cq

Doing so with 157.txt yields the attached. At first look, it seems to work rather nicely, smartening up all single quotes without interfering with or being confused by apostrophes, though if and when you find it messed up somewhere in this file, nrapallo, do let me know. It would almost certainly get it wrong if there were a word like 'tis that begins with a single quote, though since there are not many such words, it's not unreasonable for my program to keep a list of those so it knows to treat them correctly.
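
To make the exception-list idea concrete, here is a minimal sketch (not the pacify.py implementation) that curls single quotes and consults a small list of words that begin with an apostrophe. The word list, the function name, and the rules for deciding between opening and closing quotes are assumptions for this example.

```python
import re

# Hypothetical exception list: words that start with an apostrophe.
LEADING_APOSTROPHE_WORDS = {"tis", "twas", "twere", "twill"}

_WORD = re.compile(r"[A-Za-z]+")
_OPENERS = "([{\"\u201c-"


def smarten_single_quotes(text):
    """Replace straight single quotes with curly ones."""
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
            continue
        prev = text[i - 1] if i > 0 else " "
        if prev.isalnum():
            # After a letter or digit: an apostrophe or a closing quote,
            # both of which are the right-hand mark.
            out.append("\u2019")
            continue
        # At the start of a word: check for a known elision like 'tis.
        word = _WORD.match(text, i + 1)
        if word and word.group(0).lower() in LEADING_APOSTROPHE_WORDS:
            out.append("\u2019")
        elif prev.isspace() or prev in _OPENERS:
            out.append("\u2018")  # opening quote
        else:
            out.append("\u2019")  # closing quote after punctuation
    return "".join(out)
```

A quoted phrase that itself begins with one of the listed words would still be treated as an elision here, so a list-based approach trades a rare error for a common one.
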

- Ahi
Attached Files
File Type: txt output.txt (230.9 KB, 358 views)