Old 03-20-2010, 02:44 AM   #8
NightGeometry
I haven't checked your script, but some of the checks I did the last time I cleaned a Gutenberg text included:
- if the next character after a linebreak isn't a lowercase character, assume it's a real line break.
- if the character before the line break is a full stop, assume it's a real line break.
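
For illustration, here is a rough Python sketch of those two checks combined into one regex. It is not the original regex from that cleanup, just one way the rules could be written. The names `SOFT_BREAK` and `unwrap` are made up for this example, it assumes Unix line endings with no trailing spaces, and it adds one extra guard (never join across a blank line) that is not part of the two checks above.

```python
import re

# Hypothetical pattern: a line break is treated as "soft" (safe to join) only if
#   - the character before it is not a full stop (and not another newline,
#     which is an added assumption so blank lines between paragraphs survive), and
#   - the character after it is a lowercase letter.
# Every other line break is assumed to be a real one and is left alone.
SOFT_BREAK = re.compile(r"(?<![.\n])\n(?=[a-z])")


def unwrap(text: str) -> str:
    """Replace soft line breaks with a single space."""
    return SOFT_BREAK.sub(" ", text)


if __name__ == "__main__":
    sample = "It was a dark\nand stormy night.\nThe rain fell\nin torrents."
    print(unwrap(sample))
```

On the sample text this joins "dark" with "and" and "fell" with "in", but keeps the break after "night." because it follows a full stop. It will still miss wrapped lines that continue with a capitalised word (a name, say), so the output is worth a skim afterwards.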

It seemed to work for me. I'm not sure I still have the regexes around, but I'll go have a look.