I haven't checked your script, but some of the checks I did last time I cleaned a Gutenberg text included:
- if the next character after a linebreak isn't a lowercase character, assume it's a real line break.
- if the character before the line break is a full stop, assume it's a real line break.
It seemed to work for me. I'm not sure I still have the regexes around, but I'll go have a look.
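
In the meantime, here's a rough Python sketch of those two checks. It's a reconstruction, not the original regexes: the function name `join_soft_breaks` is made up, it assumes line endings are already normalised to `\n`, and the extra guard that leaves blank lines alone is my own assumption.

```python
import re

# A line break is treated as a soft wrap (and replaced by a space) only when:
#   - the character before it is not a full stop, and
#   - the character after it is a lowercase ASCII letter.
# The "\n" in the lookbehind is an added assumption so that the second
# newline of a blank line (paragraph gap) is never joined.
SOFT_BREAK = re.compile(r"(?<![.\n])\n(?=[a-z])")


def join_soft_breaks(text: str) -> str:
    """Join hard-wrapped lines, keeping breaks that look intentional."""
    return SOFT_BREAK.sub(" ", text)


if __name__ == "__main__":
    sample = (
        "It was a dark and\n"
        "stormy night.\n"
        "the rain fell\n"
        "In torrents.\n"
        "\n"
        "new paragraph"
    )
    print(join_soft_breaks(sample))
```

Only the first break in the sample gets joined. The others are kept because they follow a full stop, precede an uppercase letter, or belong to a blank line. It will still miss wrapped lines that continue with a capitalised word (names, "I", etc.), so expect to tidy up by hand afterwards.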