Quote:
Originally Posted by icearch
To be clear, I'm using plain text to do the correction. Working this with lots of tags would be a pain in the a$$.
I mean, I guess what I'm saying is that you could just as easily search for \p{P}\n, and it would effectively do the same thing as your original search: namely, find the end of every single line in the text file that ends in a punctuation mark. (NOTE: this is in the context of English punctuation; I haven't tested to what extent \p{P} matches non-English punctuation.) It's still going to be thousands and thousands of matches.
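To make the \p{P}\n idea concrete, here is a rough sketch in Python. Python's built-in re module doesn't understand \p{...}, so this stands in for it by checking the Unicode general category of the last character on each line via the standard unicodedata module. The sample text and the helper name ends_in_punctuation are made up for illustration. By the Unicode definition, the P categories also cover non-ASCII marks such as « and », which the last sample line exercises.

```python
import unicodedata

def ends_in_punctuation(line: str) -> bool:
    # Stand-in for the regex "\p{P}\n": true when the last character of the
    # line belongs to a Unicode punctuation category (Po, Pd, Ps, Pe, Pi, Pf, Pc).
    return bool(line) and unicodedata.category(line[-1]).startswith("P")

# Hypothetical sample text, not taken from any real conversion.
sample = "It was late.\nShe turned the\npage slowly,\nand read on\n«Fin»\n"

# Split into lines, dropping the empty string left after the final newline.
lines = sample.split("\n")[:-1]

matches = [ln for ln in lines if ends_in_punctuation(ln)]
print(len(matches), "of", len(lines), "lines end in punctuation")
```

On a real book-length file, most correctly ended lines will match too, which is why the hit count runs into the thousands.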
One thing you could do is replace wherever I have "</p>\s+<p[^>]*?>" with "\n", and you should get the same effect in the context of a plain text file. (The italics tag will obviously be ignored.)
In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter... i.e. the most common type of broken line in PDF conversions. My search is super old, so I was still using "[a-z]" for capturing lowercase letters when I first started iterating on it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L, followed by lowercase L, not uppercase I). Or even just "\p{L}"...
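Here's a minimal sketch of the "line ends in a lowercase letter" case in Python, in case it helps. Again, the built-in re module has no \p{Ll}, so this checks the Unicode category directly with unicodedata. The function name join_broken_lines and the sample text are hypothetical, and joining with a single space is my assumption about what the replacement should be.

```python
import unicodedata

def join_broken_lines(text: str) -> str:
    # Stand-in for replacing "(\p{Ll})\n" with "\1 ": a newline that directly
    # follows a lowercase letter (any script, not just a-z) becomes a space.
    out = []
    for i, ch in enumerate(text):
        if ch == "\n" and i > 0 and unicodedata.category(text[i - 1]) == "Ll":
            out.append(" ")  # assumed replacement: join with one space
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical sample: two lines broken mid-sentence, one ending in "é".
broken = "The quick brown\nfox jumped over the café\nwall.\nDone.\n"
fixed = join_broken_lines(broken)
print(fixed)
```

Note that "[a-z]" would have missed the line ending in "é", which is the improvement \p{Ll} buys. Switching the check to startswith("L") would approximate \p{L} and also catch lines ending in an uppercase letter.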