05-29-2026, 01:19 PM   #7
ElMiko
Fanatic
Posts: 570
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
Quote:
Originally Posted by icearch View Post
To be clear, I'm using plain text to do the correction. Working this with lots of tags would be a pain in the a$$.
I mean, I guess what I'm saying is that you could just as easily search for \p{P}\n, and it would effectively do the same thing as your original search: namely, find the end of every single line in the text file that ends in a punctuation mark. (NOTE: this is in the context of English punctuation; I haven't tested to what extent \p{P} matches non-English punctuation.) It's still going to be thousands and thousands of matches.
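To make the "every line that ends in punctuation" point concrete, here's a rough Python sketch. Python's built-in re module doesn't understand \p{P}, so this checks the Unicode general category of the last character on each line instead, which is what \p{P} is defined in terms of. The helper name and the sample text are made up for illustration.

```python
import unicodedata

def ends_in_punctuation(line: str) -> bool:
    # Stand-in for the regex \p{P}\n on a single line: True when the last
    # character has a Unicode general category starting with "P".
    return bool(line) and unicodedata.category(line[-1]).startswith("P")

# Hypothetical sample with two hard-wrapped lines in the middle.
sample = 'It was late.\nThe rain had not\nstopped, and\nshe said: "Go."\n'

hits = [ln for ln in sample.split("\n") if ends_in_punctuation(ln)]
print(len(hits), "lines end in punctuation:", hits)
```

Note that the two lines it flags are both legitimate paragraph endings, while the two broken lines in the middle are skipped, which is why a search like this produces so many matches that need no fixing.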

One thing you could do is replace "</p>\s+<p[^>]*?>" with "\n" wherever it appears in my search, and you should get the same effect in the context of a plain text file. (The italics tag will obviously be ignored.)
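As a sketch of that swap, here is a deliberately simplified pattern run against both an HTML fragment and its plain-text equivalent. The pattern below is hypothetical and is not the actual search from this thread; joining the two halves with a single space is also just an assumption about what the replacement should be.

```python
import re

# Hypothetical tag-based search (NOT the real one from the thread):
# a lowercase letter, a paragraph break, then another lowercase letter.
html_pattern = r"([a-z])</p>\s+<p[^>]*?>([a-z])"

# Swap the paragraph-break fragment for a newline to get the plain-text form.
text_pattern = html_pattern.replace(r"</p>\s+<p[^>]*?>", r"\n")

html = '<p class="x">the rain had not</p>\n<p class="x">stopped that night.</p>'
text = "the rain had not\nstopped that night."

# Rejoin the broken line with a single space in both versions.
joined_html = re.sub(html_pattern, r"\1 \2", html)
joined_text = re.sub(text_pattern, r"\1 \2", text)

print(text_pattern)
print(joined_html)
print(joined_text)
```

The only thing that changes between the two versions is the fragment that stands for "end of one line, start of the next"; the capture groups and the replacement stay the same.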

In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter, i.e. the most common type of broken line in PDF conversions. My search is super old, so I was still using "[a-z]" to capture lowercase letters when I first started iterating on it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L followed by lowercase L, not uppercase I). Or even just "\p{L}"...
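For a quick feel of how the three character classes differ, here is a small Python sketch. The standard re module has no \p{...} support, so the Ll and L columns are emulated with unicodedata categories; the classify helper is made up for illustration.

```python
import re
import unicodedata

def classify(ch: str) -> dict:
    # Which of the three classes would accept this single character?
    cat = unicodedata.category(ch)
    return {
        "[a-z]": re.fullmatch(r"[a-z]", ch) is not None,
        "Ll": cat == "Ll",          # emulates \p{Ll}: lowercase letters only
        "L": cat.startswith("L"),   # emulates \p{L}: any letter
    }

# "\u00e9" is an accented lowercase e written as a single code point.
for ch in ["e", "\u00e9", "T", "."]:
    print(repr(ch), classify(ch))
```

An accented lowercase letter slips past "[a-z]" but is caught by \p{Ll}, and an uppercase letter at the end of a line is only caught by \p{L}.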

Last edited by ElMiko; 05-29-2026 at 01:47 PM.