Old 05-29-2026, 04:31 PM   #8
icearch
Groupie
Posts: 157
Karma: 2000
Join Date: Nov 2025
Device: none
Quote:
Originally Posted by ElMiko
I mean, I guess what I'm saying then is that you could just as easily search for \p{P}\n and it's effectively doing the same thing that your original search is doing: namely, finding the end of every single line in the text file that ends in a punctuation. (NOTE: this is in the context of English punctuation; I haven't tested to what extent \p{P} matches non-English punctuation). It's still going to be thousands and thousands of matches.

one thing you could do is replace wherever i have "</p>\s+<p[^>]*?>" with "\n" and you should get the same effect in the context of a plain text file. (the italics tag will obviously be ignored).

In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter... i.e. the most common type of broken line in pdf conversions. My search is super old so I was still using "[a-z]" for capturing lower case letters when i first started iterating it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L, followed by lowercase L, not uppercase I). Or even just "\p{L}"...
I mean... yes? I don't quite understand what you're saying here.

You said that I didn't capture the most common type of broken line, the kind that ends in a letter, but handling those lines is exactly what I'm trying to achieve.

My plan is to:

1. Find every line that looks like the last line of a paragraph, i.e. one that ends in punctuation. These are the non-broken lines.

2. Mark the end of each of those lines, so the broken lines are the ones left unmarked.

3. Remove every \n, so everything merges into one giant paragraph, with the special markers showing where each paragraph is supposed to end.

4. Replace the end markers with \n.
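
The four steps above could be sketched roughly like this in Python. It is only an illustration under a few assumptions: the function name `reflow` and the marker character are made up for the example, Python's built-in `re` module has no `\p{P}`, so an explicit class of sentence-ending punctuation (optionally followed by closing quotes or brackets) stands in for it, and broken lines are joined with a single space.

```python
import re

# Hypothetical end-of-paragraph marker: a private-use character that
# is assumed never to appear in the text being processed.
MARK = "\uE000"

# Assumption: a "paragraph end" is a line ending in . ! ? or an ellipsis,
# optionally followed by closing quotes/brackets and trailing spaces.
END_OF_PARAGRAPH = re.compile(r"""([.!?…]["'”’)\]]*)[ \t]*\n""")


def reflow(text: str) -> str:
    # Steps 1-2: find lines that end in punctuation and mark them.
    marked = END_OF_PARAGRAPH.sub(lambda m: m.group(1) + MARK, text)
    # Step 3: remove every remaining newline, joining broken lines
    # with a single space (assumes breaks fall between words).
    merged = re.sub(r"[ \t]*\n[ \t]*", " ", marked)
    # Step 4: turn the markers back into newlines.
    return merged.replace(MARK, "\n")


sample = (
    "The quick brown\n"
    "fox jumps over\n"
    "the lazy dog.\n"
    "A second paragraph\n"
    "ends here!\n"
)
print(reflow(sample))
```

Lines that end in a letter, a comma, or anything else outside the punctuation class are never marked, so they are merged into the following line without having to be matched explicitly. Widening or narrowing the character class changes which lines count as paragraph ends.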

By not trying to find the broken lines directly, I avoid having to distinguish all the weird edge cases. After all, the ultimate goal is to rearrange the paragraphs, so merging everything together and then separating it again works fine too.

As for getting thousands of results, yeees...? Searching for the broken lines would return tens of times more results, so I really don't get what you mean.

Hoping for the best.

Last edited by icearch; 05-29-2026 at 04:34 PM.