Quote:
Originally Posted by ElMiko
I mean, I guess what I'm saying then is that you could just as easily search for \p{P}\n and it's effectively doing the same thing that your original search is doing: namely, finding the end of every single line in the text file that ends in a punctuation mark. (NOTE: this is in the context of English punctuation; I haven't tested to what extent \p{P} matches non-English punctuation.) It's still going to be thousands and thousands of matches.
One thing you could do is replace wherever I have "</p>\s+<p[^>]*?>" with "\n", and you should get the same effect in the context of a plain text file. (The italics tag will obviously be ignored.)
In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter... i.e. the most common type of broken line in PDF conversions. My search is super old, so I was still using "[a-z]" for capturing lowercase letters when I first started iterating on it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L, followed by lowercase L, not uppercase I). Or even just "\p{L}"...
|
I mean... yes? I don't quite understand what you are saying here.
You said that I don't capture the most common type of broken line, the kind that ends in a letter. But handling those lines is exactly what I'm trying to achieve.
Here is what I'm going to do:
1. Find every line that looks like the last line of a paragraph, i.e. one that ends with punctuation. Those are the non-broken lines.
2. Mark the end of each of those lines. The broken lines are left unmarked.
3. Remove every \n, so everything merges into one giant paragraph, with the special marks indicating where each paragraph is supposed to end.
4. Replace the end marks with \n.
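The four steps above can be sketched roughly as follows. This is only an illustration in Python, not the actual search-and-replace expressions from this thread. The marker character, the function names, and the choice to join lines with a space (rather than removing \n outright, so that words don't fuse together) are all my own assumptions. Python's built-in re module has no \p{P}, so the sketch checks the Unicode category of the last character instead.

```python
import unicodedata

# Hypothetical end-of-paragraph marker: a private-use character that is
# assumed never to occur in the text being repaired.
MARK = "\uE000"


def ends_with_punctuation(line: str) -> bool:
    """True if the last non-space character is Unicode punctuation (like \\p{P})."""
    stripped = line.rstrip()
    return bool(stripped) and unicodedata.category(stripped[-1]).startswith("P")


def rejoin_paragraphs(text: str) -> str:
    marked = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are simply dropped
        # Steps 1-2: mark lines that look like the end of a paragraph.
        marked.append(line + MARK if ends_with_punctuation(line) else line)
    # Step 3: drop the line breaks (joined with a space so words don't fuse).
    merged = " ".join(marked)
    # Step 4: turn every marker back into a real line break.
    return merged.replace(MARK + " ", "\n").replace(MARK, "\n")


sample = (
    "The quick brown\n"
    "fox jumped over\n"
    "the lazy dog.\n"
    "A second paragraph\n"
    "ends here!\n"
)
result = rejoin_paragraphs(sample)
print(result)
```

Note that a broken line which happens to end in punctuation (a comma, say) would still be treated as a paragraph end here, since the check only looks at the last character.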
By not trying to find the broken lines directly, I avoid having to distinguish all the weird cases. After all, the ultimate goal is to rearrange the paragraphs, so merging everything together and then separating it again works fine too.
As for getting thousands of results, yeees...? Searching for the broken lines would give ten times more results, so I really don't get what you mean.
Hope the best.