05-29-2026, 08:23 PM   #10
icearch
Groupie
Posts: 157
Karma: 2000
Join Date: Nov 2025
Device: none
I still don't get what you mean. Of course I'm not going to mark the last line of every paragraph manually.

Given our language barrier, I'll show you with some random text.

1. This is some random text from a novel PDF; its lines end with all sorts of characters.
My first regex finds every line that ends a paragraph.

[Attached image: 01.png]

2. After the first replace:

[Attached image: 02.png]

3. Mark every other line with a different tag:

[Attached image: 03.png]

4. With that done:

[Attached image: 04.png]


5. Remove every \n:

[Attached image: 05.png]

6. After that:

[Attached image: 06.png]

7. Put the desired \n back:

[Attached image: 07.png]

8. Result:

[Attached image: 08.png]

9. Put the spaces back:

[Attached image: 09.png]

10. Final result:

[Attached image: 10.png]

All that's left is to wrap each paragraph in <p> tags.
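
For anyone who can't see the screenshots, here is a rough Python sketch of the same ten steps. The real regexes are only in the images, so the pattern, the marker strings `@@P@@` and `@@L@@`, and the function name `rejoin_paragraphs` are all assumptions made for illustration, not the exact expressions I used.

```python
import re

# Hypothetical markers; any strings that never occur in the text would do.
PARA_END = "@@P@@"
LINE_END = "@@L@@"

# Assumed rule: a paragraph ends on . ! or ? optionally followed by
# closing quotes or brackets, then optional trailing spaces.
END_OF_PARAGRAPH = re.compile(r"([.!?][\"')\]]*)[ \t]*$", re.MULTILINE)

# Every other non-blank line, i.e. one that does not already carry PARA_END.
OTHER_LINE = re.compile(r"^(?!.*@@P@@$)(.*\S)[ \t]*$", re.MULTILINE)


def rejoin_paragraphs(text: str) -> str:
    # Steps 1-2: tag the lines that end a paragraph.
    text = END_OF_PARAGRAPH.sub(r"\1" + PARA_END, text)
    # Steps 3-4: tag every remaining line with a different marker.
    text = OTHER_LINE.sub(r"\1" + LINE_END, text)
    # Steps 5-6: remove every newline.
    text = text.replace("\n", "")
    # Steps 7-8: put the wanted newlines back.
    text = text.replace(PARA_END, "\n")
    # Steps 9-10: put the spaces back where broken lines were joined.
    return text.replace(LINE_END, " ")


sample = (
    "It was a dark\n"
    "and stormy night.\n"
    "The rain fell (in\n"
    "torrents, mostly).\n"
)
print(rejoin_paragraphs(sample))
```

In this sample, "It was a dark" ends with a letter and "The rain fell (in" ends inside a parenthesis; both are joined to the following line, and only the lines ending in a full stop close a paragraph.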

I think it pretty much does what it should. I don't understand why you said it can't handle basic broken lines that end with letters.

As for why I list every punctuation mark myself instead of using \p{P}: that class also matches

1. the opening half of a pair, namely ( [ {

and

2. non-terminal punctuation such as : , ; -

and so on, which you don't want to treat as the end of a paragraph.

A broken line can very well end with one of those, and the false match is entirely avoidable.
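
A quick way to check this is to look up the Unicode general category of each character. The sketch below uses Python's standard `unicodedata` module, since the built-in `re` module has no `\p{P}`; the helper name `is_punctuation` is just something I made up for the example. Any category starting with "P" is what `\p{P}` would match.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    # True for every general category beginning with "P"
    # (Ps = open, Pe = close, Pd = dash, Po = other, ...).
    return unicodedata.category(ch).startswith("P")


# Opening brackets, non-terminal marks, and real sentence enders side by side.
for ch in "([{:,;-.!?":
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```

All ten characters come out as punctuation, so the class on its own cannot tell "." from "(" or ",". That is why I spell out the sentence-ending marks explicitly.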

Last edited by icearch; 05-29-2026 at 10:22 PM.