MobileRead Forums - View Single Post - Using regex to fix broken paragraph in Chinese

icearch · 05-29-2026, 08:23 PM

I still don't get what you mean. Of course I'm not going to mark the last line of every paragraph manually.

Considering our language barrier, I'll show you with some random text.

1. This is some random text from a novel PDF; its lines end with all sorts of things.
My first regex finds every line that ends a paragraph:

[Attached image: 01.png]

2. After the first replace:

[Attached image: 02.png]

3. Mark every other line with a different tag:

[Attached image: 03.png]

4. Done with that:

[Attached image: 04.png]

5. Remove every \n:

[Attached image: 05.png]

6. After that:

[Attached image: 06.png]

7. Put the desired \n back:

[Attached image: 07.png]

8. Result:

[Attached image: 08.png]

9. Put the spaces back:

[Attached image: 09.png]

10. Final result:

[Attached image: 10.png]

All that's left is to wrap each paragraph in p tags.
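For anyone who can't see the screenshots, the ten steps could be sketched roughly like this in Python. The real expressions are only in the images, so everything here is an assumption: the END_PUNCT set, the two private-use placeholder characters standing in for the tags, and the function names rejoin_paragraphs and to_p_tags are all made up for illustration. It also assumes plain \n line endings.

```python
import html
import re

# Assumed set of paragraph-ending punctuation; extend it for the real text.
END_PUNCT = r"[.!?。！？…”」』]"

# Hypothetical placeholder "tags": private-use characters unlikely to be in the text.
PARA_MARK = "\uE000"  # marks the end of a real paragraph
JOIN_MARK = "\uE001"  # marks a line that was broken mid-paragraph


def rejoin_paragraphs(text, joiner=" "):
    # Steps 1-2: tag lines that end with terminal punctuation.
    text = re.sub(rf"({END_PUNCT})[ \t]*$", rf"\1{PARA_MARK}", text, flags=re.M)
    # Steps 3-4: tag every other non-empty line with the second tag.
    text = re.sub(rf"([^\s{PARA_MARK}])[ \t]*$", rf"\1{JOIN_MARK}", text, flags=re.M)
    # Steps 5-6: remove every \n.
    text = text.replace("\n", "")
    # Steps 7-8: put the desired \n back.
    text = text.replace(PARA_MARK, "\n")
    # Steps 9-10: put the spaces back (pass joiner="" for Chinese text).
    return text.replace(JOIN_MARK, joiner)


def to_p_tags(text):
    # The remaining step: wrap each non-empty paragraph in p tags.
    return "\n".join(
        f"<p>{html.escape(line, quote=False)}</p>"
        for line in text.splitlines()
        if line.strip()
    )


sample = "This is a broken\nline of text.\nSecond paragraph (with a\nbreak) here!\n"
print(to_p_tags(rejoin_paragraphs(sample)))
```

Lines that end with a letter get the second tag, so they are joined to the next line rather than being treated as paragraph ends.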

I think it pretty much does what it should? I can't understand why you said it can't handle basic broken lines that end with letters.

As for why I list every punctuation mark myself instead of using \p{P}: that class also matches

1. the opening half of a pair, namely ( [ {

and

2. non-terminal marks like : , ; -

and so on, which you don't want.

These are very likely to show up at the end of a broken line, and the false match is totally avoidable.
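A quick way to check this, assuming {P} refers to the Unicode punctuation property (written \p{P} in engines that support it). Python's built-in re module has no \p{...} support, so this sketch uses unicodedata instead; the TERMINAL set and the is_punct helper are my own hypothetical names, not the expression from the screenshots.

```python
import unicodedata


def is_punct(ch):
    # True for any character in a Unicode P* category (Ps, Pe, Po, Pd, ...),
    # which is what a generic punctuation class would accept.
    return unicodedata.category(ch).startswith("P")


# Assumed explicit list of marks that really end a sentence or paragraph.
TERMINAL = set(".!?。！？…”」』")

for ch in "([{:,;-.。”":
    print(repr(ch), unicodedata.category(ch),
          "punct" if is_punct(ch) else "other",
          "terminal" if ch in TERMINAL else "not terminal")
```

Every character in the loop counts as punctuation, but only the last three are in the explicit list, which is the whole point of spelling the list out.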

#10 · Last edited by icearch; 05-29-2026 at 10:22 PM.