MobileRead Forums - View Single Post - Using regex to fix broken paragraph in Chinese

ElMiko · 05-30-2026, 07:19 AM

When you mark a line with @@@, you are indicating that it's a paragraph break, right? (This is so that you can re-break it at the end of your regex cycle.)

You remember my example of dialogue that's been broken at a terminal punctuation mark?

Code:

“Watch out! Stay away from there! It's not safe.

Stay close to me,” he said.

Or how about:

Code:

“Watch out! Stay away from there! It's not safe,”

he warned, before turning around and running away.

Or how about:

Code:

At this point, I looked beseechingly at Mr.

Jones, and said, “He left us!”

In all three of the cases above, doing a bulk search and replace with your regex will add "@@@" to the end of the first line. That means that when you do the final step of your regex sequence (replacing "@@@" with "\n"), you'll just end up re-breaking what should be an unbroken paragraph.

The issue with the assumption built into your original regex (if you don't have a separate regex to deal with the kinds of situations I outlined above) is that it incorrectly treats all terminal punctuation marks as necessarily preceding a paragraph break, and it treats several kinds of punctuation marks as terminal when they aren't necessarily terminal at all (e.g. ” does not always denote the end of a sentence, much less the end of a paragraph).
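
To make the failure concrete, here is a minimal Python sketch of a mark-and-restore cycle. The marking regex under discussion isn't reproduced in this post, so `MARK_RE` is a hypothetical stand-in, and joining with a space assumes English text. Treat it as an illustration of the problem, not as the exact sequence being discussed.

```python
import re

# Hypothetical stand-in for the marking regex: any line ending in
# . ! ? or a closing curly quote is assumed to end a paragraph.
MARK_RE = re.compile(r"([.!?”])\n")

def mark_and_restore(text: str) -> str:
    marked = MARK_RE.sub(r"\1@@@", text)  # step 1: tag assumed paragraph ends
    joined = marked.replace("\n", " ")    # step 2: join every other broken line
    return joined.replace("@@@", "\n")    # step 3: turn the tags back into breaks

# The three broken examples from this post.
EXAMPLES = [
    "“Watch out! Stay away from there! It's not safe.\nStay close to me,” he said.",
    "“Watch out! Stay away from there! It's not safe,”\nhe warned, before turning around and running away.",
    "At this point, I looked beseechingly at Mr.\nJones, and said, “He left us!”",
]

for broken in EXAMPLES:
    # Each example comes back unchanged: the bad break survives the cycle.
    print(mark_and_restore(broken) == broken)
```

A line that ends mid-sentence without punctuation is still joined correctly; only the lines ending in "terminal" punctuation are mishandled.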

The only way to avoid this would be to check one-by-one, inserting the "@@@" manually when you determine that it really IS the end of a paragraph. And doing THAT would take forever.

---

For reasons that escape me, my regex got corrupted at some point on MobileRead, and some elements were replaced with asterisks that aren’t in the original regex. In any event, the original regex I shared with you wouldn't work because it's designed for HTML, not plain text. (Also, I'm not sure what text editor you're using... I use Sigil... but then again I only edit in HTML.)

I modified it for plain text here:

Code:

(\p{Ll}|\p{Ll}-|,|(?<!nbsp|&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=\n\p{Ll}))\n

When paired with a replace value of "\1[space]" (as in an actual blank space, not the literal text "[space]"), it will rejoin each broken line to the text that follows it at the yellow highlights in the image below.

[Image: RejoiningLines.jpg]

Note that it will not rejoin at two points (marked with red highlights) where the lines OUGHT to be joined. But your regex sequence will have the same problem, too.
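
For anyone who wants to try the plain-text regex outside Sigil, here is a rough Python sketch. It is an adaptation rather than the exact regex: Python's built-in `re` module does not support `\p{Ll}` and requires fixed-width lookbehinds, so this version assumes ASCII-only text (`[a-z]` in place of `\p{Ll}`) and splits the lookbehind in two. The function name `rejoin` is made up for the example, and the sample strings are the three examples from the top of this post.

```python
import re

# ASCII-only adaptation of the plain-text regex above (an assumption:
# [a-z] stands in for \p{Ll}, and the lookbehind is split in two
# because Python's re needs fixed-width lookbehinds).
JOIN_RE = re.compile(
    r"([a-z]|[a-z]-|,|(?<!nbsp)(?<!&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]"
    r"|(”|—|</i>)(?=\n[a-z]))\n"
)

def rejoin(text: str) -> str:
    # Same as a replace value of \1 followed by one real space.
    return JOIN_RE.sub(r"\1 ", text)

abbreviation = "At this point, I looked beseechingly at Mr.\nJones, and said, “He left us!”"
comma_quote = "“Watch out! Stay away from there! It's not safe,”\nhe warned, before turning around and running away."
full_stop = "“Watch out! Stay away from there! It's not safe.\nStay close to me,” he said."

print(rejoin(abbreviation))  # joined after "Mr."
print(rejoin(comma_quote))   # joined after the comma + closing quote
print(rejoin(full_stop))     # left broken: the line ends in a plain full stop
```

The third sample shows the kind of break this regex leaves alone, since a full stop followed by a capital letter looks like a genuine paragraph end.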

To solve the second red highlight, I use a regex search that specifically targets broken/incomplete opening/closing quotations.
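
As a rough illustration of that idea (a hypothetical pattern, not the regex referred to above), the Python sketch below joins a line that opens a curly quote without closing it to the next line, provided the next line supplies the closing quote before opening a new one. The names `UNCLOSED_QUOTE_RE` and `join_split_quotes` are invented for the example.

```python
import re

# Hypothetical pattern: an opening “ with no closing ” before the line
# break, where the next line reaches a closing ” before any new “.
UNCLOSED_QUOTE_RE = re.compile(r"(“[^“”\n]*)\n(?=[^“\n]*”)")

def join_split_quotes(text: str) -> str:
    # Replace the offending line break with a single space.
    return UNCLOSED_QUOTE_RE.sub(r"\1 ", text)

split_dialogue = "“Watch out! Stay away from there! It's not safe.\nStay close to me,” he said."
multi_paragraph = "“First paragraph of a long speech.\n“Second paragraph,” she said."

print(join_split_quotes(split_dialogue))   # the quotation is rejoined
print(join_split_quotes(multi_paragraph))  # unchanged: next line opens a new quote
```

This only handles a quotation split across two lines; a quotation broken over three or more lines would need repeated passes or a different pattern.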

To solve the first red highlight... as far as I can tell, you can't. Except by going line by line and fixing it manually.

Incidentally, this last point goes to something I said in a related thread recently: this is one of many issues with trying to fix direct PDF-to-EPUB conversions. You're much better off running OCR on the original image and producing a PDF reference copy and a separate HTML/EPUB copy. Most OCR software will be smart enough to recognize the vast majority (90%) of correct paragraph breaks.

EDIT: It corrupted the code again... and I've fixed it again. Hopefully it stays fixed this time...

Last edited by ElMiko; 05-30-2026 at 07:45 AM.