Using regex to fix broken paragraphs in Chinese

icearch · 05-27-2026, 04:46 AM

I have an idea, and I need someone familiar with regex and text processing to tell me whether it is doable.

The goal is simple: I am not aiming to be 100% correct, just to get rid of the annoying breaks.

Since Chinese has no spaces to separate words and no capital letters to mark the beginning of a sentence, I started thinking the other way round: what can be used to identify the end of a sentence?

That would be punctuation!

So I will regex search for a punctuation mark immediately followed by a line break; that will be the end of a paragraph 99% of the time!

And the rest is easy.

So this is what I came up with to find the ending:

([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n

and replace it with:

\1@@@\n

and replace:

@@@ to \n

Of course, you need to prepare the text first by removing excess spaces and empty lines.

So what do you guys think? Is there anything to improve?
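The preparation step mentioned above could look roughly like this in Python. This is only a sketch: the function name `prepare_text` is made up for illustration, and it assumes that "excess space" means leading and trailing whitespace plus repeated spaces or tabs inside a line.

```python
import re


def prepare_text(raw: str) -> str:
    """Normalize line endings, trim each line, and drop empty lines."""
    # Treat Windows and old Mac line endings the same as "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        # Collapse runs of spaces/tabs, then trim both ends.
        # str.strip() also removes full-width spaces (U+3000) used for indents.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    # End with a newline so the last line can also match a "...\n" search.
    return "\n".join(lines) + "\n" if lines else ""
```

After this, every remaining line break sits directly after the last visible character of a line, which is what the search relies on.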

icearch · 05-27-2026, 10:07 PM

I think this can apply to English too.

Start the same:

find:
([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n

and replace it with:
\1@@@\n

That tags the end lines.

then find:
([^@])\n

replace with:
\1###\n

That tags the non-end lines.

then remove every \n, and replace:

@@@ with \n
and
### with a space

This leaves a hyphen next to a space wherever a word was split across two lines, so replace that with nothing.

That's it. I think it works better than matching characters: every line gets tagged as either the end of a paragraph or not.
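For anyone who wants to try the whole sequence outside an editor, here is a rough Python sketch of the steps in this post. The function name `rejoin` and the `joiner` parameter are hypothetical. The remove-every-`\n` step is taken from the step-by-step description later in this thread, and the hyphen handling (dropping a hyphen that sits right at a join) is one reading of the last step, so treat it as an assumption.

```python
import re

# The end-of-paragraph search from the post, copied as-is.
END_PUNCT = r"""([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n"""


def rejoin(text: str, joiner: str = "") -> str:
    # 1. Tag lines that end in one of the listed marks as paragraph ends.
    text = re.sub(END_PUNCT, r"\1@@@\n", text)
    # 2. Tag every other line as a broken (non-end) line.
    text = re.sub(r"([^@])\n", r"\1###\n", text)
    # 3. Remove every line break; only the tags remain.
    text = text.replace("\n", "")
    # 4. Assumption: a hyphen right before a join is a split word, so drop it.
    text = text.replace("-###", "")
    # 5. Turn the tags back into paragraph breaks and joins.
    text = text.replace("@@@", "\n")
    return text.replace("###", joiner)
```

Use `joiner=""` for Chinese and `joiner=" "` for English. Note that step 4 will also glue together genuinely hyphenated words that happened to break at the hyphen ("well-" / "known" becomes "wellknown").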

ElMiko · 05-29-2026, 08:59 AM

I have a similar approach, given that the start of the next paragraph is less determinative than the end of the preceding one with regards to rejoining incorrectly broken paragraphs.

But the search above looks like it would have thousands of false positives. Basically it's going to match the end of every single paragraph in the book. You might as well just search for </p>. This seems... inefficient, no?

FWIW, I use:

Code:

([a-z]|[a-z]-|,|(?<!nbsp|&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=</p>\s+<p[^>]*?>[a-z]))</p>\s+<p[^>]*?>

and replace it with

Code:

\1(followed by a space)

The gobbledygook above basically is looking for paragraphs that end in:

any lowercase letter
any lowercase letter that is followed by a hyphen
a comma
a semicolon
a closing curly quote preceded by a comma
Honorifics (Dr., Mr., Ms., etc.)
Single capital letters that are also words (A and I)
closing curly quotes, closing <i> tags, and em dashes that are followed by a paragraph that begins in a lower case letter.
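A hedged sketch of how the search above might be run from Python rather than Sigil. It assumes the search is meant to be read as one pattern ending in `</p>\s+<p[^>]*?>`. Python's built-in `re` module rejects lookbehinds whose alternatives differ in length, so `(?<!nbsp|&#160);` is written here as two separate lookbehinds; `rejoin_html` is a made-up name.

```python
import re

JOIN_HTML = re.compile(
    # What the paragraph may end in (group 1) ...
    r"([a-z]|[a-z]-|,|(?<!nbsp)(?<!&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]"
    r"|(”|—|</i>)(?=</p>\s+<p[^>]*?>[a-z]))"
    # ... followed by the paragraph boundary that gets removed.
    r"</p>\s+<p[^>]*?>"
)


def rejoin_html(html: str) -> str:
    # Keep whatever ended the first paragraph and add a single space.
    return JOIN_HTML.sub(r"\1 ", html)
```

A paragraph that ends in an ordinary full stop is left alone, while one that ends in a lowercase letter or an honorific is merged with the next.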

This will not catch everything, obviously. And it'll rejoin things like verse which should not be rejoined. Which is why, to JSWolf's overstated point in the other thread, any kind of automation like this needs to be combined with a quick visual page-by-page scan of the original doc (for example, a pdf), to find stragglers (mostly, where the last line of a page ends in a terminal punctuation, but it isn't actually the end of the paragraph) and to un-join verse and other idiosyncratically formatted blocks.

And some errors are just going to be unavoidable without an absurd (and unhelpful) level of obsessiveness... but this is true of physical media as well.

PS - this is also paired with dozens of other searches, some of which can help quickly identify other cases of incorrectly broken paragraphs, such as searching for quotations that haven't been appropriately closed. e.g.:

Code:

“Watch out! Stay away from there! It's not safe.

Stay close to me,” he said.
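As an illustration of that kind of follow-up search, here is a small Python sketch that flags lines whose curly quotes do not pair up. It is not the actual search used above, just one simple way to get a similar list; the function name is invented, and it assumes one paragraph per line.

```python
def unbalanced_quote_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for lines whose curly quotes do not pair up."""
    flagged = []
    for number, line in enumerate(text.split("\n"), start=1):
        # A complete piece of dialogue has as many opening as closing quotes.
        if line.count("“") != line.count("”"):
            flagged.append((number, line))
    return flagged
```

The hits are only candidates: dialogue that legitimately runs over several paragraphs opens a quote without closing it and will be flagged too.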

icearch · 05-29-2026, 09:20 AM

Quote:

Originally Posted by ElMiko

But the search above looks like it would have thousands of false positives. Basically it's going to match the end of every single paragraph in the book. You might as well just search for </p>. This seems... inefficient, no?

To be clear, I'm using plain text to do the correction. Working on this with lots of tags would be a pain in the a$$.

icearch · 05-29-2026, 09:25 AM

Quote:

Originally Posted by ElMiko

PS - this is also paired with dozens of other searches, some of which can help quickly identify other cases of incorrectly broken paragraphs, such as searching for quotations that haven't been appropriately closed. e.g.:

Code:

“Watch out! Stay away from there! It's not safe.

Stay close to me,” he said.

Yeah... I hadn't really thought about quotations, but on second thought, since my approach is to find the ends and then join everything else, quotations should be fine.

That's why I chose to find the ends: there are just too many different things that need to be joined. The problem with finding ends is when a break happens to fall right at a punctuation mark inside a quotation.

But as we know, you can never be 100% sure, so I think it's tolerable. At least that's about as much as a few lines of regex can do.

Turtle91 · 05-29-2026, 01:14 PM

Sigil’s ability to give a list of changes, along with a short excerpt of words before and after, and then allow you to quickly select (or deselect) which changes you wish… would make it easier to quickly check for those few outliers.
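Outside Sigil, a similar review list can be approximated with a few lines of Python. This is not how Sigil implements it, just a sketch; `preview_changes` and its `width` parameter are invented for the example.

```python
import re


def preview_changes(pattern: str, replacement: str, text: str, width: int = 15):
    """List each match with some context and the text that would replace it."""
    rows = []
    for m in re.finditer(pattern, text):
        before = text[max(0, m.start() - width):m.start()]
        after = text[m.end():m.end() + width]
        # m.expand() fills backreferences such as \1 for this particular match.
        rows.append((before, m.group(0), m.expand(replacement), after))
    return rows
```

Each row holds the context before the match, the matched text, what it would become, and the context after it, so the outliers can be skimmed before anything is replaced.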

ElMiko · 05-29-2026, 01:19 PM

Quote:

Originally Posted by icearch

To be clear, I'm using plain text to do the correction. Working on this with lots of tags would be a pain in the a$$.

I mean, I guess what I'm saying then is that you could just as easily search for \p{P}\n and it's effectively doing the same thing that your original search is doing: namely, finding the end of every single line in the text file that ends in a punctuation. (NOTE: this is in the context of English punctuation; I haven't tested to what extent \p{P} matches non-English punctuation). It's still going to be thousands and thousands of matches.

one thing you could do is replace wherever i have "</p>\s+<p[^>]*?>" with "\n" and you should get the same effect in the context of a plain text file. (the italics tag will obviously be ignored).

In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter... i.e. the most common type of broken line in pdf conversions. My search is super old so I was still using "[a-z]" for capturing lower case letters when i first started iterating it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L, followed by lowercase L, not uppercase I). Or even just "\p{L}"...
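One way to check the open question about \p{P} and non-English punctuation without an editor: Python's standard `unicodedata` module reports each character's Unicode general category, and \p{P} corresponds to the categories that start with `P`. The sketch below runs the characters from the explicit class in the first post through it. How a particular regex engine implements \p{P} is assumed to follow these categories.

```python
import unicodedata

# Characters from the explicit class in the original search.
LISTED = ".．。?？!！>》)）]】}…:：—'\"’”|」』@"

# Map each listed character to its general category, e.g. "Po" or "Pe".
categories = {ch: unicodedata.category(ch) for ch in LISTED}

# Characters from the list that a \p{P} search would be expected to miss.
not_punctuation = [ch for ch, cat in categories.items() if not cat.startswith("P")]
```

With the Unicode data that ships with Python, the full-width and CJK marks in the list all fall under the P categories; the ones that do not are `>` and `|`, which Unicode classes as math symbols.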

icearch · 05-29-2026, 04:31 PM

Quote:

Originally Posted by ElMiko

I mean, I guess what I'm saying then is that you could just as easily search for \p{P}\n and it's effectively doing the same thing that your original search is doing: namely, finding the end of every single line in the text file that ends in a punctuation. (NOTE: this is in the context of English punctuation; I haven't tested to what extent \p{P} matches non-English punctuation). It's still going to be thousands and thousands of matches.

one thing you could do is replace wherever i have "</p>\s+<p[^>]*?>" with "\n" and you should get the same effect in the context of a plain text file. (the italics tag will obviously be ignored).

In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter... i.e. the most common type of broken line in pdf conversions. My search is super old so I was still using "[a-z]" for capturing lower case letters when i first started iterating it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L, followed by lowercase L, not uppercase I). Or even just "\p{L}"...

I mean... yes? I don't quite understand what you are saying here.

You said that I don't capture the most common type of broken line, the kind that ends with a letter, but that is exactly what I'm going to achieve.

Because I'm going to:

1. Find any line that looks like the last line of a paragraph, i.e. one that ends with punctuation (a non-broken line).

2. Mark that end line, so the broken lines are left unmarked.

3. Remove every \n, so everything merges into one giant paragraph, with special markers showing where each paragraph is supposed to end.

4. Replace the end markers with \n.

By not trying to find the broken lines, I avoid having to distinguish all the weird conditions. After all, the ultimate goal is to rearrange the paragraphs, so merging everything together and then separating it again works fine too.
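The four steps above amount to "delete every line break that is not preceded by an end mark". Assuming the text has been prepared as described in the first post, that can be sketched as a single substitution with a negative lookbehind. `join_broken_lines` and `joiner` are hypothetical names, and Python is used only for illustration.

```python
import re

# Same characters as the explicit end-of-paragraph class in the first post.
END_CHARS = r"""\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@"""


def join_broken_lines(text: str, joiner: str = "") -> str:
    # A line break is replaced only when the character before it
    # is NOT one of the end marks; all other line breaks are kept.
    return re.sub(rf"(?<![{END_CHARS}])\n", joiner, text)
```

Whether one substitution or a tagged sequence is easier to check is a matter of taste; the tagged version lets you look at the intermediate result, this one has no markers that could collide with the text.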

As for getting thousands of results, yes...? Finding the broken lines would give ten times more results, so I really don't get what you mean.

Hope the best.

ElMiko · 05-29-2026, 07:27 PM

Quote:

Originally Posted by icearch

You said that I don't capture the most common type of broken line, the kind that ends with a letter, but that is exactly what I'm going to achieve.

That's not quite what I said.

I said your regex wouldn't capture the most common type of broken line: namely, the kind of line that ends in a letter.

For example:

Code:

Some day, when we have enough money, we will go to Disney

World.

If what you are suggesting is going through and manually marking off every paragraph break one-by-one, that's incredibly inefficient. I honestly think it would take a full day. A pretty average 300-page paperback will have around 1800 paragraphs and around 150-200 erroneous breaks in your standard OCR conversion (it'll be many times more if it's a straight pdf conversion). So at an absolute minimum, you're talking about evaluating around 2000 matches one-by-one, and depending on the kind of conversion you're talking about it could be several times that.

This is simply not feasible. I don't just mean it's a lot of work. I mean it's impossible to do effectively. This is for the same reason airport security checkers become ineffective after about 20 minutes; your brain is just going to start "autocorrecting" based on expectations.

This is also why you need to find a way to automate the process as much as you can. In my experience (and I've done literally thousands of these), any single search that requires you to individually check more than 180 results is going to cause a noticeable quality drop-off in your final product. It's the difference between doing a readthrough and catching an error every 20-50 pages, and doing a readthrough and finding errors every second or fifth page. We're talking an order of magnitude.

I'm not here to yuck your yum. You floated a proposed solution, asking for feedback as to whether it's possible. The answer is: Yes, your search is not an inherently broken search. It should return every single instance of a line that ends in the punctuation marks you listed between those brackets (although, like I said, \p{P} may be more comprehensive as a regex solution than listing each punctuation mark that you can think of individually).

But, as to whether this is an effective way to correct erroneously broken lines, I think that the originally proposed regex solution in the first post has real issues. If the ONLY thing you had to worry about in a given document were correctly reflecting paragraph breaks, I'd STILL say this approach would be problematic. But when you consider that most files that require you to rejoin erroneously broken lines also have a whole host of other issues (often related to the OCR process), it is—strictly in my opinion—a misallocation of mental (and temporal) resources to spotcheck every single instance of a line ending in a punctuation mark.

icearch · 05-29-2026, 08:23 PM

I still don't get what you mean. Of course I'm not going to mark the last line of every paragraph manually.

Considering our language barrier, I'll show you with some random text.

1. This is some random text from a novel PDF; its lines end in all sorts of things. Here it is with my first regex finding the paragraph-end lines:

[Image: 01.png]

2. After the first replace:

[Image: 02.png]

3. Tag every other line with a different marker:

[Image: 03.png]

4. After that replace:

[Image: 04.png]

5. Remove every \n:

[Image: 05.png]

6. After that:

[Image: 06.png]

7. Put the wanted \n back:

[Image: 07.png]

8. Result:

[Image: 08.png]

9. Put the spaces back:

[Image: 09.png]

10. Final result:

[Image: 10.png]

All that remains is to wrap each paragraph in p tags.

I think it pretty much does what it should? I can't understand why you said it can't handle the basic broken lines that end with letters.

As to why I list every punctuation mark myself instead of using \p{P}, that's because \p{P} also matches:

1. the opening half of a pair, namely ( [ {

and

2. non-ending marks like : , ; -

and such, which you don't want.

Those are quite likely to show up at the end of a broken line, and matching them is totally avoidable.

ElMiko · 05-29-2026, 11:00 PM

I do think the language barrier is part of what's at the heart of the confusion. What I said was that your original regex

Code:

([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n

will not match any lines that end in a letter. And it doesn't. It can't. That search is exclusively for punctuation. One of the subsequent NINE steps (where you just delete all the "\n" matches) is what functionally connects lines that end in letters (along with everything else).

But the main point I was trying to make is that your approach requires you to go through and manually designate which lines should be paragraph breaks, one-by-one. At a book-scale (rather than less than a dozen lines in your example) that's going to be incredibly inefficient and time-consuming... and, as I said, massively error-prone because of the sheer volume of manual checks that you will have to perform.

You asked for feedback on the regex; I've explained why I think it's inefficient. You think that performing 10 searches (one of which requires you to individually check literally thousands of matches in the span of a book) is more efficient for your workflow than performing 1. Fine. You think that it's better to type out each punctuation mark you want rather than writing exclusion regex for the handful that you don't want (e.g. "(\p{P})(?<![,:;\-\[\({])\n"). Also, fine.
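A sketch of the exclusion idea in Python, whose built-in `re` has no \p{P}. Here `unicodedata` stands in for it; `ends_paragraph` and `EXCLUDED` are invented names. One thing to watch in any engine: inside square brackets an unescaped hyphen between two characters forms a range, so the hyphen needs escaping or moving to the end of the class.

```python
import re
import unicodedata

# Punctuation that should NOT count as a paragraph end (from the posts above).
EXCLUDED = set(",:;-[({")


def ends_paragraph(line: str) -> bool:
    """True if the last character is Unicode punctuation and not excluded."""
    if not line:
        return False
    last = line[-1]
    return unicodedata.category(last).startswith("P") and last not in EXCLUDED


# Why the hyphen placement matters: ";-[" written as a range covers every
# character from ";" to "[", which includes "?" and the capital letters.
range_swallows_question_mark = re.fullmatch(r"[;-\[]", "?") is not None
escaped_hyphen_is_literal = re.fullmatch(r"[;\-\[]", "?") is None
```

The exclusion set here only lists ASCII marks, so for Chinese text the full-width counterparts (for example the full-width comma and colon) would have to be added by hand.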

And when I say, "fine", I mean it sincerely. This is a hobby (at least for folks like you and me); we should do only what makes us happy. If your regex solution is contributing to your ebook creation being more enjoyable, then do it. Your approach may even change over time as your workflow evolves... or not! But when you asked for feedback, I gave it. As I said, I agree fundamentally with the view that when trying to join broken lines, looking for how a <p> (or line, in your case) begins is probably less effective on the whole than looking for how it ends. I just don't fully endorse your particular approach. I know that for my workflow, I'd grow frustrated after cycling through just ONE chapter, let alone a whole book. But that's my workflow and my character.

---

P.S. I promise, what follows is my final attempt to articulate what I see as the fatal flaw in your approach:

In your example, the excerpt you used had 102 words. And your initial regex search matched 7 of the 8 lines that you included in your example.

If we extrapolate from that to a full length novel (the average wordcount for sci-fi/fantasy is 90-120 thousand words) you would have to manually check between 6,176 matches (i.e. 90,000 divided by 102 multiplied by 7) and 8,235 matches (i.e. 120,000 divided by 102 multiplied by 7). If, as you seem to be indicating, you are dealing with a PDF-to-EPUB conversion where every single pdf line is broken, this is the scale of match-by-match work that your regex will require.
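The extrapolation above, written out as a small Python check. The helper name is made up; the numbers are the ones quoted in this post.

```python
def estimated_matches(total_words: int, sample_words: int = 102, sample_matches: int = 7) -> int:
    """Scale the match rate of the 102-word sample up to a whole book."""
    # Integer arithmetic, rounded down to whole matches.
    return total_words * sample_matches // sample_words


low = estimated_matches(90_000)    # 6176
high = estimated_matches(120_000)  # 8235
```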

icearch · Yesterday, 03:04 AM

That's why I'm confused. I did not go through every line by hand, one by one! I just don't know why you think that!

And did I do something wrong? Running your code does not work in Sigil, and putting it through a regex checker reports multiple errors.

[Image: 21.png]

[Image: 22.png]

PS.

I'm very grateful for your feedback, and I genuinely wish everyone the best, but since I just can't get what you are saying, I keep explaining. I do not need the broken lines to be fixed 100% correctly; I just want every paragraph to be reasonably whole and readable.

I do not mark them one by one, and I do not check them one by one. I ran my code on some full-length Chinese novels and it seems to be fine. I have not tried it on a full-length English book, though.

ElMiko · Yesterday, 07:19 AM

When you mark a line with @@@, you are indicating that it's a paragraph break, right? (this is so that you can re-break it at the end of your regex cycle.)

You remember my example of dialogue that's been broken at a terminal punctuation?

Code:

“Watch out! Stay away from there! It's not safe.

Stay close to me,” he said.

or how about:

Code:

“Watch out! Stay away from there! It's not safe,”

he warned, before turning around and running away.

or how about:

Code:

At this point, I looked beseechingly at Mr.

Jones, and said, “He left us!”

In all three of the cases above, doing a bulk search and replace with your regex will add "@@@" to the end of the first line. Which means that when you do the final step of your regex sequence (replacing "@@@" with "\n") you'll just end up rebreaking what should be an unbroken paragraph.

The issue with the assumption built into your original regex (if you don't have a separate regex to deal with the kinds of situations I outlined above) is that it incorrectly treats all terminal punctuation marks as necessarily preceding a paragraph break, and it treats several kinds of punctuation marks as terminal when they aren't necessarily terminal at all (e.g. ” does not always denote the end of a sentence, much less the end of a paragraph).

The only way to avoid this would be to check one-by-one, inserting the "@@@" manually when you determine that it really IS the end of a paragraph. And doing THAT would take forever.
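These three cases are easy to reproduce. The sketch below (Python, with the search copied from the first post and a made-up helper name) applies only the first tagging step and checks whether the first line of each example ends up marked as a paragraph end.

```python
import re

END_PUNCT = r"""([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n"""

SAMPLES = [
    "“Watch out! Stay away from there! It's not safe.\nStay close to me,” he said.\n",
    "“Watch out! Stay away from there! It's not safe,”\nhe warned, before turning around and running away.\n",
    "At this point, I looked beseechingly at Mr.\nJones, and said, “He left us!”\n",
]


def first_line_tagged(text: str) -> bool:
    # Apply the tagging replace, then look at how the first line now ends.
    tagged = re.sub(END_PUNCT, r"\1@@@\n", text)
    return tagged.split("\n")[0].endswith("@@@")


results = [first_line_tagged(sample) for sample in SAMPLES]
```

All three first lines get the @@@ marker, so the final step puts the line break back in the middle of each paragraph.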

---

For reasons that escape me, my regex got corrupted at some point on MobileReads, and it replaced some elements with asterisks that aren’t in the original regex. In any event, the original regex I shared with you wouldn't work because it's designed for html, not plain text. (also, I'm not sure what text editor you're using... I use Sigil... but then again I only edit in html.)

I modified it for plain text here:

Code:

(\p{Ll}|\p{Ll}-|,|(?<!nbsp|&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=\n\p{Ll}))\n

when paired with a replace value of "\1[space]" (as in an actual blank space, not the text "[space]"), it will rejoin the following text at the yellow highlights in the image below.
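For readers without Sigil, a rough Python port of the plain-text search. Two changes are forced by Python's built-in `re` and are assumptions of this sketch: \p{Ll} is approximated with `[a-z]` (ASCII only), and the mixed-length lookbehind is split in two. `rejoin_plain` is a hypothetical name.

```python
import re

JOIN_PLAIN = re.compile(
    # What the broken line may end in (group 1) ...
    r"([a-z]|[a-z]-|,|(?<!nbsp)(?<!&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]"
    r"|(”|—|</i>)(?=\n[a-z]))"
    # ... followed by the line break that gets replaced.
    r"\n"
)


def rejoin_plain(text: str) -> str:
    # Replace value: group 1 followed by an actual blank space.
    return JOIN_PLAIN.sub(r"\1 ", text)
```

Lines ending in a lowercase letter, a comma plus closing quote, or an honorific are joined; a line that ends in an ordinary full stop is left as it is, even when the break there is wrong.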

[Image: RejoiningLines.jpg]

Note that it will not rejoin at two points (marked with red highlights) where the lines OUGHT to be joined. But your regex sequence will have the same problem, too.

To solve the second red highlight I use a regex search that specifically targets broken/incomplete opening/closing quotations.

To solve the first red highlight... as far as I can tell, you can't. Except by going line by line and fixing it manually.

Incidentally, this last point goes to something i said in a related thread recently: this is one of many issues with trying to fix direct PDF-to-EPUB conversions. You're much better off running OCR on the original image and producing a pdf reference copy and a separate html/epub copy. Most OCR software will be smart enough to recognize the vast majority (90%) of correct paragraph breaks.

EDIT: it corrupted the code again... and i've fixed it again. Hopefully it stays fixed this time...

icearch · Yesterday, 09:27 AM

Sadly, your code still does not work correctly. Maybe try pasting it into a txt file and attaching it to the post. I'd like to try it.

You make a good point that a period at the end of a non-end line would cause problems. My regex surely does not cover that kind of thing.

And the in-quotation case may be fixable: I can add another regex that targets broken quotations and run it first. But I think a quotation split into two in a row does not affect the reading flow that much.

Mr. is very annoying, though. I need to cover that.

As for one regex to fix them all, I do think breaking them apart can reduce errors, and Sigil has automation sequences. But that's personal taste.

As for the inevitable cases, I can only say they are the fault of age or poor formatting choices. We do not have to punish ourselves over that.

ElMiko · Yesterday, 11:21 AM

Quote:

Originally Posted by icearch

Sadly, your code still does not work correctly. Maybe try pasting it into a txt file and attaching it to the post. I'd like to try it.

See the attached txt file. As I said before, I wasn't able to discern from your screenshots what text editor you were using, but that may be contributing to the issue if they are using a different regex engine (or different version of the same regex engine).

Also, FYI, for Chinese characters you would probably want to replace every instance in my search of "\p{Ll}" with either "\p{Han}" or \w (depending on the regex engine). Also, my memory of Chinese grammar and punctuation is 20 years stale, so I'm not sure to what degree all the same punctuation assumptions built into my regex will apply to Chinese style guides.
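A minimal sketch of what that substitution could look like for Chinese in Python. Python's built-in `re` does not support \p{Han}, so the basic CJK Unified Ideographs range U+4E00 to U+9FFF stands in for it. That range, and the added full-width comma, enumeration comma and semicolon, are assumptions of this example and not part of the attached search.

```python
import re

# Approximation of \p{Han}: the basic CJK Unified Ideographs block only.
HAN = r"[\u4e00-\u9fff]"

# A line break is dropped when the line ends in a Han character or a
# non-terminal full-width mark AND the next line starts with a Han character.
JOIN_HAN = re.compile(rf"({HAN}|，|、|；)\n(?={HAN})")


def rejoin_chinese(text: str) -> str:
    # No space is inserted, since Chinese does not separate words with spaces.
    return JOIN_HAN.sub(r"\1", text)
```

Rare characters outside the basic block would not be matched, so an engine with real \p{Han} support is the safer choice where it is available.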

Quote:

You make a good point that a period at the end of a non-end line would cause problems. My regex surely does not cover that kind of thing.

And neither does mine, 100% of the time. As I said, I don't think there's a solution other than being completely obsessive about it, and I'm just not that obsessive!

Quote:

And the in-quotation case may be fixable: I can add another regex that targets broken quotations and run it first. But I think a quotation split into two in a row does not affect the reading flow that much.

Like I said, I also have a separate search that looks for potentially incomplete quotes, although I run it after my global "re-joining" search.

Quote:

Mr. is very annoying, though. I need to cover that.

See also Dr., Ms., Mrs.... Technically there are other honorifics (mostly military), but I just haven't gone through the trouble of adding them to my regex.

Quote:

As for one regex to fix them all, I do think breaking them apart can reduce errors, and Sigil has automation sequences. But that's personal taste.

Absolutely. Trying to do too much with one regex is a recipe for disaster. Every time I polish a book, it entails dozens of regex searches (including one catch-all search that is basically a combination of over a hundred discrete common OCR error patterns). It's a constant learning process to figure out where the line is between too many searches and not enough... For me it's usually about how many matches any given search returns. If it's returning more than 200 matches, it usually means I need to go back and modify it because it's being too greedy.

05-27-2026, 10:07 PM	#2
icearch Groupie Posts: 156 Karma: 2000 Join Date: Nov 2025 Device: none	I think this can apply to english too. Start the same: find: ([\.．。\?？\!！>》\)）\]】}…:：—'"’”\\|」』@])\n and replace it with: \1@@@\n and so we tagged the end lines. then find: ([^@])\n replace with: \1###\n so we tagged non-end lines. and replace: @@@ to \n and ### to space but we will leave hyphen next to a space, so replace it with non. That's it, I think it works better than matching characters. Every line gets tagged either end of paragraph or not. Last edited by icearch; 05-27-2026 at 10:13 PM.

05-29-2026, 08:59 AM	#3
ElMiko Fanatic Posts: 569 Karma: 65460 Join Date: Jun 2011 Device: Kindle Voyage, Boox Go 7	I have a similar approach, given that the start of the next paragraph is less determinative than the end of the preceding one with regards to rejoining incorrectly broken paragraphs. But the search above looks like it would have thousands of false positives. Basically it's going to match the end of every single paragraph in the book. You might as well just search for </p>. This seems... inefficient, no? FWIW, I use: Code: ([a-z]\|[a-z]-\|,\|(?<!nbsp\|&#160);\|,”\|[MD][rs]\.\|Mrs\.\|\b[AI]\|(”\|—\|</i>)(?=</p>\s+<p[^>]?>[a-z])) </p>\s+<p[^>]?> and replace it with Code: \1(followed by a space) The gobbledygood above basically is looking for paragraphs that end in: any lowercase letter any lowercase letter that is followed by a hyphen a comma a semicolon a closing curly quote preceded by a comma Honorifics (Dr., Mr., Ms., etc.) Single capital letters that are also words (A and I) closing curly quotes, closing <i> tags, and em dashes that are followed by a paragraph that begins in a lower case letter. This will not catch everything, obviously. And it'll rejoin things like verse which should not be rejoined. Which is why to JSWolf's overstated point in the other thread, any kind of automation like this needs to be combined with a quick visual page-by-page scan of the orginal doc (for example, a pdf), to find stragglers (mostly, where the last line of a page ends in a terminal punctuation, but it isn't actually the end of the paragraph) and to un-join verse and other idiosyncratically formatted blocks. And some errors are just going to be unavoidable without an absurd (and unhelpful) level of obsessiveness... but this is true of physical media as well. PS - this is also paired with dozens of other searches, some of which can help quickly identify other cases of incorrectly broken paragraphs, such as searching for quotations that haven't been appropriately closed. 
e.g.: Code: “Watch out! Stay away from there! It's not safe. Stay close to me,” he said. Last edited by ElMiko; Yesterday at 06:18 AM.

05-29-2026, 08:23 PM	#10
icearch Groupie Posts: 156 Karma: 2000 Join Date: Nov 2025 Device: none	I still didn't get what you mean, of course I'm not going to mark every end line of paragraph manually. Consider our language barrier, I'm showing you with some random text. 1. This is some random text from novel pdf, it contains lines ends with lots of things. And with my first regex to find any end-paragraph lines. 2. After first replace: 3. Get every line else with another tag: 4. Done with That: 5. Remove every \n: 6. After that: 7. Get desired \n back: 8. Result: 9. Get space back 10. Final result: The rest is to place each paragraph in p tags. I think it pertty much done what it should? I can't understand why you said it can't handle basic brokens that ends with letters. As to why I need to come up with every punctuation instead of using {P}, that's because it can match : 1. former part of a pair, namly ( [ { and 2. non-end things like : , ; - and such. Which you don't want. Which is highly possible when a broken line ends, and totally avoidable. Last edited by icearch; 05-29-2026 at 10:22 PM.

05-29-2026, 11:00 PM	#11
ElMiko Fanatic Posts: 569 Karma: 65460 Join Date: Jun 2011 Device: Kindle Voyage, Boox Go 7	I do think the language barrier is part of what's at the heart of the confusion. What I said was that your original regex Code: ([\.．。\?？\!！>》\)）\]】}…:：—'"’”\\|」』@])\n will not match any lines that end in a letter. And it doesn't. It can't. That search is exclusively for punctuation. One of the subsequent NINE steps (where you just delete all the "\n" matches) is what functionally connects lines that end in letters (along with everything else). But the main point I was trying to make is that your approach requires you to go through and manually designate which lines should be paragraph breaks, one-by-one. At a book-scale (rather than less than a dozen lines in your example) that's going to be incredibly inefficient and time-consuming... and, as I said, massively error-prone because of the sheer volume of manual checks that you will have to perform. You asked for feedback on the regex; I've explained why I think it's ineffecient. You think that performing 10 searches (one of which requires you to individually check literally thousands of matches in the span of a book) is more efficient for your workflow than performing 1. Fine. You think that it's better to type out each punctuation mark you want rather than writing exclusion regex for the handful that you don't want (e.g. "(\p{P})(?<![,:;-\[\({])\n"). Also, fine. And when I say, "fine", I mean it sincerely. This is a hobby (at least for folks like you and me); we should do only what makes us happy. If your regex solution is contributing to your ebook creation being more enjoyable, then do it. Your approach may even change over time as your workflow evolves... or not! But when you asked for feedback, I gave it. As I said, I agree fundamentally with the view that when trying to join broken lines, looking for how a <p> (or line, in your case) begins is probably less effective on the whole than looking for how it ends. 
I just don't fully endorse your particular approach. I know that for my workflow, I'd grow frustrated after cycling through just ONE chapter, let alone a whole book. But that's my workflow and my character. --- P.S. I promise, what follows is my final attempt to articulate what I see as the fatal flaw in your approach: In your example, the excerpt you used had 102 words. And your initial regex search matched 7 of the 8 lines that you included in your example. If we extrapolate from that to a full length novel (the average wordcount for sci-fi/fantasy is 90-120 thousand words) you would have to manually check between 6,176 matches (i.e. 90,000 divided by 102 multiplied by 7) and 8,235 matches (i.e. 120,000 divided by 102 multiplied by 7). If, as you seem to be indicating, you are dealing with a PDF-to-EPUB conversion where every single pdf line is broken, this is the scale of match-by-match work that your regex will require. Last edited by ElMiko; 05-29-2026 at 11:32 PM.

Yesterday, 03:04 AM	#12
icearch Groupie Posts: 156 Karma: 2000 Join Date: Nov 2025 Device: none	That's why I'm confusing, I did not go through everyline with my hand one by one! I just don't know why do you think that way! And did I do something wrong? Because running your code does not work in Sigil, putting it through regex check it says multiple error. ps. I'm very greatful for your feed back, and genually hope the best of every one, but since I just can't get what you are saying, so I keep explaining. I do not want the broken lines to be fixed 100% correct, I just want all paragraph to be somewhat whole and readable. I do not do marking them one by one, and I do not check them one by one. I ran my code through some full length chinese novel and it seems to be fine. Did not tried full length english though. Last edited by icearch; Yesterday at 03:20 AM.

05-27-2026, 04:46 AM	#1
icearch Groupie Posts: 156 Karma: 2000 Join Date: Nov 2025 Device: none	Using regex to fix broken paragraph in Chinese I have some thought, I need someone familiar with regex and text to see if this is doable. The logic here is simple: I am not going to be 100% correct, just to get rid of annoying breaks. Since chinses do not have space to identify words, and no capital to identify beginning of sentence, that leads me to think the other way round: What can be used to identify the ending of a sentence? That be: punctuations! So, I will regex search a punctuation and a line break right next to each other, that will be 99% the ending of a paragraph! And the rest is easy. So this is what I come up with to find the ending: ([\.．。\?？\!！>》\)）\]】}…:：—'"’”\\|」』@])\n and replace it with: \1@@@\n and replace: @@@ to \n Of course, you need to prepare the text first by removing excess space and empty lines. So what do you guys think? Is there anything to improve?

05-29-2026, 01:14 PM	#6
Turtle91 A Hairy Wizard Posts: 3,488 Karma: 21099999 Join Date: Dec 2012 Location: Charleston, SC today Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire	Sigil’s ability to give a list of changes, along with a short excerpt of words before and after, and then allow you to quickly select (or deselect) which changes you wish… would make it easier to quickly check for those few outliers.

Yesterday, 07:19 AM	#13
ElMiko Fanatic Posts: 569 Karma: 65460 Join Date: Jun 2011 Device: Kindle Voyage, Boox Go 7	When you mark a line with @@@, you are indicating that it's a paragraph break, right? (This is so that you can re-break it at the end of your regex cycle.) You remember my example of dialogue that's been broken at a terminal punctuation mark?

Code:

“Watch out! Stay away from there! It's not safe. Stay close to me,” he said.

or how about:

Code:

“Watch out! Stay away from there! It's not safe,” he warned, before turning around and running away.

or how about:

Code:

At this point, I looked beseechingly at Mr. Jones, and said, “He left us!”

In all three of the cases above, doing a bulk search and replace with your regex will add "@@@" to the end of the first line. Which means that when you do the final step of your regex sequence (replacing "@@@" with "\n") you'll just end up rebreaking what should be an unbroken paragraph.

The issue with the assumption built into your original regex (if you don't have a separate regex to deal with the kinds of situations I outlined above) is that it incorrectly treats all terminal punctuation marks as necessarily preceding a paragraph break, and it treats several kinds of punctuation marks as terminal when they aren't necessarily terminal at all (e.g. ” does not always denote the end of a sentence, much less the end of a paragraph). The only way to avoid this would be to check one-by-one, inserting the "@@@" manually when you determine that it really IS the end of a paragraph. And doing THAT would take forever.

---

For reasons that escape me, my regex got corrupted at some point on MobileRead, and some elements were replaced with asterisks that aren't in the original regex. In any event, the original regex I shared with you wouldn't work because it's designed for HTML, not plain text. (Also, I'm not sure what text editor you're using... I use Sigil... but then again I only edit in HTML.)

I modified it for plain text here:

Code:

(\p{Ll}|\p{Ll}-|,|(?<!nbsp|&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=\n\p{Ll}))\n

When paired with a replace value of "\1[space]" (as in an actual blank space, not the text "[space]"), it will rejoin the following text at the yellow highlights in the image below. Note that it will not rejoin at two points (marked with red highlights) where the lines OUGHT to be joined. But your regex sequence will have the same problem, too.

To solve the second red highlight I use a regex search that specifically targets broken/incomplete opening/closing quotations. To solve the first red highlight... as far as I can tell, you can't. Except by going line by line and fixing it manually.

Incidentally, this last point goes to something I said in a related thread recently: this is one of many issues with trying to fix direct PDF-to-EPUB conversions. You're much better off running OCR on the original image and producing a PDF reference copy and a separate HTML/EPUB copy. Most OCR software will be smart enough to recognize the vast majority (90%) of correct paragraph breaks.

EDIT: it corrupted the code again... and I've fixed it again. Hopefully it stays fixed this time...

Last edited by ElMiko; Yesterday at 07:45 AM.
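
To compare the two approaches side by side outside an editor, here is a rough Python sketch. Several things in it are assumptions rather than anything posted in the thread: the position of the line breaks in the two samples is invented for illustration, the tags are assumed to swallow the line break that follows them, and the join pattern is an adaptation for Python's standard `re` module (`[a-z]` stands in for `\p{Ll}`, and the lookbehind is split in two because `re` only accepts fixed-width lookbehinds).

```python
import re

# Character class copied from the first post in this thread.
END_PUNCT = re.compile(r"""([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n""")

# Stdlib adaptation of the plain-text join regex (see the assumptions above).
JOIN = re.compile(
    r"([a-z]|[a-z]-|,|(?<!nbsp)(?<!&#160);|,”|[MD][rs]\.|Mrs\.|\b[AI]"
    r"|(?:”|—|</i>)(?=\n[a-z]))\n"
)


def tag_and_rejoin(text):
    """The tagging sequence: mark every line, then expand the marks."""
    text = END_PUNCT.sub(r"\1@@@\n", text)        # lines ending in punctuation
    text = re.sub(r"([^@])\n", r"\1###\n", text)  # every other line
    # Assumption: each tag also consumes the line break that follows it.
    return text.replace("@@@\n", "\n").replace("###\n", " ")


def join_only(text):
    """The join approach: only touch breaks that look mid-sentence."""
    return JOIN.sub(r"\1 ", text)


# Line-break positions below are invented for illustration.
dialogue = (
    "“Watch out! Stay away from there!\n"
    "It's not safe,” he warned, before\n"
    "turning around and running away.\n"
)
mister = (
    "At this point, I looked beseechingly at Mr.\n"
    "Jones, and said, “He left us!”\n"
)

for sample in (dialogue, mister):
    print(tag_and_rejoin(sample))
    print(join_only(sample))
```

With these samples, the tagging sequence leaves the break after "Mr." in place while the join pattern removes it, and neither approach repairs the break after "there!", which matches the point that some breaks can only be fixed by hand.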

Yesterday, 09:27 AM	#14
icearch Groupie Posts: 156 Karma: 2000 Join Date: Nov 2025 Device: none	Sadly, your code still does not work correctly. Maybe try pasting it into a .txt file and attaching it to the post; I'd like to try it. You made a good point that a period at the end of a non-final line would cause problems. My regex surely does not cover that kind of case. The quotation case may be fixable: I can add another regex that targets broken quotations and run it first. But I think that, together with two quotations in a row, it does not affect the reading flow that much. "Mr." is very annoying, though; I need to cover that. As for one regex to fix them all, I do think breaking the work into separate steps can reduce errors, and Sigil has automation sequences. But that's personal taste. As for the cases that are unavoidable, I can only say they are the fault of old or poor formatting choices. We do not have to punish ourselves over that.
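
The "separate steps" idea could look roughly like the Python sketch below: join breaks after abbreviations first, then repair quotations that were split across lines, and only then run the punctuation tagging. Everything here is an assumption for illustration. The abbreviation list is hypothetical, the quote repair only understands curly “ ” pairs, and it would wrongly merge multi-paragraph dialogue where each paragraph opens a quote without closing it.

```python
import re

# Character class copied from the first post in this thread.
END_PUNCT = re.compile(r"""([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n""")
# Hypothetical abbreviation list; extend it for the book at hand.
ABBREV = re.compile(r"\b(Mr|Mrs|Ms|Dr)\.\n")


def join_abbreviations(text):
    """Pass 1: assume a break right after 'Mr.' etc. is not a paragraph end."""
    return ABBREV.sub(r"\1. ", text)


def join_open_quotes(text):
    """Pass 2: keep joining lines while a “ has not been closed by a ”."""
    paragraphs, buffer = [], ""
    for line in text.splitlines():
        buffer = f"{buffer} {line}" if buffer else line
        if buffer.count("“") <= buffer.count("”"):
            paragraphs.append(buffer)
            buffer = ""
    if buffer:  # unclosed quote at the end: keep whatever was collected
        paragraphs.append(buffer)
    return "\n".join(paragraphs) + "\n"


def tag_and_rejoin(text):
    """Pass 3: the tagging sequence from earlier in the thread."""
    text = END_PUNCT.sub(r"\1@@@\n", text)
    text = re.sub(r"([^@])\n", r"\1###\n", text)
    # Assumption: each tag also consumes the line break that follows it.
    return text.replace("@@@\n", "\n").replace("###\n", " ")


def fix_breaks(text):
    """Run the passes in order, most specific first."""
    for step in (join_abbreviations, join_open_quotes, tag_and_rejoin):
        text = step(text)
    return text


# Line-break positions below are invented for illustration.
sample = (
    "At this point, I looked beseechingly at Mr.\n"
    "Jones, and said, “He left\n"
    "us!”\n"
    "“Watch out! Stay away from there!\n"
    "It's not safe,” he warned.\n"
)
print(fix_breaks(sample))
```

Keeping each pass as its own function mirrors running separate saved searches in sequence: a pass that misbehaves on a particular book can be dropped or reordered without touching the others.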
