#1
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
Using regex to fix broken paragraphs in Chinese
I have an idea, and I need someone familiar with regex and text processing to tell me whether it is doable.
The logic is simple: I am not trying to be 100% correct, just to get rid of the annoying breaks. Since Chinese has no spaces to separate words and no capital letters to mark the beginning of a sentence, I approached it the other way round: what can be used to identify the end of a sentence? Punctuation! So I run a regex search for a punctuation mark with a line break right after it, which will be the end of a paragraph 99% of the time. The rest is easy.
This is what I came up with to find the endings:
([\..。\??\!!>》\))\]】}…::—'"’”\|」』@])\n
and replace it with:
\1@@@\n
and then replace @@@ with \n.
Of course, you need to prepare the text first by removing excess spaces and empty lines.
So what do you guys think? Is there anything to improve?
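For readers who want to try the idea outside an editor, here is a minimal Python sketch. It assumes that the remaining line breaks are deleted between the two replacements (the poster spells this step out later in the thread). The function name `rejoin_chinese`, the sample text, and the shortened character class are all made up for illustration; the class is not the exact one from the post.

```python
import re

# Assumed subset of paragraph-ending punctuation (ASCII and full-width forms).
END_PUNCT = r"[.。?？!！…”’」』》）)]"

def rejoin_chinese(text: str) -> str:
    # Preparation: strip surrounding spaces and drop empty lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    text = "\n".join(lines) + "\n"
    # 1. Tag every line that ends in closing punctuation.
    text = re.sub(rf"({END_PUNCT})\n", r"\1@@@\n", text)
    # 2. Delete all line breaks, merging everything into one run.
    text = text.replace("\n", "")
    # 3. Turn the tags back into paragraph breaks.
    return text.replace("@@@", "\n")

sample = "今天天气很好，我们\n一起去公园。\n他说：“走吧\n！”\n"
print(rejoin_chinese(sample))
```

Lines that end in a character outside the class (here 们 and 吧) are simply glued to the next line, with no space inserted, which suits Chinese text.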

#2
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
I think this can apply to English too.
Start the same way. Find:
([\..。\??\!!>》\))\]】}…::—'"’”\|」』@])\n
and replace it with:
\1@@@\n
so the end lines are tagged. Then find:
([^@])\n
and replace it with:
\1###\n
so the non-end lines are tagged. Then replace @@@ with \n and ### with a space. That leaves a hyphen next to a space wherever a word was split across lines, so replace that with nothing.
That's it. I think it works better than matching characters: every line gets tagged as either the end of a paragraph or not.
Last edited by icearch; 05-27-2026 at 10:13 PM.
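The two-tag variant can be sketched in Python as follows. This is only an illustration under stated assumptions: the character class is a shortened, hypothetical subset of the one in the post, the function name `rejoin_english` and the sample text are invented, and all remaining line breaks are deleted before the tags are converted back.

```python
import re

# Assumed subset of the end-of-paragraph class from the post.
END_PUNCT = r"[.?!…”’)\]@]"

def rejoin_english(text: str) -> str:
    # Preparation: strip surrounding spaces and drop empty lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    text = "\n".join(lines) + "\n"
    # Tag paragraph-ending lines, then tag every line not already tagged.
    text = re.sub(rf"({END_PUNCT})\n", r"\1@@@\n", text)
    text = re.sub(r"([^@])\n", r"\1###\n", text)
    # Merge everything into one run.
    text = text.replace("\n", "")
    # A word hyphenated across a break now reads "la-###zy": close the gap.
    text = text.replace("-###", "")
    # Remaining ### tags become spaces, @@@ tags become paragraph breaks.
    return text.replace("###", " ").replace("@@@", "\n")

sample = (
    "The quick brown fox jumped over the la-\n"
    "zy dog and ran\n"
    "away.\n"
    "Nobody saw it\n"
    "again.\n"
)
print(rejoin_english(sample))
```

The sketch removes a hyphen only where it sits directly before a ### tag. That is a slightly narrower rule than replacing every hyphen followed by a space, and it leaves spaced hyphens inside a line alone. Genuinely hyphenated compounds split across lines would still lose their hyphen.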

#3
Fanatic
Posts: 569
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
I have a similar approach, given that the start of the next paragraph is less determinative than the end of the preceding one when it comes to rejoining incorrectly broken paragraphs.
But the search above looks like it would have thousands of false positives. Basically, it's going to match the end of every single paragraph in the book. You might as well just search for </p>. This seems... inefficient, no?
FWIW, I use:
Code:
([a-z]|[a-z]-|,|(?<!nbsp| );|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=</p>\s+<p[^>]*?>[a-z])) </p>\s+<p[^>]*?>
Code:
\1(followed by a space)
This will not catch everything, obviously. And it'll rejoin things like verse, which should not be rejoined. Which is why, to JSWolf's overstated point in the other thread, any kind of automation like this needs to be combined with a quick visual page-by-page scan of the original doc (for example, a pdf) to find stragglers (mostly where the last line of a page ends in terminal punctuation but isn't actually the end of the paragraph) and to un-join verse and other idiosyncratically formatted blocks. And some errors are just going to be unavoidable without an absurd (and unhelpful) level of obsessiveness... but this is true of physical media as well.
PS - this is also paired with dozens of other searches, some of which can help quickly identify other cases of incorrectly broken paragraphs, such as searching for quotations that haven't been appropriately closed, e.g.:
Code:
“Watch out! Stay away from there! It's not safe. Stay close to me,” he said.
Last edited by ElMiko; Yesterday at 06:18 AM.
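The unclosed-quotation check mentioned in the PS could look roughly like this in Python. This is not ElMiko's actual search, which is not shown in the thread. It is a hypothetical sketch that flags any line whose count of opening curly quotes differs from its count of closing ones, and the position of the line break in the sample is an assumption.

```python
def unbalanced_quote_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines whose “ and ” counts differ."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.count("“") != line.count("”"):
            flagged.append(number)
    return flagged

sample = (
    "“Watch out! Stay away from there! It's not safe.\n"
    "Stay close to me,” he said.\n"
    "“Fine,” she replied.\n"
)
print(unbalanced_quote_lines(sample))  # [1, 2]
```

Flagged lines still need a human look: dialogue that legitimately runs over several paragraphs leaves the opening quote unclosed on purpose.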

#4
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
To be clear, I'm using plain text to do the correction. Working on this with lots of tags would be a pain in the a$$.

#5
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
Quote:
That's why I chose to look for the ends: there are just too many kinds of things that could be joined. The problem with finding ends arises when a break happens to fall right after punctuation inside a quotation. But as we know, you can never be 100% sure, and I think that's tolerable. At least it's as much as a few lines of regex can do.
Last edited by icearch; 05-29-2026 at 09:32 AM.

#6
A Hairy Wizard
Posts: 3,488
Karma: 21099999
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Sigil’s ability to give a list of changes, along with a short excerpt of words before and after, and then allow you to quickly select (or deselect) which changes you wish… would make it easier to quickly check for those few outliers.

#7
Fanatic
Posts: 569
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
Quote:
One thing you could do is replace wherever I have "</p>\s+<p[^>]*?>" with "\n", and you should get the same effect in the context of a plain text file (the italics tag will obviously be ignored).
In any event, one thing that your search doesn't capture, ironically, is lines that end in a letter... i.e. the most common type of broken line in pdf conversions.
My search is super old, so I was still using "[a-z]" for capturing lowercase letters when I first started iterating it, but it'd probably be improved by using "\p{Ll}" (that's uppercase L followed by lowercase L, not uppercase I). Or even just "\p{L}"...
Last edited by ElMiko; 05-29-2026 at 01:47 PM.
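To see what switching from "[a-z]" to "\p{Ll}" buys, here is a small Python sketch. Python's built-in `re` module does not support `\p{...}` classes, so the sketch uses `unicodedata.category` as a stand-in for the same test. The helper name `ends_in_lowercase_letter` and the sample strings are invented for illustration.

```python
import re
import unicodedata

def ends_in_lowercase_letter(line: str) -> bool:
    # Category "Ll" is the set of characters that \p{Ll} matches.
    return bool(line) and unicodedata.category(line[-1]) == "Ll"

samples = ["went to the caf", "went to the caf\u00e9", "去了咖啡"]
for line in samples:
    ascii_hit = re.search(r"[a-z]$", line) is not None
    print(line, ascii_hit, ends_in_lowercase_letter(line))
```

The accented é is missed by "[a-z]" but caught by the Ll test. The Chinese character is caught by neither, because Han characters have category Lo (letter, other). That is why "\p{L}" or a script-based class is the relevant one for Chinese text.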

#8
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
Quote:
You said that I didn't capture the most common type of broken line, the one that ends with a letter. That is exactly what I'm going to achieve, because I'm going to:
1. Find any line that looks like the end line of a paragraph, that is, one that ends with punctuation, i.e. a non-broken line.
2. Mark that end line. So all broken lines are left unmarked.
3. Remove every \n, so everything merges into one giant paragraph, with special markings to indicate where every paragraph is supposed to end.
4. Replace the end markings with \n.
By not trying to find the broken lines, I avoid having to distinguish all the weird conditions. After all, the ultimate goal is to re-arrange the paragraphs; merging everything together and then separating it again works fine too.
As for having thousands of results, yeees...? Finding the broken ones would give ten times more results, so I really didn't get what you mean.
Hope for the best.
Last edited by icearch; 05-29-2026 at 04:34 PM.

#9
Fanatic
Posts: 569
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
Quote:
I said your regex wouldn't capture the most common type of broken line: namely, the kind of line that ends in a letter. For example:
Code:
Some day, when we have enough money, we will go to Disney World.
This is simply not feasible. I don't just mean it's a lot of work. I mean it's impossible to do effectively. This is for the same reason airport security checkers become ineffective after about 20 minutes; your brain is just going to start "autocorrecting" based on expectations. This is also why you need to find a way to automate the process as much as you can.
In my experience (and I've done literally thousands of these), any single search that requires you to individually check more than 180 results is going to cause a noticeable quality drop-off in your final product. It's the difference between doing a readthrough and catching an error every 20-50 pages, and doing a readthrough and finding errors every second or fifth page. We're talking an order of magnitude.
I'm not here to yuck your yum. You floated a proposed solution, asking for feedback as to whether it's possible. The answer is: Yes, your search is not an inherently broken search. It should return every single instance of a line that ends in the punctuation marks you listed between those brackets (although, like I said, \p{P} may be more comprehensive as a regex solution than listing each punctuation mark that you can think of individually).
But, as to whether this is an effective way to correct erroneously broken lines, I think that the originally proposed regex solution in the first post has real issues. If the ONLY thing you had to worry about in a given document were correctly reflecting paragraph breaks, I'd STILL say this approach would be problematic. But when you consider that most files that require you to rejoin erroneously broken lines also have a whole host of other issues (often related to the OCR process), it is, strictly in my opinion, a misallocation of mental (and temporal) resources to spotcheck every single instance of a line ending in a punctuation mark.
Last edited by ElMiko; 05-29-2026 at 07:44 PM.

#10
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
I still don't get what you mean. Of course I'm not going to mark every paragraph-ending line manually.
Considering our language barrier, I'll show you with some random text.
1. This is some random text from a novel pdf; it contains lines that end with lots of different things. I use my first regex to find the paragraph-ending lines.
2. After the first replace:
3. Tag every other line with another tag:
4. Done with that:
5. Remove every \n:
6. After that:
7. Get the desired \n back:
8. Result:
9. Get the spaces back:
10. Final result:
The rest is to place each paragraph in p tags. I think it pretty much does what it should? I can't understand why you said it can't handle basic broken lines that end with letters.
As to why I need to list every punctuation mark instead of using \p{P}: that's because \p{P} can match 1. the opening half of a pair, namely ( [ { and 2. non-ending marks like : , ; - and such, which you don't want. Those are very likely to appear where a broken line ends, and matching them is totally avoidable.
Last edited by icearch; 05-29-2026 at 10:22 PM.
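The point about \p{P} can be checked with Python's `unicodedata` module, which reports the Unicode general category that `\p{P}` is built on (every category starting with "P"). This is a small illustrative sketch; the helper name `is_punct` and the list of marks are chosen here, not taken from the thread.

```python
import unicodedata

def is_punct(ch: str) -> bool:
    # \p{P} covers every general category that starts with "P".
    return unicodedata.category(ch).startswith("P")

marks = "([{,;:-.?!)]}」》”"
for ch in marks:
    print(repr(ch), unicodedata.category(ch), is_punct(ch))
```

Opening brackets come out as Ps and closing ones as Pe, so a class could separate those. But the comma, semicolon, colon, period, question mark and exclamation mark all share category Po, so Unicode categories alone cannot tell a sentence-ending mark from a mid-sentence one. Either an explicit list or an exclusion list is still needed.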

#11
Fanatic
Posts: 569
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
I do think the language barrier is part of what's at the heart of the confusion. What I said was that your original regex
Code:
([\..。\??\!!>》\))\]】}…::—'"’”\|」』@])\n
But the main point I was trying to make is that your approach requires you to go through and manually designate which lines should be paragraph breaks, one by one. At book scale (rather than the fewer than a dozen lines in your example) that's going to be incredibly inefficient and time-consuming... and, as I said, massively error-prone because of the sheer volume of manual checks that you will have to perform.
You asked for feedback on the regex; I've explained why I think it's inefficient. You think that performing 10 searches (one of which requires you to individually check literally thousands of matches in the span of a book) is more efficient for your workflow than performing 1. Fine. You think that it's better to type out each punctuation mark you want rather than writing an exclusion regex for the handful that you don't want (e.g. "(\p{P})(?<![,:;\-\[\({])\n"). Also, fine.
And when I say "fine", I mean it sincerely. This is a hobby (at least for folks like you and me); we should do only what makes us happy. If your regex solution is contributing to your ebook creation being more enjoyable, then do it. Your approach may even change over time as your workflow evolves... or not! But when you asked for feedback, I gave it.
As I said, I agree fundamentally with the view that when trying to join broken lines, looking for how a <p> (or line, in your case) begins is probably less effective on the whole than looking for how it ends. I just don't fully endorse your particular approach. I know that for my workflow, I'd grow frustrated after cycling through just ONE chapter, let alone a whole book. But that's my workflow and my character.
---
P.S. I promise, what follows is my final attempt to articulate what I see as the fatal flaw in your approach:
In your example, the excerpt you used had 102 words. And your initial regex search matched 7 of the 8 lines that you included in your example.
If we extrapolate from that to a full-length novel (the average word count for sci-fi/fantasy is 90-120 thousand words), you would have to manually check between 6,176 matches (i.e. 90,000 divided by 102 multiplied by 7) and 8,235 matches (i.e. 120,000 divided by 102 multiplied by 7). If, as you seem to be indicating, you are dealing with a PDF-to-EPUB conversion where every single pdf line is broken, this is the scale of match-by-match work that your regex will require.
Last edited by ElMiko; 05-29-2026 at 11:32 PM.
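The extrapolation in the P.S. is easy to reproduce. The short Python sketch below only restates the arithmetic from the post; the function name `estimated_matches` is invented, and the result is truncated to a whole number of matches.

```python
WORDS_IN_EXCERPT = 102    # size of the sample excerpt, per the post
MATCHES_IN_EXCERPT = 7    # lines matched by the first regex, per the post

def estimated_matches(novel_words: int) -> int:
    # Scale the match rate of the excerpt up to a full novel.
    return int(novel_words / WORDS_IN_EXCERPT * MATCHES_IN_EXCERPT)

for novel_words in (90_000, 120_000):
    print(novel_words, estimated_matches(novel_words))
```

This assumes the match rate of the short excerpt holds across the whole book, which is a rough guess rather than a measurement.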

#12
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
That's why I'm confused: I did not go through every line by hand, one by one! I just don't know why you think that!
And did I do something wrong? Running your code does not work in Sigil, and putting it through a regex checker reports multiple errors.
PS. I'm very grateful for your feedback and genuinely wish everyone the best, but since I just can't get what you are saying, I keep explaining. I do not need the broken lines to be fixed 100% correctly; I just want every paragraph to be reasonably whole and readable. I do not mark them one by one, and I do not check them one by one. I ran my code on some full-length Chinese novels and it seems to be fine. I have not tried it on full-length English text, though.
Last edited by icearch; Yesterday at 03:20 AM.

#13
Fanatic
Posts: 569
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
When you mark a line with @@@, you are indicating that it's a paragraph break, right? (This is so that you can re-break it at the end of your regex cycle.)
You remember my example of dialogue that's been broken at a terminal punctuation?
Code:
“Watch out! Stay away from there! It's not safe. Stay close to me,” he said.
Code:
“Watch out! Stay away from there! It's not safe,” he warned, before turning around and running away.
Code:
At this point, I looked beseechingly at Mr. Jones, and said, “He left us!”
The issue with the assumption built into your original regex (if you don't have a separate regex to deal with the kinds of situations I outlined above) is that it incorrectly treats all terminal punctuation marks as necessarily preceding a paragraph break, and it treats several kinds of punctuation marks as terminal when they aren't necessarily terminal at all (e.g. ” does not always denote the end of a sentence, much less the end of a paragraph). The only way to avoid this would be to check one by one, inserting the "@@@" manually when you determine that it really IS the end of a paragraph. And doing THAT would take forever.
---
For reasons that escape me, my regex got corrupted at some point on MobileRead, and it replaced some elements with asterisks that aren't in the original regex. In any event, the original regex I shared with you wouldn't work because it's designed for html, not plain text. (Also, I'm not sure what text editor you're using... I use Sigil... but then again I only edit in html.) I modified it for plain text here:
Code:
(\p{Ll}|\p{Ll}-|,|(?<!nbsp| );|,”|[MD][rs]\.|Mrs\.|\b[AI]|(”|—|</i>)(?=\n\p{Ll}))\n
Note that it will not rejoin at two points (marked with red highlights) where the lines OUGHT to be joined. But your regex sequence will have the same problem, too. To solve the second red highlight, I use a regex search that specifically targets broken/incomplete opening/closing quotations. To solve the first red highlight... as far as I can tell, you can't. Except by going line by line and fixing it manually.
Incidentally, this last point goes to something I said in a related thread recently: this is one of many issues with trying to fix direct PDF-to-EPUB conversions. You're much better off running OCR on the original image and producing a pdf reference copy and a separate html/epub copy. Most OCR software will be smart enough to recognize the vast majority (90%) of correct paragraph breaks.
EDIT: it corrupted the code again... and I've fixed it again. Hopefully it stays fixed this time...
Last edited by ElMiko; Yesterday at 07:45 AM.
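The two failure modes described in this post can be made concrete with a short Python sketch. It is only an illustration. `END_PUNCT` is an assumed subset of terminal marks. `TITLE` borrows the `[MD][rs]\.|Mrs\.` idea from the regex above, but is used here as a guard rather than a join rule. The places where the sample lines are broken are assumptions, since the original line breaks are not preserved in the thread.

```python
import re

END_PUNCT = re.compile(r"[.?!…”’]$")           # assumed subset of terminal marks
TITLE = re.compile(r"\b(?:[MD][rs]|Mrs)\.$")    # Mr. Ms. Dr. Mrs. at a line end

def looks_like_paragraph_end(line: str) -> bool:
    line = line.rstrip()
    if TITLE.search(line):
        # An abbreviation at the line end is not a sentence ending.
        return False
    return END_PUNCT.search(line) is not None

broken = [
    "“Watch out! Stay away from there! It's not safe.",  # break inside dialogue
    "At this point, I looked beseechingly at Mr.",        # break after a title
    "Jones, and said, “He left us!”",                     # a real paragraph end
]
print([looks_like_paragraph_end(line) for line in broken])  # [True, False, True]
```

The title guard handles the second line, but the first line is still reported as a paragraph end even though the dialogue continues. That is the case the post says cannot be settled by punctuation alone.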

#14
Groupie
Posts: 156
Karma: 2000
Join Date: Nov 2025
Device: none
So sad, your code still does not work correctly. Maybe try pasting it into a txt file and attaching it to the post. I'd like to try it.
You made a good point that a period on a non-end line would cause problems. My regex surely does not cover that kind of thing. The in-quotation case may be fixable: I can add another regex targeting broken quotations and fix them first. But I think that, as with two quotations in a row, it does not affect the reading flow that much. Mr. is very annoying, though. I need to cover that.
As for one regex to fix them all, I do think breaking it into separate steps can reduce errors, and Sigil has automation sequences. But that's personal taste.
As for the things that are inevitable, I can only say that they are the fault of old age or poor formatting choices. We do not have to punish ourselves over that.

#15
Fanatic
Posts: 569
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
Quote:
Also, FYI, for Chinese characters you would probably want to replace every instance of "\p{Ll}" in my search with either "\p{Han}" or "\w" (depending on the regex engine). Also, my memory of Chinese grammar and punctuation is 20 years stale, so I'm not sure to what degree the punctuation assumptions built into my regex will apply to Chinese style guides.
Quote:
Quote:
Quote:
Quote:
Last edited by ElMiko; Yesterday at 11:34 AM.