Zealot
Posts: 143
Karma: 2000
Join Date: Nov 2025
Device: none
Using regex to fix broken paragraph in Chinese
I have an idea and would like someone familiar with regex and text processing to tell me whether it is doable.

The logic is simple: I am not aiming for 100% accuracy, just to get rid of the annoying line breaks. Chinese has no spaces to separate words and no capital letters to mark the beginning of a sentence, so I approached it from the other direction: what can identify the end of a sentence? Punctuation! If I regex-search for a punctuation mark immediately followed by a line break, that will be the end of a paragraph 99% of the time. The rest is easy.

This is what I came up with to find the endings:

([\..。\??\!!>》\))\]】}…::—'"’”\|」』@])\n

Replace it with:

\1@@@\n

Then replace @@@ with \n.

Of course, you need to prepare the text first by removing excess spaces and empty lines.

So what do you all think? Is there anything to improve?
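For anyone who wants to try the idea outside an editor, here is a minimal Python sketch of the steps described above. Several things in it are assumptions rather than part of the post: the function name `rejoin_paragraphs` and the sample text are made up for illustration; the "easy" middle step is taken to mean deleting every line break that is left before the markers are turned back into line breaks; and the punctuation list includes full-width forms such as ？ ！ ） ：, on the guess that the doubled characters in the posted class were half-width and full-width pairs. The `@` is left out of the list so that a line ending in `@` cannot run into the `@@@` marker.

```python
import re

# Punctuation treated as a likely paragraph ending. The full-width forms
# are an assumption about what the posted character class contained.
ENDINGS = ".。?？!！>》)）]】}…:：—'\"’”|」』"
END_CLASS = "[" + re.escape(ENDINGS) + "]"


def rejoin_paragraphs(text, marker="@@@"):
    """Join lines that were broken mid-paragraph (hypothetical helper)."""
    # Preparation: trim surrounding spaces and drop empty lines.
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(line for line in lines if line)

    # Step 1: punctuation right before a line break marks a paragraph end.
    text = re.sub(
        "(" + END_CLASS + ")\n",
        lambda m: m.group(1) + marker + "\n",
        text,
    )

    # Step 2 (assumed): every line break is removed, including the ones
    # that follow a marker, so only the markers record paragraph ends.
    text = text.replace("\n", "")

    # Step 3: turn the markers back into real line breaks.
    return text.replace(marker, "\n")


if __name__ == "__main__":
    sample = "今天天气很好，我们\n一起去公园散步。\n\n  他说：“好的！”\n然后我们就出\n发了。"
    print(rejoin_paragraphs(sample))
```

The marker only works if `@@@` never occurs in the text itself, so it is passed as a parameter and can be swapped for any string known to be absent. Lines that break after a comma (，) are joined to the next line on purpose, since a comma rarely ends a paragraph.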