Today, 04:46 AM   #1
icearch
Zealot
Using regex to fix broken paragraphs in Chinese

I have an idea, and I need someone familiar with regex and text processing to tell me whether it is doable.

The logic is simple: I am not aiming to be 100% correct, just to get rid of the annoying breaks.

Since Chinese has no spaces to separate words and no capital letters to mark the beginning of a sentence, I approached it the other way round: what can be used to identify the end of a sentence?

The answer: punctuation!

So I will regex-search for a punctuation mark immediately followed by a line break; 99% of the time that will be the end of a paragraph!

And the rest is easy.

So this is what I came up with to find the endings:

([\.．。\?？\!！>》\)）\]】}…:：—'"’”\|」』@])\n

and replace it with:

\1@@@\n

and then replace:

@@@ with \n


Of course, you need to prepare the text first by removing excess spaces and empty lines.
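
For anyone who wants to try this outside an editor, here is a rough Python sketch of the whole procedure. Several things in it are assumptions rather than part of the recipe above: the function name `rejoin_paragraphs` is made up, the character class includes full-width variants of the ASCII marks, the temporary marker is a private-use character instead of `@@@` (so it cannot collide with an `@` in the text), and the middle step of deleting all remaining line breaks is my reading of "the rest is easy".

```python
import re

# Punctuation that is assumed to end a paragraph when it sits right before
# a line break. ASCII and full-width forms; not an exhaustive list.
END_PUNCT = r"""[.．。?？!！>》)）\]】}…:：—'"’”|」』@]"""

# Temporary marker. A private-use code point is an assumption made here;
# the post itself uses the literal string @@@.
MARK = "\uE000"


def rejoin_paragraphs(text: str) -> str:
    # Preparation: trim whitespace on each line and drop empty lines.
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(line for line in lines if line)

    # Step 1: a line break right after closing punctuation is treated as a
    # real paragraph end, so swap it for the marker.
    text = re.sub(f"({END_PUNCT})\n", r"\1" + MARK, text)

    # Step 2 (assumed): every line break still left is a false break.
    text = text.replace("\n", "")

    # Step 3: turn the markers back into real line breaks.
    return text.replace(MARK, "\n")


if __name__ == "__main__":
    sample = "今天天气很\n好，我们去公\n园散步。\n明天再说吧！\n"
    print(rejoin_paragraphs(sample))
```

As with the regex itself, this will wrongly keep a break wherever a hard-wrapped line happens to end in punctuation in the middle of a paragraph, so it is only meant to remove most of the annoying breaks, not all of them.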

So what do you guys think? Is there anything to improve?