Old 08-10-2016, 02:58 AM   #6
kacir
Quote:
Originally Posted by AlexBell
... Also, the author has not separated 'real' paragraphs in the text.

Can anyone suggest a way to get rid of the excess carriage returns/paragraphs?
Well ... it depends on how the book is formatted.

Personally, I would use regular expressions.
If there is something like an empty line between real paragraphs, I would go with the quick solution Doitsu suggested in the previous post.
If there is no empty line between paragraphs, there might be a tab character at the beginning of each paragraph or, if you are lucky, a few spaces, or the line might have a different indent. I would try to use that.
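To illustrate the indent idea, here is a rough Python sketch (not something from the thread). It assumes a real paragraph starts with a tab or at least two spaces and that every other line is just a wrapped continuation. The function name join_by_indent is made up for this example.

```python
def join_by_indent(text):
    """Rebuild paragraphs, assuming each real paragraph starts with a tab
    or at least two spaces (an assumption about the source file)."""
    paragraphs = []
    # Normalize CRLF to LF so the split works on either kind of file.
    for line in text.replace("\r\n", "\n").split("\n"):
        if not line.strip():
            continue  # ignore blank lines
        starts_paragraph = line.startswith("\t") or line.startswith("  ")
        if starts_paragraph or not paragraphs:
            paragraphs.append(line.strip())
        else:
            # Not indented: treat as a continuation of the current paragraph.
            paragraphs[-1] += " " + line.strip()
    return "\n".join(paragraphs)


sample = "\tFirst line\r\nwraps here.\r\n\tSecond para\r\nalso wraps."
print(join_by_indent(sample))
```

If the book marks paragraphs some other way, only the starts_paragraph test needs to change.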

If all else fails, I would find all lines that end with a dot followed by a CRLF and replace the CRLF with something like ### real paragraph here ###, then do the same for question marks, exclamation points, and a dot followed by a [closing] quote mark ... you get the idea.
Then I would replace all remaining CRLFs with a space, replace all the ### real paragraph here ### markers with CRLF, and then replace two consecutive spaces with one (repeatedly, until there are no more to replace).
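The marker steps above could be sketched in Python roughly like this. It is only a heuristic: a wrapped line that happens to end with a full stop in the middle of a paragraph will still be treated as a paragraph end. The function name and the exact set of closing quote characters are my assumptions.

```python
import re

MARKER = "### real paragraph here ###"


def rejoin_with_markers(text):
    # Step 1: sentence-ending punctuation (optionally followed by a closing
    # quote) right before a CRLF is taken as a real paragraph end.
    # Keep the punctuation and swap the CRLF for the marker.
    text = re.sub(r"([.?!][\"'”’]?)\r\n", r"\1" + MARKER, text)
    # Step 2: every CRLF that is left is just a wrapped line.
    text = text.replace("\r\n", " ")
    # Step 3: turn the markers back into real line breaks.
    text = text.replace(MARKER, "\r\n")
    # Step 4: collapse double spaces until none remain.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


sample = 'He said\r\n"stop."\r\nShe left\r\nthe room.\r\n'
print(rejoin_with_markers(sample))
```

The marker only has to be a string that never occurs in the book itself.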

Or you could craft a regular expression that replaces any letter followed by a CRLF (end of line/paragraph) with the same letter followed by a space.
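As a sketch of that single-expression approach in Python (the function name is hypothetical): the character class [^\W\d_] is one way to say "any letter", including accented ones. Lines ending in a comma or other punctuation are left alone here, so they would need a similar rule of their own.

```python
import re


def join_after_letters(text):
    # A letter directly before CRLF means the sentence is still running,
    # so keep the letter and replace the line break with a space.
    return re.sub(r"([^\W\d_])\r\n", r"\1 ", text)


sample = "The quick\r\nbrown fox.\r\nNext para\r\ngraph."
print(join_after_letters(sample))
```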

Another trick would be to use the elaborate algorithms that OCR programs use. Just print the text to a PDF and run that through an OCR program ;-)
OCR programs use the tricks described above, plus they look at the number of characters on each line, at the justification (whether the text is fully justified), and at many other clever cues.

It also depends on how much of the original formatting from the Word document you want to preserve.
I might just use Word's search and replace to insert formatting markup: search for specific formatting (such as a style) and place marks like {H1} at the beginning of the text where the formatting changes. Then I would export the text to a *.txt file and massage that with a powerful editor that has real regular expressions (Gvim is my choice).
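What the massaging step might look like depends on the target format. As one possible illustration, assuming the exported text has {H1} to {H6} marks at the start of heading lines and that HTML headings are wanted (both are my assumptions, as is the function name), a Python regular expression could convert them:

```python
import re


def markers_to_html(text):
    # Turn a line such as "{H1}Chapter One" into "<h1>Chapter One</h1>".
    # Assumes LF line endings; convert CRLF first if the export uses it.
    return re.sub(r"^\{H([1-6])\}(.*)$", r"<h\1>\2</h\1>", text,
                  flags=re.MULTILINE)


sample = "{H1}Chapter One\nSome text.\n{H2}A subsection\nMore text."
print(markers_to_html(sample))
```

The same pattern should work as a search and replace in Gvim with minor syntax changes.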