MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Calibre (https://www.mobileread.com/forums/forumdisplay.php?f=166)
-   -   Removing excess carriage returns (https://www.mobileread.com/forums/showthread.php?t=47044)

Halk 05-17-2009 08:30 AM

Removing excess carriage returns
 
I have some old txt files that I'm trying to switch to ebooks.

Many of them have sentences broken by carriage returns.
E.g.
"The sentence is fine, and in most cases paragraphs are intact, but perhaps one
in every five sentences contains a carriage return in the middle, which is mildly annoying when reading on my Cybook."

The common factor is that there's no punctuation before the carriage return. Is there any way to sort this out? I was thinking perhaps I could get Calibre to delete any carriage returns that are not preceded by .!? or ." !" ?"
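
For illustration, the rule described here (drop a line break unless the text before it ends in . ! or ?, optionally followed by a closing double quote) could be sketched in Python with a regular expression. This is only a sketch, not something Calibre is known to do itself: the name join_unpunctuated_breaks is made up for the example, and it assumes that blank lines separate paragraphs and should be left alone.

```python
import re

# A break is treated as "excess" when the text before it does not end in
# . ! ? (optionally followed by a closing double quote) and the next line
# is not blank. \r and \n are in the first lookbehind so that the \n of a
# kept \r\n pair, or the second break of a blank line, is never matched.
_EXCESS_BREAK = re.compile(r'(?<![.!?\r\n])(?<![.!?]")\r?\n(?!\r?\n)')

def join_unpunctuated_breaks(text):
    """Replace excess line breaks with one space (hypothetical helper)."""
    return _EXCESS_BREAK.sub(' ', text)
```

A line that genuinely ends without punctuation, such as a chapter heading followed directly by text, would also be joined, so the result still needs a read-through.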

comtrjl 05-17-2009 08:54 AM

The 'common factor' is probably that these misplaced carriage returns are followed by lowercase letters (not necessarily every single time - but mostly).
If you have MSWord or similar, you could try doing a search, or search and replace, for ^13[a-z].
bob
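
If Word is not to hand, the same kind of search can be sketched in Python to preview which lines would be affected before changing anything. The helper name find_suspect_breaks is invented for this example, and "lowercase" here means ASCII a-z only.

```python
import re

# Line break (LF or CRLF) immediately followed by a lowercase ASCII letter.
_SUSPECT = re.compile(r'\r?\n(?=[a-z])')

def find_suspect_breaks(text):
    """Return 1-based numbers of lines ending in a suspect break (hypothetical helper)."""
    # Counting the \n characters before each match gives the line it ends.
    return [text.count('\n', 0, m.start()) + 1 for m in _SUSPECT.finditer(text)]
```

Calling find_suspect_breaks on the contents of a file lists the lines worth checking by eye.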

user_none 05-17-2009 09:44 AM

This bit of python code should work for what you want:

Code:

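>>> f = open('test', 'rb+')  # binary read/write, so newlines are left untouched
>>> text = f.read()
>>> text = text.replace(b'\r\n', b'\n')  # set Windows line endings aside
>>> text = text.replace(b'\r', b' ')     # any \r left is a stray one
>>> text = text.replace(b'\n', b'\r\n')  # put the Windows line endings back
>>> f.seek(0)
>>> f.truncate(0)
>>> f.write(text)
>>> f.close()

\r\n is the standard newline sequence on Windows systems. We first replace it with \n, which is what Unix systems use, so that every \r still left in the text is a stray one and can be replaced with a single space. Then we turn the \n's back into \r\n's. The b'' literals are there because a file opened in binary mode returns bytes on Python 3.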

gwynevans 05-17-2009 09:55 AM

Quote:

Originally Posted by Halk (Post 460998)
The common factor is that there's no punctuation before the carriage return. Is there any way to sort this out? I was thinking perhaps I could get Calibre to delete any carriage returns that are not preceded by .!? or ." !" ?"

Frankly, a decent text editor is all you'd need for this - something like UltraEdit or TextPad or similar that can handle regular expression replacements. In UE, it'd be something like Replace "\r\n([a-z])" with " \1".
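
For anyone without such an editor, roughly the same replacement can be run over a whole file in Python. This is a sketch under assumptions: the function name join_broken_lines, the path arguments and the utf-8 default are all made up for the example (old txt files may well use another encoding, such as cp1252), and it writes to a separate output file rather than changing the input.

```python
import re

def join_broken_lines(src_path, dst_path, encoding='utf-8'):
    """Write a copy of src_path with mid-sentence breaks joined (hypothetical helper)."""
    # newline='' stops Python from translating \r\n, so the pattern sees it as-is.
    with open(src_path, 'r', encoding=encoding, newline='') as src:
        text = src.read()
    # Same idea as the editor replace: break + lowercase letter -> space + letter.
    fixed = re.sub(r'\r\n([a-z])', r' \1', text)
    with open(dst_path, 'w', encoding=encoding, newline='') as dst:
        dst.write(fixed)
    return fixed
```

Python writes the backreference as \1, the same as the UltraEdit form quoted here.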

rogue_ronin 05-17-2009 01:47 PM

Don't forget to add a space! Or you'll be spell checking for days because you stuck two words together at each join.

It's easy to switch all occurrences of multiple spaces to one space, though, if you happen to double up. So first...

Find:
Code:

\r\n([a-z])
Replace:
Code:

\s$1
which will concatenate your lines, then...

Find:
Code:

([a-z])\s+
Replace:
Code:

$1\s
The second regex only touches whitespace that follows a lowercase letter, so punctuation followed by two (or more) spaces is preserved. For the same reason, it won't catch the extra space left behind when a hyphen, colon, semi-colon, or some other non-lowercase character ends up at the end of a joined line.
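
As a rough Python equivalent of the two passes, here is a minimal sketch. The function name join_and_tidy is invented for the example, and two details differ from the editor syntax on purpose: Python writes the group as \1 rather than $1 and rejects \s in a replacement string (Python 3.7 and later raise an error), so a literal space is used; and the second pattern uses [ \t]+ instead of \s+, because in Python \s also matches line breaks and would join every line that ends in a lowercase letter.

```python
import re

def join_and_tidy(text):
    """Join broken lines, then collapse doubled spaces (hypothetical helper)."""
    # Pass 1: line break followed by a lowercase letter -> space + that letter.
    joined = re.sub(r'\r\n([a-z])', r' \1', text)
    # Pass 2: lowercase letter followed by a run of spaces/tabs -> letter + one space.
    return re.sub(r'([a-z])[ \t]+', r'\1 ', joined)
```

Whether a given editor treats \s+ as spanning line breaks depends on its regex engine, so that is worth checking before running the second pass there.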

Try it on a copy first.

m a r

Halk 05-17-2009 03:35 PM

Thanks folks!

