MobileRead Forums - View Single Post - Removing unnecessary line breaks in books.

Wintersdark · 08-19-2010, 05:43 PM

Quote:

Originally Posted by kovidgoyal

Doing that will mean he'll lose all character formatting (italics, bold, etc). IIRC the TXT output plugin doesn't preserve those.

This isn't a problem. While character formatting is nice, having readable text at all is nicer.

You're right, it's not all .lit -> .epub. I'd (incorrectly) assumed it was, since several books I checked all suffered from the same problem. Further investigation shows that's not the case - good news!

I tried converting to text and back, but the way the text output is formatted, I basically get each paragraph followed by a pair of CR/LFs. So converting directly back to epub doesn't help.

However, as it's not every book, I'm just addressing it on a case-by-case basis with Notepad++ as I go. If I were still running Linux, I'd mass convert them all to text and figure out how to script applying the regex replace to them, but I have no idea how to go about that in Windows.

Unfortunately, Notepad++ won't let you use \r\n in regex searches (who knows why), but you *can* use an "extended" search to replace all the CRLF pairs with a unique identifier (I used QQQQ). Then simply replace every .QQQQ and "QQQQ with the same punctuation mark followed by \r\n\r\n, and finally replace all remaining QQQQs with spaces. It's sort of a pain in the ass to have to do it one at a time, but it works at least.
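For anyone who would rather script it, here is a rough Python sketch of those same three steps. It is not something I've run against a whole library. The names unwrap and unwrap_folder are made up for illustration, and it assumes the books are already plain .txt files with CRLF line endings, in an encoding you know (UTF-8 here), and that QQQQ never occurs in the real text.

```python
import re
from pathlib import Path

# Hypothetical placeholder; must be a string that never appears in the book.
PLACEHOLDER = "QQQQ"


def unwrap(text: str) -> str:
    """Rejoin hard-wrapped lines using the three-step placeholder trick."""
    # Step 1: turn every CRLF into the placeholder.
    text = text.replace("\r\n", PLACEHOLDER)
    # Step 2: a placeholder right after a period or a double quote is treated
    # as a real paragraph end: keep the punctuation, add a blank line.
    text = re.sub(
        r'([."])' + re.escape(PLACEHOLDER),
        lambda m: m.group(1) + "\r\n\r\n",
        text,
    )
    # Step 3: every remaining placeholder was a mid-sentence break.
    return text.replace(PLACEHOLDER, " ")


def unwrap_folder(folder: str, encoding: str = "utf-8") -> None:
    """Apply unwrap() in place to every .txt file in a folder (assumed layout)."""
    for path in Path(folder).glob("*.txt"):
        # Work on bytes so Python does not translate the CRLF pairs itself.
        raw = path.read_bytes().decode(encoding)
        path.write_bytes(unwrap(raw).encode(encoding))
```

Calling unwrap_folder on a folder of exported text files would handle them all in one go, but it overwrites the files, so try it on copies first. Like the manual version, it only recognises a paragraph end after a period or a closing double quote, so paragraphs ending in ? or ! would get merged into the next one.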

If anyone knows a better tool to do this with - one that can macro the operations, or apply a regex directly, or better yet be applied in bulk, in Windows, with a minimum of hassle for someone not used to dealing with these things - I'd love to hear about it. But even if not, this does work.

As a feature request for Calibre I'd definitely like to see, for this and other formatting issues, the ability to apply a regex directly in the conversion options (or some such easily accessible place). It would really help people cleaning up poor source material when converting to their ereader format of choice.