Old 11-02-2010, 10:44 PM   #3
thrawn_aj
quantum mechanic
Quote:
Originally Posted by kabloooie
I have text and lit files that always come out with spaces between paragraphs instead of indentations.
There's a regular expression way to do it but I don't know how advanced the regex system in Calibre is or even if it can be co-opted by the user to edit the actual contents of the file.

In Notepad++ (or any text editor that supports regex, with minor syntax mods) for instance, I would first convert all line breaks (\r\n usually, for Windows-style files) to some obscure character string that doesn't appear anywhere in your file (say, ###) using the extended mode search and replace. That puts the whole file on one line, which gets around line-by-line regex matching. Note: if you have a multiline regex tool (I'm too lazy to use mine and npp is just too convenient in other ways) you could search for the double line breaks directly and replace them with a single line break plus an indent.

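Then, using its native regex, search for something like ######([^#]+) (since there will be 2 line breaks between paragraphs - and you don't want that) and replace it with ###\t\1. Leave the trailing ###### out of the pattern: a match consumes whatever it covers, so if the closing ###### is part of the match, the next paragraph has no leading ###### left and every other paragraph gets skipped. Then back to extended mode and replace all ### with \r\n.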
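
For anyone who'd rather script it than click through search-and-replace dialogs, here is a rough Python sketch of the multiline-regex route: it looks for the blank-line paragraph breaks directly and swaps each one for a single line break plus a tab. The function name indent_paragraphs is made up for illustration, and indenting the very first paragraph is my own assumption (drop the leading "\t" if you don't want it). Treat it as a starting point, not a tested converter.

```python
import re

def indent_paragraphs(text: str) -> str:
    """Turn blank-line paragraph breaks into a single line break plus a tab.

    Hypothetical helper for illustration only.
    """
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A line break followed by one or more blank lines (which may hold stray
    # spaces or tabs) is treated as a paragraph boundary.
    body = re.sub(r"\n(?:[ \t]*\n)+", "\n\t", text.strip())
    # Assumption: the first paragraph should be indented as well.
    return "\t" + body + "\n" if body else ""


if __name__ == "__main__":
    sample = "First paragraph.\r\n\r\nSecond paragraph.\r\n\r\nThird one."
    print(indent_paragraphs(sample))
```

Single line breaks inside a paragraph (hard-wrapped lines) are left alone, since only runs of two or more line breaks match.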

This is probably overkill for what you're asking, but I think it's useful for other (similar) jobs like wrapping <p> tags around paragraphs and other html manipulations. I cleaned up a bunch of OCR'd stuff last weekend using Notepad++.
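
The <p>-wrapping variant works the same way: split on the blank lines, then wrap each chunk. A minimal Python sketch, assuming paragraphs are separated by one or more blank lines and that hard-wrapped lines inside a paragraph should be joined with spaces (wrap_paragraphs is a hypothetical name, not something from Calibre or Notepad++):

```python
import html
import re

def wrap_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated paragraph in <p>...</p> tags.

    Hypothetical helper for illustration only.
    """
    # Normalise line endings so the split pattern only has to deal with \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on a line break followed by one or more blank lines.
    chunks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    lines = []
    for chunk in chunks:
        # Collapse hard-wrapped lines and repeated spaces into single spaces.
        flat = " ".join(chunk.split())
        if flat:
            # Escape &, < and > so the paragraph text is valid inside html.
            lines.append("<p>" + html.escape(flat, quote=False) + "</p>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(wrap_paragraphs("Tom & Jerry\nran off.\n\nThe end."))
```

The output still needs a surrounding <html>/<body> skeleton before it is a complete file.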

By the way, I've noticed that the result is always more WYSIWYG if you focus your attention on a simply coded (clean, that is) html file and then use that as the master format for converting to anything else (adding an html file to a book record saves it as zip). TOC creation and chapter detection are also much more transparent this way.