MobileRead Forums - View Single Post - cleaning up extra returns formatting mess

kacir · 05-20-2014, 02:44 AM

Quote:

Originally Posted by momtodogs

I searched and found several good suggestions for cleaning these up (i.e. using Find & Replace in Word, using ^p for finding paragraph returns.)

This works fantastically in most cases; however, what "replace code" can I use when the paragraph return symbol is a little crooked down arrow and not the normal paragraph symbol used in Word?

I converted an .lrf to an .rtf file in Calibre, and all the returns, including the extra ones, use the arrow symbol. I tried cut&paste, but it doesn't register.

Is there a secret code for the arrow, like ^p for paragraph?

In MS Word, open the search and replace dialog, press options to get the much larger dialog panel, and then select the "Special" button at the bottom. It has the list of codes, including ^l for the manual line break.

What you are really looking for is not MS Word but "Regular Expressions". Word has only limited abilities compared to other tools, like Calibre. With regular expressions you can say:
Begin group one '\('
find one character that is not from this list: [.?!"']
End of group one '\)'
followed by an end-of-line-symbol \n
Replace with the contents of the group one followed by a space. '\1 '
In regular expression syntax that is something like
substitute/\([^.?!"']\)\n/\1 /
Unfortunately there are several dialects of RE, so you will have to look up the details in the documentation of your tool. For example, "begin a group that I will later refer to as '\1'" (or '\2' and so on if it is the second or third group) is sometimes '\(' and sometimes just '('.
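
To show how the dialect matters, here is a rough sketch of the same substitution in Python's flavour of regular expressions, where a group is written with a bare '(' rather than '\('. The function name and the sample text are made up for illustration. One small change from the expression above is my own assumption: I added \n to the excluded list, because in Python a negated character class can also match a newline, and without that change blank lines between paragraphs would be swallowed.

```python
import re

# Group 1 captures one character that is NOT sentence-ending punctuation,
# a quote, or a newline. The newline right after it is the unwanted break.
# Python writes groups as ( ), not \( \).
UNWANTED_BREAK = re.compile(r"""([^.?!"'\n])\n""")

def join_broken_lines(text: str) -> str:
    """Replace each unwanted line break with a single space (hypothetical helper)."""
    # \1 puts back the captured character, followed by a space.
    return UNWANTED_BREAK.sub(r"\1 ", text)

sample = "It was a dark and\nstormy night.\nThe rain fell."
print(join_broken_lines(sample))
```

The break after "and" is joined, while the break after "night." is kept because it follows a full stop.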

You see, most of the line breaks that do not follow one of [.?!"'] are not at the end of a paragraph.
This is very quick and dirty, but it can clean unwanted line breaks out of an OCRed book with about 99% accuracy.
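
The "quick and dirty" part is easy to see with a small experiment. The sketch below uses Python's RE dialect and an invented snippet of text. The extra \n inside the brackets is my assumption, added to keep blank lines intact in Python. A line that ends a paragraph without punctuation, such as a chapter heading, gets joined to the next line along with the real mid-sentence breaks.

```python
import re

def join_lines(text: str) -> str:
    """Join lines whose last character is not one of . ? ! " ' (hypothetical helper)."""
    return re.sub(r"""([^.?!"'\n])\n""", r"\1 ", text)

# Invented sample: a heading with no closing punctuation, then a wrapped sentence.
ocr_text = "Chapter One\nIt was late,\nand he was tired.\n"
cleaned = join_lines(ocr_text)
print(repr(cleaned))
```

Here the wrapped sentence is repaired correctly, but "Chapter One" ends up on the same line as the sentence, so headings and similar lines still need a manual check afterwards.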

Regular expressions can look very intimidating if you just look at a complex one, but they are well worth learning. Calibre and many other advanced tools support them, and you can start with very simple ones and gradually write more and more complex REs. They will still be relatively difficult to read, because the metacharacter set is very dense so that they can fit inside "search" and "replace" fields, but they become much easier to write after a bit of practice.