Old 05-20-2014, 02:44 AM   #4
kacir
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by momtodogs
I searched and found several good suggestions for cleaning these up (i.e. using Find & Replace in Word, using ^p for finding paragraph returns.)

This works fantastically in most cases; however, what "replace code" can I use when the paragraph return symbol is a little crooked down arrow and not the normal paragraph symbol used in Word.

I converted an .lrf to an .rtf file in Calibre, and all the returns, including the extra ones, use the arrow symbol. I tried cut&paste, but it doesn't register.

Is there a secret code for the arrow, like ^p for paragraph?
In MS Word, open the search and replace dialog, press the options button to get the much larger dialog panel, and then select the "Special" button at the bottom. It shows the list of codes, including ^l for the manual line break.

What you are really looking for is not MS Word, but "Regular Expressions". Word has only limited abilities compared to other tools, like Calibre. With regular expressions you can say:
Begin group one '\('
find one character that is not from this list: [.?!"']
End of group one '\)'
followed by an end-of-line-symbol \n
Replace with the contents of the group one followed by a space. '\1 '
In regular expression syntax that is something like
substitute/\([^.?!"']\)\n/\1 /
Unfortunately there are several dialects of RE, so you will have to look up the exact syntax in the documentation of your tool. For example, "begin a group that I will later refer to as '\1'" (or '\2' and so on if it is the second or third group) is sometimes written '\(' and sometimes just '('.
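
For anyone who wants to try this outside Word, here is a minimal sketch of the same substitution in Python, whose dialect uses plain '(' for groups and '\1' in the replacement. The function name and the sample text are made up for illustration; only the pattern comes from the expression above.

```python
import re

# Group one: a single character that is NOT one of . ? ! " '
# followed by a newline. Python's dialect writes groups as plain ( ).
UNWANTED_BREAK = re.compile(r"([^.?!\"'])\n")

def join_broken_lines(text: str) -> str:
    """Replace each matched line break with a space, keeping group one."""
    return UNWANTED_BREAK.sub(r"\1 ", text)

# Hypothetical sample: two wrapped lines, three real paragraph ends.
sample = 'It was a dark and\nstormy night.\n"Who is\nthere?"\nNobody answered.\n'
print(join_broken_lines(sample))
```

Line breaks after "and" and "is" are joined with a space, while the ones after '.' and '"' stay. As written, the pattern also treats a newline itself as "a character not in the list", so empty lines between paragraphs get merged as well; adding \n to the list would avoid that if your text uses blank lines.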



You see, most of the line breaks that do not follow one of the characters [.?!"'] are not at the end of a paragraph.
This is very quick and dirty, but it can clean unwanted line breaks out of an OCRed book with something like 99% accuracy.

Regular expressions can look very intimidating if you just look at a complex one, but they are well worth learning. Calibre and many other advanced tools support them, and you can start with very simple ones and gradually write more and more complex REs. They will still be relatively difficult to read, because the metacharacter set is very dense so that they can fit inside the "search" and "replace" fields, but they become much easier to write after a bit of practice.