Old 12-30-2007, 08:03 PM   #9
dstampe
Posts: 50
Karma: 17
Join Date: Jan 2007
Location: Canada
Device: Sony PRS-500
I have been working on some macros in Word, and they seem to do a fairly good job of splicing text, though they are somewhat crippled by the lack of full regular expressions in Word's search and replace.

In these examples, I have used "_" for a space and "\" for a backslash. I am just giving the find/replace strings, and am not going to do macro code examples. Someone else can test these on the current version of Word if they want, then summarize the macros. This is just to pass on the ideas.

The basic sequence is:

1) Ensure all paragraph marks are cleaned, so that these can be used in wildcard search (using the ^13 code):
Wildcards OFF:
Find: ^p
Replace: ^p

2) Clean spaces from beginning and ends of lines:
Wildcards OFF
Find: ^w^p
Replace: ^p
Find: ^p^w
Replace: ^p
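
For anyone working outside Word, here is a rough Python sketch of the same two cleanup steps. This is not Word macro code. It uses Python's `re` module instead, and the function name `trim_line_edges` is made up for illustration. It assumes that "^w" only needs to cover spaces and tabs, and that the text is already loaded as a plain string.

```python
import re


def trim_line_edges(text: str) -> str:
    """Normalise line endings, then strip spaces/tabs around each line break."""
    # Rough analogue of step 1: make every line break a single "\n"
    # so the later patterns only have one kind of break to deal with.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rough analogue of step 2: "^w^p" -> "^p" and "^p^w" -> "^p" in one pass.
    # Assumption: only spaces and tabs count as whitespace here.
    return re.sub(r"[ \t]*\n[ \t]*", "\n", text)


if __name__ == "__main__":
    print(repr(trim_line_edges("one  \r\n  two\t\nthree")))
```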

Then clean up any unwanted blank lines, headers, footers, etc. Some books may also have quotes moved onto separate lines; these need to be merged onto the proper line as well.
It is also a good idea to remove hyphens at the end of lines (this should be done one by one):

1) Remove hyphens at end of lines (use interactive replace, check that text AFTER hyphen is not a full word)
Wildcards ON
Find: -^13{1,3}([a-z])
Replace: \1

2) Remove any dangling quotes (may be uncommon). Note this is crippled because Word cannot search for "zero or more" of a search item:
Wildcards OFF
Find: ^p"^p
Replace: "^p
or
Wildcards ON
Find: ^13"^13
Replace: "^p

3) Headers and footers can be a problem. One idea is to look for isolated lines with blank lines before and after, with numbers in them. This example looks for a line with a length of up to 60 characters. It uses "[!^13]" rather than "?" to force it to look at a single line. You can add matching for a number "<[0-9]@>" before or after the "[!^13]{1,60}" item. Another alternative is to check for capitalized letters "[A-Z]{5,}" somewhere in the line.
Of course, the replace here needs to be done interactively. It's a pain in Word sometimes, as the top of the found text is usually off the top of the display:

Wildcards ON
Find: ^13{2,6}[!^13]{1,60}^13{2,6}
Replace: <nothing>
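
The three cleanup patterns above translate fairly directly to ordinary regular expressions. Below is a minimal Python sketch, not Word macro code, with hypothetical function names, and it assumes line breaks are already plain "\n". Note that `join_hyphenated` replaces blindly, so it will also join real compounds such as "well-" / "known". The interactive check described above still applies. The header search only lists candidates for review and deletes nothing.

```python
import re

# Hyphen, 1-3 line breaks, then a lowercase letter (Word: -^13{1,3}([a-z])).
HYPHEN_BREAK = re.compile(r"-\n{1,3}([a-z])")
# A line of 1-60 characters with blank lines on both sides
# (Word: ^13{2,6}[!^13]{1,60}^13{2,6}). The trailing breaks are a lookahead
# so that two candidates in a row can both be found.
ISOLATED_LINE = re.compile(r"\n{2,6}([^\n]{1,60})(?=\n{2,6})")


def join_hyphenated(text: str) -> str:
    """Drop each end-of-line hyphen and join the two halves of the word."""
    return HYPHEN_BREAK.sub(r"\1", text)


def merge_dangling_quotes(text: str) -> str:
    """Move a double quote that sits alone on a line up to the previous line."""
    return re.sub(r'\n"\n', '"\n', text)


def header_candidates(text: str) -> list:
    """List short isolated lines containing a number or 5+ capitals in a row."""
    hits = []
    for match in ISOLATED_LINE.finditer(text):
        line = match.group(1)
        # Rough analogues of "<[0-9]@>" and "[A-Z]{5,}".
        if re.search(r"\b[0-9]+\b", line) or re.search(r"[A-Z]{5,}", line):
            hits.append(line)
    return hits


if __name__ == "__main__":
    sample = "end of page.\n\nCHAPTER SEVEN 112\n\nthe next page starts here"
    for line in header_candidates(sample):
        print("possible header/footer:", line)
```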


Then the workhorse joining can be done:

1) join line with lowercase at start to previous line:
Wildcards ON
Find: ^13([a-z])
Replace: _\1

2) join line with lowercase at end to next line:
Wildcards ON
Find: ([a-z])^13
Replace: \1_

3) join line with comma at end to next line:
Wildcards ON
Find: ,^13
Replace: ,_
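
As a rough illustration, the same three joins can be written with Python's `re` module. Again this is only a sketch under assumptions, not Word macro code: the name `join_lines` is made up, and line breaks are assumed to be plain "\n". The rules run in the same order as the steps above.

```python
import re


def join_lines(text: str) -> str:
    """Apply the three joining rules in order and return the joined text."""
    # 1) A line that starts with a lowercase letter joins the previous line.
    text = re.sub(r"\n([a-z])", r" \1", text)
    # 2) A line that ends with a lowercase letter joins the next line.
    text = re.sub(r"([a-z])\n", r"\1 ", text)
    # 3) A line that ends with a comma joins the next line.
    text = re.sub(r",\n", ", ", text)
    return text


if __name__ == "__main__":
    sample = "The quick brown\nfox jumped over the\nLazy Dog, and then\nRan away.\nNew paragraph."
    print(join_lines(sample))
```

Lines ending in a full stop are left alone, so in the sample the break before "New paragraph." survives.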

These simple replacements handle most books pretty well. Most other cases are rare in practice, and are ambiguous unless quotes are taken into account. The longer the lines of text are, the fewer the errors.