Help...seek and destroy with MS Word

valkyriesound · 11-06-2006, 08:24 AM

I've got an RTF that I converted from a PDF... the PDF had annoying headers and footers on every page that I can't seem to get rid of. They're the file name "c://my%docs%....etc"

When I OCR the pdf to rtf these headers and footers become part of the text.
I used MS Word to "find" and delete most of this stuff.... but I can't figure out how to make it "find" the part of the footer that has all the page numbers because they're different on each page. Example (1 of 100) (45 of 100)
Is there a way to make word find (1 of x) and eliminate it?

Or... if you know how to remove these headers from the PDF. (They don't show up in the headers options area)

Thanks!

Bob Russell · 11-06-2006, 08:40 AM

I've found similar problems when I do a "save as text" from Adobe Acrobat Reader, but I get page numbers and other artifacts like extra blank lines between pages and sometimes the very first letter of a chapter goes to the bottom of the page.

If it's just text, and the anomalies follow a consistent pattern, and you know a little *nix, maybe that can be done with the std text pattern tools. But, of course, that would be inaccessible if you don't know it well enough (like me).

valkyriesound · 11-06-2006, 10:04 AM

Quote:

Originally Posted by Bob Russell

If it's just text, and the anomalies follow a consistent pattern, and you know a little *nix, maybe that can be done with the std text pattern tools. But, of course, that would be inaccessible if you don't know it well enough (like me).

I'm new... what's *nix and std text pattern tools? Links?

Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?

Antartica · 11-06-2006, 10:13 AM

Quote:

Originally Posted by valkyriesound

Example (1 of 100) (45 of 100)
Is there a way to make word find (1 of x) and eliminate it?

Not with Word (I'm addicted to UNIX), but if you don't mind installing a small program, here is a way to do it:

1. Install the Unix utils for Windows from:
http://unxutils.sourceforge.net/

2. Copy your file to C:\

3. After installing the utilities, open a command line

4. Go to the root directory with
> cd \
C:\>

5. What you are asking for (removing that text from yourfile.rtf and writing the result to result.rtf) is done the following way:
c:\> sed "s/([0-9][0-9]* of [0-9][0-9]*)//g" yourfile.rtf > result.rtf

Hope that works for you :-).

Alternative: install winvi from http://www.winvi.de/en/download.html and open your file, then type ":%s/([0-9][0-9]* of [0-9][0-9]*)//g" (without the quotes) and press ENTER; it will delete the offending parts of your file.

NOTE: if it doesn't work, it will surely be because of the parentheses; try again, replacing each parenthesis with a dot (which matches any single character):
c:\> sed "s/.[0-9][0-9]* of [0-9][0-9]*.//g" file.rtf > result.rtf

Hope that works for you ;-).
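
For anyone who would rather script this than install sed, here is a minimal Python sketch of the same substitution. It is only an illustration under assumptions: the names PAGE_MARKER and strip_page_markers are made up for this example, and the sample text is invented. Unlike sed's basic syntax, Python needs the parentheses escaped to match them literally.

```python
import re

# Literal "(", one or more digits, " of ", one or more digits, literal ")".
PAGE_MARKER = re.compile(r"\([0-9]+ of [0-9]+\)")


def strip_page_markers(text: str) -> str:
    """Remove every "(N of M)" marker and leave the rest of the text alone."""
    return PAGE_MARKER.sub("", text)


if __name__ == "__main__":
    # Invented sample text, standing in for the contents of yourfile.rtf.
    sample = "end of page one (1 of 100)\nstart of page two (45 of 100)\n"
    print(strip_page_markers(sample))
```

As with the sed command, only the marker itself is removed; any spaces around it stay in the text.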

igorsk · 11-06-2006, 10:22 AM

In Word, turn on "Use wildcards" and use the following pattern:
<[0-9]@ of [0-9]@>

Antartica · 11-06-2006, 10:34 AM

Quote:

Originally Posted by valkyriesound

I'm new... what's *nix and std text pattern tools? Links?

Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?

Again, using sed:

C:\> sed "/([0-9][0-9]* of [0-9][0-9]*)/d" original.rtf > modified.rtf

That command deletes the entire line containing those characters. Beware: sed operates on the file as plain text, not as it appears formatted in a word processor, so you should open the file with Notepad (or another text editor without RTF support) to check whether those (nn of mm) markers are alone on a line.
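
The same "delete the whole line" idea can be sketched in Python. This is a rough equivalent, not a replacement for the sed command; drop_marker_lines and PAGE_MARKER are hypothetical names chosen for the example.

```python
import re

# Same "(N of M)" marker as in the sed pattern, with the parentheses escaped.
PAGE_MARKER = re.compile(r"\([0-9]+ of [0-9]+\)")


def drop_marker_lines(text: str) -> str:
    """Drop every line that contains a "(N of M)" marker, keep all others."""
    kept = [
        line
        for line in text.splitlines(keepends=True)  # keep each line's newline
        if not PAGE_MARKER.search(line)
    ]
    return "".join(kept)


if __name__ == "__main__":
    print(drop_marker_lines("first line\n(1 of 100)\nsecond line\n"))
```

The same warning applies here: a line is dropped even if it holds real text next to the marker, so check the raw file first.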

More info on what can be done with sed:
http://www.student.northpark.edu/pem...d/sed1line.txt

The things to search for are specified as "regular expressions", a special syntax for describing patterns of text (optionally with substitutions). There are plenty of tutorials for regular expressions, for example:
http://aspn.activestate.com/ASPN/doc...gex-intro.html

And a reference page for advanced users:
http://anaturb.net/sed.htm

slayda · 11-06-2006, 11:18 AM

Quote:

Originally Posted by igorsk

In Word, turn on "Use wildcards" and use the following pattern:
<[0-9]@ of [0-9]@>

This will find page numbers in the form:

"number1 of number2".

However if you are looking for another page number format (e.g. "number" or "number1/number2") you will need to modify the search accordingly.

for "number" use <[0-9]{1,5}> which will find all occurrences of numbers of 5 or fewer digits
for "number1/number2" use <[0-9]@/[0-9]@>
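
For comparison, here is a rough Python translation of these three searches, offered as a sketch only. It assumes Word's "<" and ">" (start and end of a word) behave roughly like the regex word boundary \b and that "@" corresponds to "+"; the constant and function names are invented for the example.

```python
import re

# Approximate translations of the Word wildcard searches (an assumption, not
# an exact equivalence): "<" and ">" become \b, and "@" becomes "+".
N_OF_M = re.compile(r"\b[0-9]+ of [0-9]+\b")      # "number1 of number2"
N_SLASH_M = re.compile(r"\b[0-9]+/[0-9]+\b")      # "number1/number2"
SHORT_NUMBER = re.compile(r"\b[0-9]{1,5}\b")      # a number of 1 to 5 digits


def find_page_numbers(pattern: re.Pattern, text: str) -> list[str]:
    """Return every non-overlapping match of the pattern in the text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_page_numbers(N_OF_M, "(1 of 100) and (45 of 100)"))
    print(find_page_numbers(N_SLASH_M, "page 3/12 here"))
    print(find_page_numbers(SHORT_NUMBER, "7 and 12345 but not 1234567"))
```

Note that the bare "number" search is risky in any tool, since it also matches years, quantities and other digits in the body text.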

valkyriesound · 11-06-2006, 12:05 PM

Wow! You guys have crazy knowledge!

I'll try those suggestions out tonight..

I love when I find good forums like this!

You folks ROCK!

Valky

kacir · 11-06-2006, 01:36 PM

Quote:

Originally Posted by valkyriesound

I'm new... what's *nix and std text pattern tools? Links?

like sed, awk, emacs, vim, vi, grep, and countless others
There are versions for Windows for each of those tools. My favourite tool is vim
www.vim.org
Vim is an extremely powerful text editor. It uses regular expressions. Regular expressions are what all those text processing *nix tools have in common.
Regular expressions are a kind of language used to describe "patterns" in text
See http://en.wikipedia.org/wiki/Regular_expressions
Vim has a steep learning curve, but it is the most powerful tool in the hands of an expert.
If you want to try regular expressions, and you do not wish to learn the basics of vim, try the Windows editor TextPad www.textpad.com

Quote:

Originally Posted by valkyriesound

Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?

It is very simple.
Open file in vim
press Esc
type
:%substitute/  */ /g
or in short
:%s/  */ /g
^ that is :%s slash space space star slash space slash g
: means - this is an "ed" command. Ed is an ancient Unix editor. Its stream-oriented version, sed, is one of the most used text parsing tools ever.
% means apply the following command (substitute in this example) to all lines, not just the current line. % is an address. There are many kinds of addresses you can use for a command. This is one of the features that makes vim *SO* powerful.
g means - do not replace just the first occurrence of the pattern but all occurrences on the line

In textpad
from menu start "find and replace" dialog
search pattern
  *
^ that is space space star
replace pattern

^ that is one space
and check "use regular expressions" check box
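
The same substitution can be sketched in Python, using the identical pattern (one space followed by zero or more further spaces, replaced by a single space). This is only an illustration; collapse_spaces is a name made up for the example.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of one or more spaces with exactly one space."""
    # "  *" is a space followed by zero or more spaces, as in the vim command.
    return re.sub(r"  *", " ", text)


if __name__ == "__main__":
    print(collapse_spaces("too   many    spaces  here"))
```

Only space characters are touched; tabs and line breaks are left as they are.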

if you want to learn vim, just install it, start it and type
Esc
:help

Vim has the best documentation I have *ever* seen in a program.

valkyriesound · 11-06-2006, 08:56 PM

OK...

Got those numbers out!

There must be an easy way of getting rid of this look in the text:

"street torches as well as the somewhat brighter light of the moon.
Sham knelt
where she was, watching the dark mansion intently for movement that might
indicate someone was inside."

??

Antartica · 11-07-2006, 02:51 AM

Quote:

Originally Posted by valkyriesound

There must be an easy way of getting rid of this look in the text:

"street torches as well as the somewhat brighter light of the moon.
Sham knelt
where she was, watching the dark mansion intently for movement that might
indicate someone was inside."

Again with sed, if that text is in mytext.txt:

C:\> sed ":start;/[a-zA-Z] *$/{N;s/ *\n/ /g;b start;}" mytext.txt > result.txt

That joins the lines that end with an alphabetical character with the next one.
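
Here is a rough Python sketch of the same joining logic, in case sed is not at hand. It follows the rule described above (a line ending in a letter, optionally followed by spaces, is joined to the next line); the names ENDS_WITH_LETTER and join_broken_lines are invented for this example.

```python
import re

# A line "ends with a letter" if a letter is followed only by optional spaces.
ENDS_WITH_LETTER = re.compile(r"[a-zA-Z] *$")


def join_broken_lines(text: str) -> str:
    """Join each line that ends with a letter to the line that follows it."""
    finished = []
    current = None  # the line being assembled, or None if nothing is pending
    for line in text.splitlines():
        if current is None:
            current = line
        else:
            # Replace trailing spaces plus the line break with a single space.
            current = current.rstrip(" ") + " " + line
        if not ENDS_WITH_LETTER.search(current):
            finished.append(current)
            current = None
    if current is not None:  # last line ended with a letter, nothing to join
        finished.append(current)
    return "\n".join(finished)


if __name__ == "__main__":
    broken = (
        "street torches as well as the somewhat brighter light of the moon.\n"
        "Sham knelt\n"
        "where she was, watching the dark mansion intently for movement that might\n"
        "indicate someone was inside."
    )
    print(join_broken_lines(broken))
```

Like the sed version, this leaves lines ending in a comma, hyphen or other punctuation unjoined, and it will also join a heading to the paragraph after it, so the result still needs a read-through.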

NOTE: this is the first time I've used flow control in sed. Nice :-). I used the following manual to construct that command:
http://www.grymoire.com/Unix/Sed.html

slayda · 11-07-2006, 11:32 AM

You may also want to see Bob's post;

"Here's the process I'm using...
https://www.mobileread.com/forums/sh...40958#post40958 "
