Old 11-06-2006, 08:24 AM   #1
valkyriesound
Connoisseur
Posts: 61
Karma: 108
Join Date: Oct 2006
Location: LA,CA
Help...seek and destroy with MS Word

I've got an RTF that I converted from a PDF... the PDF had annoying headers and footers on every page that I can't seem to get rid of. They're the file name "c://my%docs%....etc"

When I OCR the pdf to rtf these headers and footers become part of the text.
I used MS Word to "find" and delete most of this stuff.... but I can't figure out how to make it "find" the part of the footer that has all the page numbers because they're different on each page. Example (1 of 100) (45 of 100)
Is there a way to make word find (1 of x) and eliminate it?

Or... maybe you know how to remove these headers from the PDF? (They don't show up in the headers options area.)

Thanks!
Old 11-06-2006, 08:40 AM   #2
Bob Russell
Recovering Gadget Addict
Posts: 5,312
Karma: 590687
Join Date: May 2004
Location: Pittsburgh, PA
Device: Note3, MacBook Air, iPad Air
I've run into similar problems when I do a "save as text" from Adobe Acrobat Reader: I get page numbers and other artifacts, like extra blank lines between pages, and sometimes the very first letter of a chapter ends up at the bottom of the page.

If it's just text, and the anomalies follow a consistent pattern, and you know a little *nix, maybe that can be done with the std text pattern tools. But, of course, that would be inaccessible if you don't know it well enough (like me).
Old 11-06-2006, 10:04 AM   #3
valkyriesound
Quote:
Originally Posted by Bob Russell

If it's just text, and the anomalies follow a consistent pattern, and you know a little *nix, maybe that can be done with the std text pattern tools. But, of course, that would be inaccessible if you don't know it well enough (like me).
I'm new... what's *nix and std text pattern tools? Links?

Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?
Old 11-06-2006, 10:13 AM   #4
Antartica
Evangelist
Posts: 415
Karma: 754
Join Date: Jun 2006
Location: Madrid, Spain
Device: iliad, onhandpc, newton, zaurus
Quote:
Originally Posted by valkyriesound
Example (1 of 100) (45 of 100)
Is there a way to make word find (1 of x) and eliminate it?
Not with Word (I'm addicted to UNIX), but if you don't mind installing a small program, here is a way to do it:

1. Install the Unix utils for Windows from:
http://unxutils.sourceforge.net/

2. Copy your file to C:\

3. After installing the utilities, open a command line

4. Go to the root directory with
> cd \
C:\>

5. What you are asking for (removing that text from yourfile.rtf and writing the result to result.rtf) is done the following way:
c:\> sed "s/([0-9][0-9]* of [0-9][0-9]*)//g" yourfile.rtf > result.rtf

Hope that works for you :-).

Alternative: install winvi from http://www.winvi.de/en/download.html and open your file, then type ":%s/([0-9][0-9]* of [0-9][0-9]*)//g" (without the quotes) and press ENTER. It will delete the offending parts of your file.

NOTE: if it doesn't work, it will almost certainly be because of the parentheses; try again, matching them with "." (any character) instead:
c:\> sed "s/.[0-9][0-9]* of [0-9][0-9]*.//g" file.rtf > result.rtf

Hope that works for you ;-).
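
For anyone who would rather script this than install sed, here is a minimal Python sketch of the same substitution. It is only an illustration: it assumes the text has already been read into a string, and the name strip_page_markers is made up for this example. Like sed, it works on the raw text of the file, not on the formatted document.

```python
import re

# Same idea as the sed pattern: "(", digits, " of ", digits, ")".
PAGE_MARKER = re.compile(r"\([0-9]+ of [0-9]+\)")


def strip_page_markers(text: str) -> str:
    """Remove every "(N of M)" marker and leave the rest of the text alone."""
    return PAGE_MARKER.sub("", text)


if __name__ == "__main__":
    sample = "end of one page (1 of 100)\nstart of another (45 of 100) page"
    print(strip_page_markers(sample))
```

Only the marker itself is removed, so any space that stood next to it stays in the text.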
Old 11-06-2006, 10:22 AM   #5
igorsk
Wizard
Posts: 3,443
Karma: 52235
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
In Word, turn on "Use wildcards" and use the following pattern:
<[0-9]@ of [0-9]@>
Old 11-06-2006, 10:34 AM   #6
Antartica
Quote:
Originally Posted by valkyriesound
I'm new... what's *nix and std text pattern tools? Links?

Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?
Again, using sed:

C:\> sed "/([0-9][0-9]* of [0-9][0-9]*)/d" original.rtf > modified.rtf

That command deletes the entire line containing those characters. Beware: sed operates on the file as plain text, not as it appears formatted in a word processor, so you should open the file with Notepad (or another text editor without RTF support) to check whether those (nn of mm) markers are alone on a line.
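
As a rough illustration of the same "delete the whole line" behaviour outside sed, here is a small Python sketch. It assumes the file contents are already in a string; drop_marker_lines is a hypothetical name, not part of any tool mentioned here.

```python
import re

# Same marker as before: "(", digits, " of ", digits, ")".
PAGE_MARKER = re.compile(r"\([0-9]+ of [0-9]+\)")


def drop_marker_lines(text: str) -> str:
    """Drop every line that contains a "(N of M)" marker, like sed's /pattern/d."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not PAGE_MARKER.search(line)
    ]
    return "".join(kept)


if __name__ == "__main__":
    print(drop_marker_lines("first line\n(1 of 100)\nsecond line\n"))
```

As with the sed command, a line that has real text next to the marker is dropped as well, which is why it is worth checking the raw file first.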

More info on what can be done with sed:
http://www.student.northpark.edu/pem...d/sed1line.txt

The things to search for are described with "regular expressions", a special syntax for specifying patterns of text (optionally with substitutions). There are plenty of tutorials for regular expressions, for example:
http://aspn.activestate.com/ASPN/doc...gex-intro.html

And a reference page for advanced users:
http://anaturb.net/sed.htm

Last edited by Antartica; 11-06-2006 at 10:37 AM.
Old 11-06-2006, 11:18 AM   #7
slayda
Retired & reading more!
Posts: 2,731
Karma: 884247
Join Date: Sep 2006
Location: North Alabama, USA
Device: Kindle 1, iPad 4, iPhone 5
Quote:
Originally Posted by igorsk
In Word, turn on "Use wildcards" and use the following pattern:
<[0-9]@ of [0-9]@>
This will find page numbers in the form:

"number1 of number2".

However, if you are looking for another page number format (e.g. "number" or "number1/number2") you will need to modify the search accordingly.
  1. for "number" use <[0-9]{1,5}> which will find all occurrences of numbers of 5 or fewer digits
  2. for "number1/number2" use <[0-9]@/[0-9]@>
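
If it helps to compare notations, here is a rough Python sketch of the three page-number formats discussed above. It rests on two assumptions: Word's "<" and ">" (start and end of word) are approximated with the regex word boundary \b, and "@" (one or more) becomes "+". The names PATTERNS and find_page_numbers are invented for the example, and the match is not guaranteed to agree with Word in every edge case.

```python
import re

# Approximate regex counterparts of the Word wildcard patterns (assumed mapping).
PATTERNS = {
    "N of M": re.compile(r"\b[0-9]+ of [0-9]+\b"),
    "N/M": re.compile(r"\b[0-9]+/[0-9]+\b"),
    "N": re.compile(r"\b[0-9]{1,5}\b"),  # whole numbers of 1 to 5 digits
}


def find_page_numbers(text: str, style: str) -> list:
    """Return every match of the chosen page-number style in the text."""
    return PATTERNS[style].findall(text)


if __name__ == "__main__":
    sample = "page 3 of 10, then 12/40"
    print(find_page_numbers(sample, "N of M"))
    print(find_page_numbers(sample, "N/M"))
```

The bare "N" style matches any short number in the text, not just page numbers, so it is best reviewed match by match before deleting anything.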
Old 11-06-2006, 12:05 PM   #8
valkyriesound
Wow! You guys have crazy knowledge!

I'll try those suggestions out tonight..

I love when I find good forums like this!

You folks ROCK!

Valky

Last edited by valkyriesound; 11-06-2006 at 12:08 PM.
Old 11-06-2006, 01:36 PM   #9
kacir
Wizard
Posts: 2,679
Karma: 2799391
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
*nix tools for processing text

Quote:
Originally Posted by valkyriesound
I'm new... what's *nix and std text pattern tools? Links?
like sed, awk, emacs, vim, vi, grep, and countless others
There are versions for Windows for each of those tools. My favourite tool is vim
www.vim.org
Vim is an extremely powerful text editor. It uses regular expressions. Regular expressions are what all those text processing *nix tools have in common.
Regular expressions are a kind of language used to describe "patterns" in text
See http://en.wikipedia.org/wiki/Regular_expressions
Vim has a steep learning curve, but it is the most powerful tool in the hands of an expert.
If you want to try regular expressions, and you do not wish to learn the basics of vim, try the Windows editor TextPad www.textpad.com

Quote:
Originally Posted by valkyriesound
Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?
It is very simple.
Open the file in vim
press Esc
type
:%substitute/  */ /g
or in short
:%s/  */ /g
^ that is :%s slash space space star slash space slash g
: means - this is an "ed" command. Ed is an ancient Unix editor. Its stream-oriented version, sed, is one of the most used text parsing tools ever.
% means - apply the following command (substitute in this example) to all lines, not just the current line. % is an address. There are many kinds of addresses you can use for a command. This is one of the features that makes vim *SO* powerful.
g means - do not replace just the first occurrence of the pattern but all occurrences on the line


In TextPad
from the menu, open the "find and replace" dialog
search pattern
  *
^ that is space space star
replace pattern
 
^ that is one space
and check the "use regular expressions" check box
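
The substitution described above (a space followed by any number of further spaces, replaced with a single space) can also be sketched in Python. This is just an illustration that assumes the text is already in a string; collapse_spaces is a made-up name.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of one or more spaces with a single space."""
    # "  *" is one literal space followed by zero or more spaces,
    # i.e. the "space space star" pattern.
    return re.sub(r"  *", " ", text)


if __name__ == "__main__":
    print(collapse_spaces("too   many    spaces  here"))
```

Only space characters are touched; tabs and line breaks are left as they are.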


if you want to learn vim, just install it, start it and type
Esc
:help

Vim has the best documentation I have *ever* seen in a program.
Old 11-06-2006, 08:56 PM   #10
valkyriesound
OK...

Got those numbers out!

There must be an easy way of getting rid of this look in the text:

"street torches as well as the somewhat brighter light of the moon.
Sham knelt
where she was, watching the dark mansion intently for movement that might
indicate someone was inside."

??
Old 11-07-2006, 02:51 AM   #11
Antartica
Quote:
Originally Posted by valkyriesound
There must be an easy way of getting rid of this look in the text:

"street torches as well as the somewhat brighter light of the moon.
Sham knelt
where she was, watching the dark mansion intently for movement that might
indicate someone was inside."
Again with sed, if that text is in mytext.txt:

C:\> sed ":start;/[a-zA-Z] *\$/{N;s/ *\n/ /g;b start;}" mytext.txt > result.txt

That joins each line that ends with a letter (optionally followed by spaces) to the line after it.

NOTE: this is the first time I've used flow control in sed. Nice :-). I used the following manual to construct that command:
http://www.grymoire.com/Unix/Sed.html
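
For comparison, the line-joining idea (a line that ends with a letter is glued to the next one) can be sketched in Python as below. This is a hedged illustration, not a drop-in replacement for the sed command: it assumes the text is already in a string, the name join_wrapped_lines is invented, and the result is returned without a trailing newline.

```python
import re

# A line "ends with a letter" if a letter is followed only by optional spaces.
ENDS_WITH_LETTER = re.compile(r"[a-zA-Z] *$")


def join_wrapped_lines(text: str) -> str:
    """Join each line that ends with a letter to the line that follows it."""
    joined = []
    pending = None  # a line still waiting for its continuation
    for line in text.splitlines():
        if pending is None:
            current = line
        else:
            # Drop trailing spaces of the unfinished line, then add one space.
            current = pending.rstrip(" ") + " " + line
        if ENDS_WITH_LETTER.search(current):
            pending = current
        else:
            joined.append(current)
            pending = None
    if pending is not None:
        joined.append(pending)
    return "\n".join(joined)


if __name__ == "__main__":
    sample = (
        "street torches as well as the somewhat brighter light of the moon.\n"
        "Sham knelt\n"
        "where she was, watching the dark mansion intently for movement that might\n"
        "indicate someone was inside."
    )
    print(join_wrapped_lines(sample))
```

Lines that end with punctuation are treated as complete, so a sentence that happens to end exactly at a line break stays on its own line.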
Old 11-07-2006, 11:32 AM   #12
slayda
You may also want to see Bob's post:

"Here's the process I'm using...
http://www.mobileread.com/forums/sh...40958#post40958 "