#1
Connoisseur
Posts: 61
Karma: 108
Join Date: Oct 2006
Location: LA,CA
I've got an RTF that I converted from a PDF. The PDF had annoying headers and footers on every page that I can't seem to get rid of. They're the file name, "c://my%docs%....etc".
When I OCR the PDF to RTF, these headers and footers become part of the text. I used MS Word's "Find" to delete most of this stuff, but I can't figure out how to make it find the part of the footer that has the page numbers, because they're different on each page. Example: (1 of 100) (45 of 100). Is there a way to make Word find (x of 100) and eliminate it? Or, if you know how to remove these headers from the PDF itself, that would work too. (They don't show up in the headers options area.) Thanks!

#2
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
I've found similar problems when I do a "save as text" from Adobe Acrobat Reader: I get page numbers and other artifacts, like extra blank lines between pages, and sometimes the very first letter of a chapter goes to the bottom of the page.
If it's just text, and the anomalies follow a consistent pattern, and you know a little *nix, maybe that can be done with the standard text pattern tools. But, of course, that would be inaccessible if you don't know it well enough (like me).

#3
Connoisseur
Posts: 61
Karma: 108
Join Date: Oct 2006
Location: LA,CA
Quote:
Yeah... anyone know a quick way to get rid of those extra blank spaces it adds?

#4
Evangelist
Posts: 423
Karma: 1517132
Join Date: Jun 2006
Location: Madrid, Spain
Device: quaderno, remarkable2, yotaphone2, prs950, iliad, onhandpc, newton
Quote:
1. Install the Unix utils for Windows from: http://unxutils.sourceforge.net/
2. Copy your file to C:\
3. After installing the utilities, open a command line.
4. Go to the root directory with:
   > cd \
   C:\>
5. What you are asking for (removing that text from yourfile.rtf and writing result.rtf) is done the following way:
   C:\> sed "s/([0-9][0-9]* of [0-9][0-9]*)//g" yourfile.rtf > result.rtf

Hope that works for you :-).

Alternative: install winvi from http://www.winvi.de/en/download.html and open your file, then type ":%s/([0-9][0-9]* of [0-9][0-9]*)//g" without the quotes and press ENTER. It will delete the offending parts of your file.

NOTE: if it doesn't work, it will surely be because of the parentheses; try again with:
   C:\> sed "s/.[0-9][0-9]* of [0-9][0-9]*.//g" file.rtf > result.rtf

Hope that works for you ;-).
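
For anyone who would rather not install sed, the same substitution can be sketched in Python. This is only an illustration, not something from the thread: Python itself is an assumption, and the names `PAGE_MARKER` and `strip_page_markers` are made up for the example. It assumes the markers look exactly like "(N of M)" with plain digits.

```python
import re

# Assumed marker format: "(N of M)", where N and M are runs of digits.
# In Python's regex syntax the parentheses must be escaped to be literal.
PAGE_MARKER = re.compile(r"\([0-9]+ of [0-9]+\)")


def strip_page_markers(text: str) -> str:
    """Remove every "(N of M)" marker and leave the rest of the text as it is."""
    return PAGE_MARKER.sub("", text)


if __name__ == "__main__":
    sample = "end of one page (1 of 100)\nstart of another (45 of 100) page"
    print(strip_page_markers(sample))
```

Like the sed command, this removes only the marker itself, so any spaces or empty lines around it stay behind.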

#5
Wizard
Posts: 3,442
Karma: 300001
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
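In Word, turn on "Use wildcards" and use the following pattern:
<[0-9]@ of [0-9]@>
Here < and > mark the start and end of a word, and @ means "one or more of the preceding character". The pattern matches the "1 of 100" part only, so the surrounding parentheses are left behind and need a second, ordinary search for "()".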

#6
Evangelist
Posts: 423
Karma: 1517132
Join Date: Jun 2006
Location: Madrid, Spain
Device: quaderno, remarkable2, yotaphone2, prs950, iliad, onhandpc, newton
Quote:
C:\> sed "/([0-9][0-9]* of [0-9][0-9]*)/d" original.rtf > modified.rtf

That command deletes the entire line containing those characters. Beware: sed operates on the file read as text, not as formatted in a word processor, so you should open the file with Notepad (or another text editor without RTF support) to see whether those (nn of mm) markers are alone on a line.

More info on what can be done with sed: http://www.student.northpark.edu/pem...d/sed1line.txt

The things to search for are specified as "regular expressions", a special syntax for describing patterns of text (optionally with substitutions). There are plenty of tutorials for regular expressions, for example: http://aspn.activestate.com/ASPN/doc...gex-intro.html

And a reference page for advanced users: http://anaturb.net/sed.htm

Last edited by Antartica; 11-06-2006 at 10:37 AM.
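
The line-deleting variant can be sketched the same way in Python, for comparison. Again this is an illustration under assumptions: the function name `drop_marker_lines` is hypothetical, and the marker is assumed to be "(N of M)" with plain digits.

```python
import re

# Assumed marker format: "(N of M)", where N and M are runs of digits.
PAGE_MARKER = re.compile(r"\([0-9]+ of [0-9]+\)")


def drop_marker_lines(text: str) -> str:
    """Drop every line that contains a "(N of M)" marker; keep all other lines."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not PAGE_MARKER.search(line)
    ]
    return "".join(kept)


if __name__ == "__main__":
    sample = "First paragraph.\n(1 of 100)\nSecond paragraph.\n"
    print(drop_marker_lines(sample), end="")
```

The same warning applies as for sed: a line that holds real text next to the marker is dropped whole, so check the raw file first.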

#7
Retired & reading more!
Posts: 2,764
Karma: 1884247
Join Date: Sep 2006
Location: North Alabama, USA
Device: Kindle 1, iPad Air 2, iPhone 6S+, Kobo Aura One
Quote:
"number1 of number2". However if you are looking for another page number format (e.g. "number" or "number1/number2") you will need to modify the search accordingly.
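
As a rough sketch of what "modify the search accordingly" could look like, here is one pattern per format in Python. The table `PAGE_FORMATS`, its keys and the function `strip_page_numbers` are hypothetical names for this example, and the patterns are guesses at common footer layouts, not something taken from the original file.

```python
import re

# Hypothetical pattern table; adjust it to what the footer really looks like.
PAGE_FORMATS = {
    # "12 of 100", with or without surrounding parentheses
    "n of m": re.compile(r"\(?[0-9]+ of [0-9]+\)?"),
    # "12/100", with or without surrounding parentheses
    "n/m": re.compile(r"\(?[0-9]+/[0-9]+\)?"),
    # A bare number is only treated as a page number when it is alone on its line.
    "n": re.compile(r"^[ \t]*[0-9]+[ \t]*$", re.MULTILINE),
}


def strip_page_numbers(text: str, fmt: str) -> str:
    """Remove page numbers written in the chosen format ("n of m", "n/m" or "n")."""
    return PAGE_FORMATS[fmt].sub("", text)


if __name__ == "__main__":
    print(strip_page_numbers("some text (3/10) more text", "n/m"))
```

The looser the format, the more likely it is to hit real text: "n/m" will also match things like dates written as 11/06, so it is worth checking the matches before deleting them.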

#8
Connoisseur
Posts: 61
Karma: 108
Join Date: Oct 2006
Location: LA,CA
Wow! You guys have crazy knowledge!
I'll try those suggestions out tonight. I love when I find good forums like this! You folks ROCK!
Valky
Last edited by valkyriesound; 11-06-2006 at 12:08 PM.

#9
Wizard
Posts: 3,463
Karma: 10684861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
*nix tools for processing text
Quote:
There are Windows versions of each of those tools. My favourite tool is vim: www.vim.org

Vim is an extremely powerful text editor. It uses regular expressions, which are what all those *nix text-processing tools have in common. Regular expressions are a kind of language used to describe "patterns" in text. See http://en.wikipedia.org/wiki/Regular_expressions

Vim has a steep learning curve, but it is the most powerful tool in the hands of an expert. If you want to try regular expressions and do not wish to learn the basics of vim, try the Windows editor TextPad: www.textpad.com

Quote:
Open the file in vim, press Esc, and type
:%substitute/  */ /g
or, in short,
:%s/  */ /g
(that is: :%s, slash, space, space, star, slash, space, slash, g)

: means this is an "ed" command. Ed is an ancient Unix editor. Its stream-oriented version, sed, is one of the most used text-parsing tools ever.
% means apply the following command (substitute, in this example) to all lines, not just the current line. % is an address. There are many kinds of addresses you can use for a command. This is one of the features that makes vim *SO* powerful.
g means do not replace just the first occurrence of the pattern, but all occurrences on the line.

In TextPad, open the "find and replace" dialog from the menu and enter:
search pattern: "  *" (that is: space, space, star, without the quotes)
replace pattern: " " (that is: one space, without the quotes)
and check the "use regular expressions" check box.

If you want to learn vim, just install it, start it, and type Esc :help
Vim has the best documentation I have *ever* seen in a program.
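
The same "space, space, star" pattern can be tried outside an editor. The Python sketch below is only an illustration, and the function name `collapse_spaces` is made up for it. The pattern is one space followed by zero or more spaces, so it matches any run of spaces and replaces it with a single one.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace each run of one or more spaces with a single space."""
    # "  *" is: one space, then zero or more further spaces.
    return re.sub(r"  *", " ", text)


if __name__ == "__main__":
    print(collapse_spaces("too    many   spaces  here"))
```

Only the space character is touched; tabs and line breaks are left alone, which matches what the vim and TextPad patterns do.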

#10
Connoisseur
Posts: 61
Karma: 108
Join Date: Oct 2006
Location: LA,CA
OK...
Got those numbers out! There must be an easy way of getting rid of this look in the text: "street torches as well as the somewhat brighter light of the moon. Sham knelt where she was, watching the dark mansion intently for movement that might indicate someone was inside." ??

#11
Evangelist
Posts: 423
Karma: 1517132
Join Date: Jun 2006
Location: Madrid, Spain
Device: quaderno, remarkable2, yotaphone2, prs950, iliad, onhandpc, newton
Quote:
C:\> sed ":start;/[a-zA-Z] *\$/{N;s/ *\n/ /g;b start;}" mytext.txt > result.txt

That joins each line that ends with an alphabetical character to the next one.

NOTE: it's the first time that I've used flow control in sed. Nice :-). I used the following manual to construct that command: http://www.grymoire.com/Unix/Sed.html
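
The joining rule can also be written out step by step, which may make the sed loop easier to follow. This Python sketch is an illustration only: the names `ENDS_WITH_LETTER` and `join_broken_lines` are hypothetical, and it assumes, like the sed command, that a line ending in a letter (optionally followed by spaces) was broken mid-sentence.

```python
import re

# A line counts as "broken" when it ends with a letter, optionally followed by spaces.
ENDS_WITH_LETTER = re.compile(r"[a-zA-Z] *$")


def join_broken_lines(text: str) -> str:
    """Join each line that ends with a letter to the line that follows it."""
    joined = []
    current = None
    for line in text.splitlines():
        if current is None:
            current = line
        elif ENDS_WITH_LETTER.search(current):
            # Swap the trailing spaces and the line break for a single space.
            current = current.rstrip(" ") + " " + line
        else:
            joined.append(current)
            current = line
    if current is not None:
        joined.append(current)
    return "\n".join(joined)


if __name__ == "__main__":
    sample = "as well as the somewhat brighter\nlight of the moon. Sham knelt\nwhere she was.\nNext line."
    print(join_broken_lines(sample))
```

Lines that end in punctuation are treated as real ends of paragraphs, so a sentence that happens to end exactly at a line break stays unjoined. One untested guess about the sed command itself: if it joins nothing at the Windows prompt, the backslash before $ may be the cause, since cmd.exe does not remove it the way a Unix shell would.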

#12
Retired & reading more!
Posts: 2,764
Karma: 1884247
Join Date: Sep 2006
Location: North Alabama, USA
Device: Kindle 1, iPad Air 2, iPhone 6S+, Kobo Aura One
You may also want to see Bob's post:
"Here's the process I'm using... https://www.mobileread.com/forums/sh...40958#post40958"