01-28-2008, 09:54 AM   #5
Gideon
As to OCR - I actually do it in Windows on my Mac. I have used OmniPage (and it's amazing), but only through work, and I'm not terribly comfortable doing this at work.

Generally, I use either Adobe Acrobat's OCR (not the best) or the ABBYY FineReader OCR that came with the OpticBook.

Now, the good news: I found a way to do it.

This requires TextMate (many other "pure text" programs can probably do it too; this isn't something I know a lot about, but I was able to get help on the TextMate news list). It makes use of regular expressions, which, from what I gather, is exactly what kovidgoyal suggested (regexps), except now I have the commands needed.

A very kind man named Peter Cowan broke it down for me:
Quote:
Open the find window (command + F)
Set the find to the following (no extra spaces):
\n[0-9]{1,3}.+\n

Set replace to nothing

Be sure "Regular Expression" is checked. You can see if it's
highlighting the right part by clicking next, before replacing all.

I'll break it down for you:

\n -- match the new line before the number

[0-9]{1,3} -- match a number that is 1 to 3 digits long. If your book
has more pages than that, you'll need to change this. This also won't
work for pages with Roman numerals.

.+ -- match one or more of any character until the end of the line;
this gets the spaces and author name.

\n -- match the newline after the above match.

Check out the tutorial linked from the TextMate help. Regular
expressions are very handy.

Peter
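For anyone without TextMate, here is a rough sketch of the same find-and-replace in Python. The sample text and the variable names are made up for illustration, and I'm assuming Python's "re" module treats this pattern the same way TextMate does ("." not matching a newline), so take it as a sketch rather than a tested recipe for your book.

```python
import re

# Hypothetical OCR output: page headers look like "<page number> <author name>".
sample = (
    "end of the previous page\n"
    "12 Jane Author\n"
    "start of the next page\n"
)

# Same pattern as in the quoted steps. In Python, "." does not match a newline
# by default, so ".+" stops at the end of the header line.
header = re.compile(r"\n[0-9]{1,3}.+\n")

# Replacing with nothing removes both newlines, which joins the neighbouring lines.
joined = header.sub("", sample)

# Replacing with a single newline keeps the neighbouring lines separate.
separate = header.sub("\n", sample)

print(repr(joined))
print(repr(separate))
```

One thing to watch: the pattern has no way to tell a page header from a body line that happens to start with a digit, so it is worth stepping through the matches before replacing all, as Peter suggests.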
And for reverse text (that is, the other side of the page, "<title> <page number>"), you'd put in (I believe this is what I did!) ".+[0-9]{1,3}\n" (without quotes).

Hope this helps others.
UPDATE: Okay, that second command (the one I cobbled together) doesn't work entirely. It worked on some, but not others... and I have no idea why. If anyone has any suggestions, I'm all ears.

UPDATE, Part II - I did some looking around and found out that some of my lines have spaces at the end. Using ".+[0-9]{1,3}\s\n" (without quotes), I was able to get the rest smoothly.
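To show why the first version missed some lines, here is a small Python sketch with made-up sample text. The third pattern, with "[ \t]*", is my own assumption about how the two passes could be combined into one (it allows zero or more trailing spaces or tabs); it is not something I ran in TextMate.

```python
import re

# Hypothetical OCR output: headers look like "<title> <page number>",
# the second one with a trailing space left over from the OCR step.
sample = (
    "some body text\n"
    "My Book Title 13\n"
    "more body text\n"
    "My Book Title 15 \n"
    "last body text\n"
)

first_try = re.compile(r".+[0-9]{1,3}\n")       # pattern from the original post
with_space = re.compile(r".+[0-9]{1,3}\s\n")    # pattern from the second update
combined = re.compile(r".+[0-9]{1,3}[ \t]*\n")  # assumed variant covering both cases

after_first = first_try.sub("", sample)     # leaves the header with the trailing space
after_space = with_space.sub("", sample)    # leaves the header without one
after_combined = combined.sub("", sample)   # removes both headers in one pass

print(repr(after_first))
print(repr(after_space))
print(repr(after_combined))
```

As with the first pattern, any body line that ends in a number would also be removed, so checking the matches before replacing all is still a good idea.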

Last edited by Gideon; 01-28-2008 at 10:20 AM. Reason: update info