As to OCR - I actually do it in Windows on my Mac. I have used OmniPage (and it's amazing), but only through work, and I'm not terribly comfortable doing this at work.
Generally, I either use Adobe Acrobat's OCR (not the best) or I use the ABBYY FineReader OCR that came with the OpticBook.
Now, the good news.....
I found a way to do it.
This requires TextMate (though many other "pure text" programs can probably do the same; this isn't something I know a lot about, but I was able to get help on the TextMate news list). It makes use of regular expressions, which, from what I gather, is exactly what kovidgoyal suggested (regexps), except now I have the commands needed.
A very kind man named Peter Cowan broke it down for me:
Quote:
Open the find window (command + F)
Set the find to the following (no extra spaces):
\n[0-9]{1,3}.+\n
Set replace to nothing
Be sure "Regular Expression" is checked. You can see if it's
highlighting the right part by clicking next, before replacing all.
I'll break it down for you:
\n -- match the new line before the number
[0-9]{1,3} -- match a number that is 1 to 3 digits long, if your book
has more pages than that you'll need to change this. This also won't
work for pages with roman numerals
.+ -- match one or more of any character until the end of the line,
this gets the spaces and author name.
\n -- match the newline after the above match.
Check out the tutorial linked from the Textmate help, Regular
Expressions are very handy.
Peter
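For anyone without TextMate, the same find-and-replace can be sketched in Python. This is only an illustration under assumptions: the sample text and the names HEADER and strip_headers are made up here, and Python's re module stands in for TextMate's regex engine, which may differ in some details.

```python
import re

# Hypothetical OCR output: the "7   Jane Author" line is a running page header.
sample = (
    "end of one page\n"
    "7   Jane Author\n"
    "start of the next page\n"
)

# Same pattern as in Peter's instructions above.
HEADER = re.compile(r"\n[0-9]{1,3}.+\n")


def strip_headers(text: str, replacement: str = "") -> str:
    """Replace every header match, like "Replace All" in the find window."""
    return HEADER.sub(replacement, text)


# Replacing with nothing, as in the instructions:
print(repr(strip_headers(sample)))
# 'end of one pagestart of the next page\n'

# Replacing with a single newline instead:
print(repr(strip_headers(sample, "\n")))
# 'end of one page\nstart of the next page\n'
```

Because the pattern swallows the newline on both sides of the header, replacing with nothing joins the line before the header to the line after it; replacing with a single newline keeps them apart. The pattern will also remove any ordinary line that starts with a digit and has more text after it, so it's worth clicking through a few matches before replacing all.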
And for the reverse layout (that is, the other side of the page, "<title> <page number>"), you'd put in ".+[0-9]{1,3}\n" (without quotes). I believe this is what I did!
Hope this helps others.
UPDATE: Okay, that second command (the one I cobbled together) doesn't work entirely. It worked on some, but not others... and I have no idea why. If anyone has any suggestions, I'm all ears.
UPDATE, Part II - I did some looking around and found out some of my lines have spaces at the end. So using this ".+[0-9]{1,3}\s\n" I was able to get the rest smoothly.
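To show why the trailing spaces mattered, here is a small Python sketch. It is an assumption-laden illustration, not what I actually ran: the sample lines and the names ORIGINAL, TOLERANT, and strip_reverse_headers are made up, and the TOLERANT pattern uses "[ \t]*" (zero or more spaces or tabs) as one possible way to cover lines both with and without trailing whitespace in a single pass.

```python
import re

# Hypothetical sample: two "<title> <page number>" headers,
# the second one with two trailing spaces after the number.
sample = (
    "Some Book Title 12\n"
    "body text of page twelve\n"
    "Some Book Title 14  \n"
    "body text of page fourteen\n"
)

# The pattern from the post: the digits must sit right before the newline.
ORIGINAL = re.compile(r".+[0-9]{1,3}\n")

# A variant that tolerates any number of trailing spaces or tabs.
TOLERANT = re.compile(r".+[0-9]{1,3}[ \t]*\n")


def strip_reverse_headers(text: str, pattern: re.Pattern = TOLERANT) -> str:
    """Delete every line that the given header pattern matches."""
    return pattern.sub("", text)


# The original pattern misses the header with trailing spaces:
print(repr(strip_reverse_headers(sample, ORIGINAL)))
# 'body text of page twelve\nSome Book Title 14  \nbody text of page fourteen\n'

# The tolerant pattern removes both headers:
print(repr(strip_reverse_headers(sample)))
# 'body text of page twelve\nbody text of page fourteen\n'
```

Be aware that either pattern removes any line that ends in a digit, header or not, so body text ending in a year or a number would go too.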