eBook Text Editing


Gideon
01-27-2008, 11:47 PM
Okay.... so there is, no doubt, somewhere around here an answer to this. Unfortunately, I wouldn't begin to know how to search for it. So pardon the repetition.

As I've mentioned in other places, I've been scanning many of my books. For some of them (research/school), I like to keep the page numbers in for citation purposes, and the OCR process does that well enough.

However, with fiction, this isn't working out too well. I can get rid of the hard returns and unwrap the text, but I've not worked out a way to remove the page number/title/author at the top of the scanned page in any efficient way.

So, take this chunk, for instance:


Ser Waymar Royce glanced at the sky with disinterest. "It does that every
day
about this time. Are you unmanned by the dark, Gared?"
Will could see the tightness around Gared's mouth, the barely sup
2 <authors name>
pressed anger in his eyes under the thick black hood of his cloak. Gared had
spent forty years in the Night's Watch, man and boy, and he was not
accustomed
to being made light of. Yet it was more than that. Under the wounded pride,
Will could sense something else in the older man. Yonervous tension that
came perilous close to fear.


So I need a way to remove the page number line on every page, and remove the hard return so that I can then reflow the text normally.

I'm on a Mac - and would think TextMate the answer - but I have Windows installed if anyone knows of any free Windows software that can help.

I also have an Ubuntu installation available - but I'm not so comfortable with Linux as to like this option if I can avoid it.
Thanks!

Gideon
01-28-2008, 12:28 AM
Heh... I wish. Time and experience have taught me that whatever brains I do have, I certainly can't hack it at any sort of programming.

Hell, even after years running a game development group I could, at best, read the script but couldn't script my own way out of a paper bag.

As to the other suggestion... I've actually looked, and not been successful. Also, as this issue will crop up from time to time, I'd really like to figure out a way around this.

Darqref
01-28-2008, 03:56 AM
How are you scanning your source? What software are you using for OCR?

I've been (slowly) scanning a couple of books with a home-made stand, a digital camera, and a copy of OmniPage 16 (on WinXP). Part of the process of the OCR involves manipulating the images before actually running the OCR. At that time, I adjust the boundaries of the boxes that indicate what part of the image will be processed, to exclude the page headers. Because I'm using a digital camera, I can't exactly focus the lens on ONLY the page, and there is always a border of some sort that includes the stand and the clamps I'm using to hold the page in place. So I need to adjust the borders anyway, and clipping the page numbers, etc. is easy.

If you are not doing it manually, there must be some software mechanism to select what area of the scan is text and what is graphics. You should be able to adjust how that process works, especially if you can be sure that each page scan will be in the same place as the rest of the book's pages.

Using the software that came with my scanner, you'd start each page with a "preview scan" and then draw a border around the area to select it for processing. I stopped using the scanner, because without cutting the spine of the book I wasn't getting a flat enough scan to drive the error rate down. The digital camera process works much better, and OmniPage 16 is new (as of last summer, anyway) so its OCR is much better than what came with my several-year-old scanner from HP.

mores
01-28-2008, 05:29 AM
How do you do OCR on a Mac? I mean, without paying $1k for OmniPage (I think).

Other than that, I'd batch-crop the source images too. You might also want to try the program "TextWrangler", which is free and offers a lot of text-editing options.

Gideon
01-28-2008, 10:54 AM
As to OCR - I actually do it in Windows on my Mac. I have used OmniPage (and it's amazing) but only through work and I'm not terribly comfortable doing this at work.

Generally, I either use Adobe Acrobat's OCR (not the best) or I use the ABBYY FineReader OCR that came with the OpticBook.

Now, the good news..... I found a way to do it.

This requires TextMate (though many other "pure text" programs may support it; this isn't something I know a lot about, but I was able to get help on the TextMate news list). It makes use of Regular Expressions, which, from what I gather, is exactly what kovidgoyal suggested (regexps), except now I have the commands needed.

A very kind man named Peter Cowan broke it down for me:
Open the find window (command + F)
Set the find to the following (no extra spaces):
\n[0-9]{1,3}.+\n

Set replace to nothing

Be sure "Regular Expression" is checked. You can see if it's
highlighting the right part by clicking next, before replacing all.

I'll break it down for you:

\n -- match the new line before the number

[0-9]{1,3} -- match a number that is 1 to 3 digits long; if your book
has more pages than that you'll need to change this. This also won't
work for pages with roman numerals.

.+ -- match one or more of any character until the end of the line,
this gets the spaces and author name.

\n -- match the newline after the above match.

Check out the tutorial linked from the TextMate help; Regular
Expressions are very handy.

Peter
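
For anyone who would rather script this than click through a find window, the same pattern can be tried with Python's `re` module. This is only a sketch, not something from the thread: the sample is a shortened piece of the chunk from the first post, and the function name `strip_left_headers` is made up for illustration.

```python
import re

# Pattern from Peter's message: newline, 1-3 digits,
# the rest of the header line, then the closing newline.
HEADER = re.compile(r"\n[0-9]{1,3}.+\n")

def strip_left_headers(text: str) -> str:
    """Delete 'page-number author' header lines, joining the lines around them."""
    return HEADER.sub("", text)

sample = (
    "Will could see the tightness around Gared's mouth, the barely sup\n"
    "2 <authors name>\n"
    "pressed anger in his eyes under the thick black hood of his cloak.\n"
)

cleaned = strip_left_headers(sample)
print(cleaned)
```

Because the pattern eats the newline on both sides of the header, a word split across the page break ("sup" / "pressed") is rejoined. The flip side is that when a page ends on a whole word, the two neighbouring words get glued together, so a spot check after replacing is probably wise.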



And for the reverse layout (that is, the other side of the page, "<title> <page number>"), you'd put in (I believe this is what I did!) ".+[0-9]{1,3}\n" (without quotes).

Hope this helps others.
UPDATE: Okay, that second command (the one I cobbled together) doesn't work entirely. It worked on some, but not others... and I have no idea why. If anyone has any suggestions, I'm all ears.

UPDATE, Part II - I did some looking around and found out some of my lines have spaces at the end. So using this ".+[0-9]{1,3}\s\n" I was able to get the rest smoothly.
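
The two right-hand-page patterns above can probably be folded into one. The sketch below is an assumption rather than what was actually run in TextMate: it swaps the `\s` for `[ \t]*`, so headers with or without trailing spaces are both caught, and the sample lines and the name `strip_right_headers` are invented for the demonstration.

```python
import re

# Variation on the pattern from the post: any text, 1-3 digits,
# optional trailing spaces or tabs, then the newline.
RIGHT_HEADER = re.compile(r".+[0-9]{1,3}[ \t]*\n")

def strip_right_headers(text: str) -> str:
    """Delete '<title> <page number>' header lines, trailing spaces or not."""
    return RIGHT_HEADER.sub("", text)

sample = (
    "first line of a page\n"
    "<title> 3 \n"          # header with a trailing space
    "text from the next page\n"
    "<title> 5\n"           # header without one
    "more text\n"
)

cleaned = strip_right_headers(sample)
print(cleaned)
```

Note that this pattern removes any line that ends in a digit, so a sentence that happens to end a line with a year or a number would be deleted as well.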

kovidgoyal
01-28-2008, 01:26 PM
Nice that you're taking your first baby steps into the world of regexps. You should know that that regexp will match any line in the file that starts with up to 3 digits, not just lines of the form

2 author name

You should probably make it more specific. Something along the lines of


\n\d{1,5}\s+author name
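
A small Python sketch of the difference, under stated assumptions: the body line beginning with "40 years" is invented to show the false positive, "author name" stands in for whatever the real header says, and `header_pattern` is a hypothetical helper, not part of any tool mentioned here.

```python
import re

def header_pattern(author):
    # The more specific form: newline, page number, whitespace,
    # then the literal author name (escaped in case it contains regex characters).
    return re.compile(r"\n\d{1,5}\s+" + re.escape(author))

# The broad pattern from earlier in the thread, for comparison.
generic = re.compile(r"\n[0-9]{1,3}.+\n")

text = (
    "the barely sup\n"
    "2 author name\n"
    "pressed anger in his eyes. Gared had spent\n"
    "40 years in the Night's Watch.\n"
)

generic_result = generic.sub("", text)
specific_result = header_pattern("author name").sub("", text)

print(repr(generic_result))   # the "40 years" body line is gone too
print(repr(specific_result))  # only the header line is removed
```

The specific pattern as written does not consume the newline after the header, so the hard return there is left for the later unwrap step.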