01-28-2008, 02:56 AM   #3
Darqref
space cadet
Posts: 334
Karma: 2999999
Join Date: Aug 2007
Location: Seattle area
Device: Rocket PRO, gen3, Pocketbook360
Quote:
Originally Posted by Gideon
As I've mentioned in other places, I've been scanning many of my books. For some of them (research/school) I like to keep the page numbers in for citation purposes, and the OCR process does that well enough.

However, with fiction, this isn't working out too well. I can get rid of the hard returns and unwrap the text, but I've not worked out a way to remove the page number/title/author at the top of the scanned page in any efficient way.

(snip sample)

So I need a way to remove the page number line on every page, and remove the hard returns so that I can then reflow the text normally.

I'm on a Mac - and would think TextMate the answer - but I have Windows installed if anyone knows of any free Windows software that can help.

I also have an Ubuntu installation available - but I'm not so comfortable with Linux as to like this option if I can avoid it.
Thanks!
How are you scanning your source? What software are you using for OCR?

I've been (slowly) scanning a couple of books with a home-made stand, a digital camera, and a copy of OmniPage 16 (on WinXP). Part of the OCR process involves manipulating the images before actually running the recognition. At that stage, I adjust the boundaries of the boxes that indicate what part of the image will be processed, so that they exclude the page headers. Because I'm using a digital camera, I can't exactly focus the lens on ONLY the page, and there is always a border of some sort that includes the stand and the clamps I'm using to hold the page in place. So I need to adjust the borders anyway, and clipping the page numbers, etc. is easy.

If you are not doing it manually, there must be some software mechanism to select what area of the scan is text and what is graphics. You should be able to adjust how that process works, especially if you can be sure that each page scan will be in the same place as the rest of the book's pages.
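
To make the "same place on every page" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how OmniPage or any scanner software works internally: the function name text_zone and the pixel margins are made up for the example. It computes one crop box that skips the header strip and the border, which could then be reused for every page image of the book.

```python
def text_zone(width, height, top, bottom, left, right):
    """Return a (left, upper, right, lower) box in pixels.

    The arguments top/bottom/left/right are margins to discard, in pixels.
    'top' should be tall enough to cover the header line (page number,
    title, author). All values here are assumptions you would measure
    once on a sample page.
    """
    if left + right >= width or top + bottom >= height:
        raise ValueError("margins leave no text area")
    return (left, top, width - right, height - bottom)


# Hypothetical 2000 x 3000 pixel page photo with a 300-pixel header strip.
box = text_zone(2000, 3000, top=300, bottom=150, left=100, right=100)
print(box)
```

The box uses the (left, upper, right, lower) order that image libraries such as Pillow take for a crop, so the same tuple could be applied to each page before the images go to OCR.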

Using the software that came with my scanner, you'd start each page with a "preview scan" and then draw a border around the area to select it for processing. I stopped using the scanner, because without cutting the spine of the book I wasn't getting a flat enough scan to drive the error rate down. The digital camera process works much better, and OmniPage 16 is new (as of last summer, anyway), so its OCR is much better than what came with my several-year-old HP scanner.
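
For text that has already been OCR'd with the headers in it, a post-processing pass is another option. The following is a rough sketch in Python, not a tested tool: the header pattern is a guess (a line that is only a page number, or a page number next to an all-caps title or author), and the function names are made up for the example. It would need adjusting to whatever the headers in a particular book actually look like.

```python
import re

# Assumed header shapes: "12 THE BOOK TITLE", "THE AUTHOR 13", or a bare "14".
HEADER_RE = re.compile(
    r"^\s*(\d{1,4}\s+[A-Z][A-Z .,'-]+|[A-Z][A-Z .,'-]+\s+\d{1,4}|\d{1,4})\s*$"
)


def strip_headers(text):
    """Drop every line that looks like a running page header."""
    return "\n".join(
        line for line in text.splitlines() if not HEADER_RE.match(line)
    )


def unwrap(text):
    """Join hard-wrapped lines; blank lines are kept as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

Running unwrap(strip_headers(text)) on the OCR output does both steps in order. This sketch does not rejoin words hyphenated across line breaks, and it will also delete any body line that happens to match the pattern, so the result still needs a read-through.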