Old 01-27-2008, 10:47 PM   #1
Gideon
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
eBook Text Editing

Okay.... so there is, no doubt, somewhere around here an answer to this. Unfortunately, I wouldn't begin to know how to search for it. So pardon the repetition.

As I've mentioned in other places, I've been scanning many of my books. For some of them (research/school), I like to keep the page numbers in for citation purposes, and the OCR process does that well enough.

However, with fiction, this isn't working out too well. I can get rid of the hard returns and unwrap the text, but I've not worked out a way to remove the page number/title/author at the top of the scanned page in any efficient way.

So, take this chunk, for instance:

Code:
Ser Waymar Royce glanced at the sky with disinterest. "It does that every
day 
about this time. Are you unmanned by the dark, Gared?" 
Will could see the tightness around Gared's mouth, the barely sup 
2  <authors name>
pressed anger in his eyes under the thick black hood of his cloak. Gared had 
spent forty years in the Night's Watch, man and boy, and he was not
accustomed 
to being made light of. Yet it was more than that. Under the wounded pride, 
Will could sense something else in the older man. Yonervous tension that
came perilous close to fear.
So I need a way to remove the page number line on every page, and remove the hard return so that I can then reflow the text normally.

I'm on a Mac - and would think TextMate the answer - but I have Windows installed if anyone knows of any free Windows software that can help.

I also have an Ubuntu installation available - but I'm not so comfortable with Linux as to like this option if I can avoid it.
Thanks!
Old 01-27-2008, 11:28 PM   #2
Gideon
Heh... I wish. Time and experience have taught me that whatever brains I do have, I certainly can't hack it at any sort of programming.

Hell, even after years running a game development group I could, at best, read the script but couldn't script my own way out of a paper bag.

As to the other suggestion... I've actually looked, and not been successful. Also, as this issue will crop up from time to time, I'd really like to figure out a way around this.
Old 01-28-2008, 02:56 AM   #3
Darqref
space cadet
Posts: 330
Karma: 2963633
Join Date: Aug 2007
Location: Seattle area
Device: Rocket PRO, gen3, Pocketbook360
Quote:
Originally Posted by Gideon View Post
As I've mentioned in other places, I've been scanning many of my books. Some of which (research/school) I like to keep the page numbers in for citation purposes, and the OCR process does that well enough.

However, with fiction, this isn't working out too well. I can get rid of the hard returns and unwrap the text, but I've not worked out a way to remove the page number/title/author at the top of the scanned page in any efficient way.

(snip sample)

So I need a way to remove the page number line on every page, and remove the hard return so that I can then reflow the text normally.

I'm on a Mac - and would think TextMate the answer - but I have Windows installed if anyone knows of any free Windows software that can help.

I also have an Ubuntu installation available - but I'm not so comfortable with Linux as to like this option if I can avoid it.
Thanks!
How are you scanning your source? What software are you using for OCR?

I've been (slowly) scanning a couple of books with a home-made stand, a digital camera, and a copy of OmniPage 16 (on WinXP). Part of the process of the OCR involves manipulating the images before actually running the OCR. At that time, I adjust the boundaries of the boxes that indicate what part of the image will be processed, to exclude the page headers. Because I'm using a digital camera, I can't exactly focus the lens on ONLY the page, and there is always a border of some sort that includes the stand and the clamps I'm using to hold the page in place. So I need to adjust the borders anyway, and clipping the page numbers, etc. is easy.

If you are not doing it manually, there must be some software mechanism to select what area of the scan is text and what is graphics. You should be able to adjust how that process works, especially if you can be sure that each page scan will be in the same place as the rest of the book's pages.

Using the software that came with my scanner, you'd start each page with a "preview scan" and then draw a border around the area to select it for processing. I stopped using the scanner, because without cutting the spine of the book I wasn't getting a flat enough scan to drive the error rate down. The digital camera process works much better, and OmniPage 16 is new (as of last summer, anyway), so its OCR is much better than what I got with my several-year-old scanner from HP.
Old 01-28-2008, 04:29 AM   #4
mores
Guru
Posts: 834
Karma: 102419
Join Date: Sep 2007
Location: Vienna, Austria
Device: iPhone
Quote:
Originally Posted by Gideon View Post
I'm on a Mac - and would think TextMate the answer - but I have Windows installed if anyone knows of any free Windows software that can help.
How do you do OCR on a Mac?
I mean, without paying $1k for OmniPage (I think).

Other than that, I'd batch-crop the source images too. You might also want to try the program "TextWrangler", which is free and offers a lot of text-editing options.
Old 01-28-2008, 09:54 AM   #5
Gideon
As to OCR - I actually do it in Windows on my Mac. I have used OmniPage (and it's amazing) but only through work, and I'm not terribly comfortable doing this at work.

Generally, I either use Adobe Acrobat's OCR (not the best) or I use the ABBYY FineReader OCR that came with the OpticBook.

Now, the good news..... I found a way to do it.

This requires TextMate (though many other "pure text" programs may support it; this isn't something I know a lot about, but I was able to get help on the TextMate news list). It makes use of Regular Expressions - which, from what I gather, is exactly what kovidgoyal suggested (regexps), except now I have the commands needed.

A very kind man named Peter Cowan broke it down for me:
Quote:
Open the find window (command + F)
Set the find to the following (no extra spaces):
\n[0-9]{1,3}.+\n

Set replace to nothing

Be sure "Regular Expression" is checked. You can see if it's
highlighting the right part by clicking next, before replacing all.

I'll break it down for you:

\n -- match the new line before the number

[0-9]{1,3} -- match a number that is 1 to 3 digits long; if your book
has more pages than that you'll need to change this. This also won't
work for pages with roman numerals.

.+ -- match one or more of any character until the end of the line,
this gets the spaces and author name.

\n -- match the newline after the above match.

Check out the tutorial linked from the Textmate help, Regular
Expressions are very handy.

Peter
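For anyone who would rather script this than use an editor's find window, here is a minimal Python sketch of the same find-and-replace. It assumes the OCR output is already in a string; the function name `strip_left_headers` and the shortened sample are only illustrative, not something from the TextMate list.

```python
import re

# Shortened sample of the OCR output; the "2  <authors name>" line is the
# running header that should disappear.
sample = (
    "Will could see the tightness around Gared's mouth, the barely sup \n"
    "2  <authors name>\n"
    "pressed anger in his eyes under the thick black hood of his cloak.\n"
)

# Same pattern as in the find window: newline, 1-3 digits, one or more
# other characters up to the end of the line, newline.
HEADER = re.compile(r"\n[0-9]{1,3}.+\n")


def strip_left_headers(text: str) -> str:
    """Delete '<page number> <author>' lines by replacing them with nothing."""
    return HEADER.sub("", text)


cleaned = strip_left_headers(sample)
print(cleaned)
```

Because both newlines are part of the match, the line before the header and the line after it end up joined, which also takes care of that one hard return. Note that any other line beginning with a digit would be removed as well, so it is worth checking the matches before replacing everything.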
And for the reverse layout (that is, the other side of the page, "<title> <page number>"), you'd put in (I believe this is what I did!) ".+[0-9]{1,3}\n" (without quotes).

Hope this helps others.
UPDATE: Okay, that second command (the one I cobbled together) doesn't work entirely. It worked on some, but not others... and I have no idea why. If anyone has any suggestions, I'm all ears.

UPDATE, Part II - I did some looking around and found out some of my lines have spaces at the end. So using this ".+[0-9]{1,3}\s\n" I was able to get the rest smoothly.
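The trailing-space problem can be handled in a single pass. The Python sketch below is a variation, not the exact pattern above: `[ \t]*` accepts zero or more trailing spaces or tabs, so header lines with and without stray whitespace both match. The sample text and the name `strip_right_headers` are made up for illustration.

```python
import re

# Hypothetical sample: right-hand pages carry "<title> <page number>",
# sometimes followed by a stray space left by the OCR.
sample = (
    "spent forty years in the Night's Watch, man and boy, and he was not\n"
    "<title> 3 \n"
    "accustomed to being made light of.\n"
    "<title> 5\n"
    "Yet it was more than that.\n"
)

# Variation on ".+[0-9]{1,3}\s\n": the line must end in 1-3 digits,
# optionally followed by spaces or tabs, then the newline.
RIGHT_HEADER = re.compile(r"^.+[0-9]{1,3}[ \t]*\n", re.MULTILINE)


def strip_right_headers(text: str) -> str:
    """Delete '<title> <page number>' lines, including their newline."""
    return RIGHT_HEADER.sub("", text)


cleaned = strip_right_headers(sample)
print(cleaned)
```

As with the first pattern, any body line that happens to end in a digit would also be removed, so a quick look through the matches before a bulk replace is a good idea.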

Last edited by Gideon; 01-28-2008 at 10:20 AM. Reason: update info
Old 01-28-2008, 12:26 PM   #6
kovidgoyal
creator of calibre
Posts: 43,844
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Nice to see you're taking your first baby steps into the world of regexps. You should know that that regexp will match any line in the file that starts with up to 3 digits, not just lines of the form

2 author name

You should probably make it more specific. Something along the lines of

Code:
\n\d{1,5}\s+author name
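A rough Python sketch of that more specific pattern, to show the difference it makes. The author name "A. N. Author", the sample text, and the function name `strip_named_headers` are placeholders; `re.escape` is used so that characters such as the dots in the name are matched literally.

```python
import re


def strip_named_headers(text: str, author: str) -> str:
    """Remove '<page number> <author>' header lines, and only those."""
    # Newline, 1-5 digits, whitespace, then the literal author name.
    pattern = re.compile(r"\n\d{1,5}\s+" + re.escape(author))
    return pattern.sub("", text)


# Hypothetical sample: one real header plus a body line that starts
# with a number and must be left alone.
sample = (
    "barely sup \n"
    "2  A. N. Author\n"
    "pressed anger in his eyes.\n"
    "40 years in the Watch had not\n"
    "prepared him.\n"
)

cleaned = strip_named_headers(sample, "A. N. Author")
print(cleaned)
```

Here the line starting with "40 years" survives, because a number at the start of a line is no longer enough for a match; the author name has to follow it.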