01-27-2008, 10:47 PM | #1 |
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
|
eBook Text Editing
Okay.... so there is, no doubt, somewhere around here an answer to this. Unfortunately, I wouldn't begin to know how to search for it. So pardon the repetition.
As I've mentioned in other places, I've been scanning many of my books. For some of them (research/school) I like to keep the page numbers in for citation purposes, and the OCR process does that well enough. With fiction, however, this isn't working out too well. I can get rid of the hard returns and unwrap the text, but I haven't worked out an efficient way to remove the page number/title/author line at the top of each scanned page. So, take this chunk, for instance: Code:
Ser Waymar Royce glanced at the sky with disinterest. "It does that every day about this time. Are you unmanned by the dark, Gared?" Will could see the tightness around Gared's mouth, the barely sup 2 <authors name> pressed anger in his eyes under the thick black hood of his cloak. Gared had spent forty years in the Night's Watch, man and boy, and he was not accustomed to being made light of. Yet it was more than that. Under the wounded pride, Will could sense something else in the older man. Yonervous tension that came perilous close to fear.
I'm on a Mac and would think TextMate is the answer, but I have Windows installed if anyone knows of any free Windows software that can help. I also have an Ubuntu installation available, but I'm not comfortable enough with Linux to like this option if I can avoid it. Thanks! |
01-27-2008, 11:28 PM | #2 |
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
|
Heh... I wish. Time and experience have taught me that whatever brains I do have, I certainly can't hack it at any sort of programming.
Hell, even after years running a game development group I could, at best, read the script but couldn't script my own way out of a paper bag. As to the other suggestion... I've actually looked, and not been successful. Also, as this issue will crop up from time to time, I'd really like to figure out a way around this. |
01-28-2008, 02:56 AM | #3 | |
space cadet
Posts: 331
Karma: 2963633
Join Date: Aug 2007
Location: Seattle area
Device: Rocket PRO, gen3, Pocketbook360
|
Quote:
I've been (slowly) scanning a couple of books with a home-made stand, a digital camera, and a copy of OmniPage 16 (on WinXP). Part of the OCR process involves manipulating the images before actually running the OCR. At that time, I adjust the boundaries of the boxes that indicate what part of the image will be processed, to exclude the page headers. Because I'm using a digital camera, I can't exactly focus the lens on ONLY the page, and there is always a border of some sort that includes the stand and the clamps I'm using to hold the page in place. So I need to adjust the borders anyway, and clipping the page numbers, etc. is easy. If you are not doing it manually, there must be some software mechanism to select what area of the scan is text and what is graphics. You should be able to adjust how that process works, especially if you can be sure that each page scan will be in the same place as the rest of the book's pages. Using the software that came with my scanner, you'd start each page with a "preview scan" and then draw a border around the area to select it for processing. I stopped using the scanner, because without cutting the spine of the book I wasn't getting a flat enough scan to drive the error rate down. The digital camera process works much better, and OmniPage 16 is new (as of last summer, anyway) so its OCR is much better than that of my several-year-old scanner from HP. |
|
01-28-2008, 04:29 AM | #4 | |
Guru
Posts: 834
Karma: 102419
Join Date: Sep 2007
Location: Vienna, Austria
Device: iPhone
|
Quote:
I mean, without paying $1k for OmniPage (I think). Other than that, I'd batch-crop the source images too. You might also want to try TextWrangler, a free program that offers a lot of text-editing options. |
|
01-28-2008, 09:54 AM | #5 | |
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
|
As to OCR - I actually do it in Windows on my Mac. I have used OmniPage (and it's amazing) but only through work and I'm not terribly comfortable doing this at work.
Generally, I either use Adobe Acrobat's OCR (not the best) or the ABBYY FineReader OCR that came with the OpticBook. Now, the good news... I found a way to do it. This requires TextMate (though many other "pure text" programs may support it; this isn't something I know a lot about, but I was able to get help on the TextMate news list). It makes use of Regular Expressions which, from what I gather, is exactly what kovidgoyal suggested (regexps), except now I have the commands needed. A very kind man named Peter Cowan broke it down for me: Quote:
Hope this helps others.
UPDATE: Okay, that second command (the one I cobbled together) doesn't work entirely. It worked on some, but not others... and I have no idea why. If anyone has any suggestions, I'm all ears.
UPDATE, Part II - I did some looking around and found out some of my lines have spaces at the end. So using this ".+[0-9]{1,3}\s\n" I was able to get the rest smoothly.
Last edited by Gideon; 01-28-2008 at 10:20 AM. Reason: update info |
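For anyone without TextMate, the trailing-space fix from the second update can be tried in a few lines of Python. This is only a rough sketch, not the commands from the quoted post: it assumes each running header sits on its own line and ends in a 1-3 digit page number, possibly followed by spaces or tabs. The name `strip_header_lines` and the sample text are made up for illustration.

```python
import re

# Assumed header shape: some text, a 1-3 digit page number, then optional
# trailing spaces or tabs, all on a single line.
HEADER_LINE = re.compile(r"^.+[0-9]{1,3}[ \t]*\n", re.MULTILINE)


def strip_header_lines(text: str) -> str:
    """Delete every line that ends in a number plus optional trailing blanks."""
    return HEADER_LINE.sub("", text)


# Made-up sample: one header has a trailing space, the other does not.
sample = (
    "the barely sup\n"
    "BOOK TITLE 3 \n"
    "pressed anger in his eyes\n"
    "BOOK TITLE 4\n"
    "under the thick black hood\n"
)
cleaned = strip_header_lines(sample)
print(cleaned)
```

Using `[ \t]*` rather than a single `\s` tolerates zero or more trailing blanks. The catch is that any ordinary line of the book that happens to end in a number is deleted too, so it is worth checking the result by eye.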
|
01-28-2008, 12:26 PM | #6 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Nice you're taking your first baby steps into the world of regexps. You should know that that regexp will match any line in the file that starts with up to 3 digits, not just lines of the form
2 author name
You should probably make it more specific. Something along the lines of Code:
\n\d{1,5}\s+author name |
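The more specific pattern above could be applied from a script along these lines. This is a minimal Python sketch under stated assumptions: the header still sits on its own line (that is, it is run before unwrapping the text), and the literal author string is known. `strip_running_headers` and the author name "Jane Author" are placeholders, not anything from this thread.

```python
import re


def strip_running_headers(text: str, author: str) -> str:
    """Remove '<newline><page number> <author>' headers, keeping the body text."""
    # Newline, a 1-5 digit page number, whitespace, then the literal author
    # name; re.escape guards against regex metacharacters in the name.
    pattern = re.compile(r"\n\d{1,5}\s+" + re.escape(author) + r"[ \t]*")
    return pattern.sub("", text)


# Made-up sample: the last line mentions the same words mid-sentence and
# must survive, because it is not preceded by a newline plus page number.
sample = (
    "the barely sup\n"
    "2 Jane Author\n"
    "pressed anger in his eyes\n"
    "He owned 2 Jane Author books\n"
)
cleaned = strip_running_headers(sample, "Jane Author")
print(cleaned)
```

Anchoring on both the newline and the author name is what keeps ordinary lines that merely start with or contain digits from being removed.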
|