08-28-2009, 09:15 AM | #1 |
Zealot
Posts: 103
Karma: 148
Join Date: Aug 2008
Location: Huntington, IN US
Device: Sony PRS-505
|
Converting OCR Text files
I have some text files that I created from pbooks that I scanned and ran through OCR. Is there an MS Word template that can take the page numbers out and "optimize" the text for conversion to epub or lrf? I remember reading about someone who had a Word template with a VBScript that did some of the modifying, but I can't seem to find it. Right now I am doing it manually as I get ready to read them, but it is a pain in the carpal tunnels to do it. Any suggestions?
|
08-28-2009, 09:42 AM | #2 |
zeldinha zippy zeldissima
Posts: 27,827
Karma: 921169
Join Date: Dec 2007
Location: Paris, France
Device: eb1150 & is that a nook in her pocket, or she just happy to see you?
|
hi jedavis, i can't help you with your problem, but i'm going to move your thread to the workshop forum where i am sure one of our resident experts will have some advice for you.
|
08-28-2009, 11:00 AM | #3 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
It doesn't help you with your problem, but I avoid the issue by how I scan in. I use the free version of Aabbe OCR software (not 100% sure of that name, as I'm not on my home machine at the moment). It uses the scanner's native scanning app to scan in the pages. I scan the first page, then start scanning a second page. In the time it takes the scanner to scan the second page, I can mark (outline) the text on the first page that I want converted (which excludes the page number and the header/footer), and have the OCR software 'read' it. Then flip the page, start the third page scanning, and mark the desired text on the 2nd page and read it. At the end, you have the entire text OCR'd minus the headers/footers/page#'s and in basically the same amount of time as it takes to scan WITH those items included.
When scanning a paperback book, two facing pages are scanned at once. I save the OCR'd text as an RTF and load it into Open Office. The facing pages come up as two columns of text on each page. It's then a pretty easy process to convert everything to one column and get rid of the extraneous page breaks, and you're now ready to proofread and convert to whatever format you like. I like saving it at that point to HTML as a "base format" from which you can then convert to whatever you like, but your mileage will undoubtedly vary... |
08-28-2009, 01:37 PM | #4 |
Wizard
Posts: 3,442
Karma: 300001
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
|
The program is ABBYY FineReader and it's probably the best OCR software available.
|
08-28-2009, 03:28 PM | #5 |
Fanatic
Posts: 551
Karma: 1121392
Join Date: May 2008
Location: USA
Device: HTC One M8
|
I just scan and OCR normally, then tell FineReader to save to Word format, and under "save options" I tell it not to include headers or footers... bingo, no more page numbers or running titles (except the occasional one that gets left in).
Last edited by wayrad; 08-28-2009 at 03:32 PM. |
09-01-2009, 03:17 PM | #6 |
Enthusiast
Posts: 41
Karma: 10
Join Date: May 2009
Device: CyBook
|
I have also been looking for a similar "template" that would magically take away most of the effort involved... it's an elusive beast and is essentially a collection of Word macros which do most of the tidying up.
I've given up trying to find a single solution and my routine involves using find/replace to:
1. remove excess spaces
2. remove tabs
3. remove manual line breaks
4. remove optional hyphens
5. remove section breaks
I then select the whole text and, using font\format, set the font spacing to normal. Then I save as an rtf file, load it into Wordpad and change the whole font in one go. Re-loading back into Word, it is then ready for a spell-check and a check for any obvious page-break errors. After scanning and OCR, the formatting doesn't take that long. I then convert to mobipocket and read it with my Cybook, making corrections in Word at regular intervals, finally making a fully proof-read copy. |
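For anyone working on the raw OCR text file rather than inside Word, the find/replace routine above might be sketched roughly as follows. This is only an illustration under assumptions: the function name `tidy_ocr_text` is made up, blank lines are assumed to separate paragraphs, optional hyphens are assumed to appear as the soft-hyphen character U+00AD, and form feeds are assumed to stand in for section or page breaks. Real OCR output may well differ.

```python
import re

def tidy_ocr_text(text: str) -> str:
    """Rough plain-text counterpart of the find/replace steps (hypothetical helper)."""
    # Normalise line endings first so the later steps only deal with "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 4: drop optional (soft) hyphens, assumed to be U+00AD.
    text = text.replace("\u00ad", "")
    # Step 5: treat form feeds (assumed section/page breaks) as paragraph breaks.
    text = text.replace("\f", "\n\n")
    # Step 2: turn tabs into single spaces.
    text = text.replace("\t", " ")
    # Step 3: blank lines are assumed to separate paragraphs; line breaks
    # inside a paragraph are joined with a space.
    cleaned = []
    for para in re.split(r"\n\s*\n", text):
        joined = " ".join(line.strip() for line in para.split("\n"))
        # Step 1: collapse runs of spaces into one.
        joined = re.sub(r" {2,}", " ", joined).strip()
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)
```

Calling `tidy_ocr_text(open("book.txt", encoding="utf-8").read())` (the file name is just a placeholder) returns the cleaned text, which would still need the spell-check and proofreading pass described above.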
09-01-2009, 04:35 PM | #7 |
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
|
Depending on what the OCR'd text looks like, maybe load it into Book Designer and use the 'Short Paragraph' option in Element Browser to find the candidate lines. Select the ones you don't want and use the Delete option to get rid of them.
|
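The same "find the short lines, then delete the ones you don't want" idea can be tried on a plain-text file. The sketch below is not how Book Designer works internally; the names `find_short_lines` and `drop_lines`, the 40-character cutoff and the page-number pattern are all assumptions chosen for illustration.

```python
import re

# Assumed pattern: a line holding only a 1-4 digit number, optionally
# wrapped in punctuation or spaces, e.g. "12", "- 12 -" or "[45]".
PAGE_NUMBER = re.compile(r"^\W*\d{1,4}\W*$")

def find_short_lines(lines, max_len=40):
    """Return (index, text) pairs for short lines that look like headers or page numbers."""
    candidates = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or len(stripped) > max_len:
            continue
        # A short line that ends a sentence is probably the tail of a
        # paragraph, so only flag page numbers and unterminated lines.
        if PAGE_NUMBER.match(stripped) or stripped[-1] not in ".!?\"'":
            candidates.append((i, stripped))
    return candidates

def drop_lines(lines, indexes):
    """Return a copy of lines without the entries at the given indexes."""
    skip = set(indexes)
    return [line for i, line in enumerate(lines) if i not in skip]
```

The candidates are meant to be reviewed by eye before anything is deleted, since chapter headings and lines of verse will be flagged along with running titles and page numbers.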
09-04-2009, 06:53 AM | #8 |
Zealot
Posts: 103
Karma: 148
Join Date: Aug 2008
Location: Huntington, IN US
Device: Sony PRS-505
|
Thanks for all the help. Kino's scenario is about the closest to what I am manually doing right now. I am looking at a macro that can do all the above, but I am not holding out hope.
|
09-05-2009, 02:04 PM | #9 |
Enthusiast
Posts: 41
Karma: 10
Join Date: May 2009
Device: CyBook
|
Actually, that's not all that difficult. You can record a macro in Word which simply records whatever you do... the problem then is editing the macro so that it becomes a bit more generic.
Here's an example to remove excess spaces:

Sub RemoveWhiteSpaces()
    ' Select the whole document and clear any formatting left over
    ' from a previous find/replace.
    Selection.WholeStory
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        ' "^w" is Word's find code for any run of white space.
        .Text = "^w"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindAsk
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchKashida = False
        .MatchDiacritics = False
        .MatchAlefHamza = False
        .MatchControl = False
        .MatchByte = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = False
        .MatchFuzzy = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Last edited by Kino; 09-05-2009 at 02:42 PM. Reason: add macro |
09-07-2009, 05:02 AM | #10 |
Gnu
Posts: 1,222
Karma: 15625359
Join Date: Jul 2009
Location: UK
Device: BeBook,JetBook Lite,PRS-300-350-505-650,+ran out of space to type
|
Hi All
I expanded the idea out into a standalone program (it currently only deals with rtf / txt files). The file is attached (reflow.exe). It should be fairly straightforward; please have a play and let me know any feedback / questions / requests. Regards MikeB |
10-01-2009, 10:09 PM | #11 |
Addict
Posts: 260
Karma: 274
Join Date: Apr 2006
Location: Gig Harbor, Washington
Device: BeBook One, PocketBook 360, Kindle Paperwhite, Kobo Aura One
|
Mike: I tried to open this (after unzipping it) but an error message comes up saying that some component is missing. Stupid of me, but I don't remember which. Does it still work for you as is? Thanks.
|