Old 08-28-2009, 09:15 AM   #1
jedavis1
Zealot
Posts: 103
Karma: 148
Join Date: Aug 2008
Location: Huntington, IN US
Device: Sony PRS-505
Converting OCR Text files

I have some text files that I created from pbooks by scanning them and running OCR. Is there an MS Word template that can take the page numbers out and "optimize" the text for conversion to EPUB or LRF? I remember reading about someone who had a Word template with a VBScript that did some of the modifying, but I can't seem to find it. Right now I am doing it manually as I get ready to read each book, but it is a pain in the carpal tunnels. Any suggestions?
Old 08-28-2009, 09:42 AM   #2
zelda_pinwheel
zeldinha zippy zeldissima
Posts: 27,828
Karma: 908606
Join Date: Dec 2007
Location: Paris, France
Device: eb1150 & is that a nook in her pocket, or she just happy to see you?
hi jedavis, i can't help you with your problem, but i'm going to move your thread to the workshop forum where i am sure one of our resident experts will have some advice for you.
Old 08-28-2009, 11:00 AM   #3
ekaser
Opinion Artiste
Posts: 300
Karma: 61462
Join Date: Mar 2009
Location: Albany, OR
Device: Motorola Xoom, Android phone, Kindle Touch, Kindle Fire
It doesn't help you with your problem, but I avoid the issue by how I scan in. I use the free version of Aabbe OCR software (not 100% sure of that name, as I'm not on my home machine at the moment). It uses the scanner's native scanning app to scan in the pages. I scan the first page, then start scanning a second page. In the time it takes the scanner to scan the second page, I can mark (outline) the text on the first page that I want converted (which excludes the page number and the header/footer), and have the OCR software 'read' it. Then flip the page, start the third page scanning, and mark the desired text on the 2nd page and read it. At the end, you have the entire text OCR'd minus the headers/footers/page#'s and in basically the same amount of time as it takes to scan WITH those items included.

When scanning a paperback book, two facing pages are scanned at once. I save the OCR'd text as an RTF and load it into Open Office. The facing pages come up as two columns of text on each page. It's then a pretty easy process to convert everything to one column and get rid of the extraneous page breaks, and you're now ready to proofread and convert to whatever format you like. I like saving it at that point to HTML as a "base format" from which you can then convert to whatever you like, but your mileage will undoubtedly vary...
Old 08-28-2009, 01:37 PM   #4
igorsk
Wizard
Posts: 3,443
Karma: 52235
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
The program is ABBYY FineReader and it's probably the best OCR software available.
Old 08-28-2009, 03:28 PM   #5
wayrad
Fanatic
Posts: 547
Karma: 1121392
Join Date: May 2008
Location: USA
Device: Galaxy Nexus
I just scan and OCR normally, then tell FineReader to save to Word format, and under "save options" I tell it not to include headers or footers... bingo, no more page numbers or running titles (except the occasional one that gets left in).
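
For the occasional page number that does get left in, a small script run over a plain-text export can catch the obvious cases. The following is only a sketch under assumptions not stated in the thread: the text has been saved as plain text, a stray page number sits alone on its own line, and the names `PAGE_NUMBER` and `drop_page_numbers` are made up for illustration.

```python
import re

# Assumption: a stray page number is a line holding nothing but a 1-4 digit
# number, optionally written as "Page 12" or "- 12 -".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?-?\s*\d{1,4}\s*-?\s*$", re.IGNORECASE)


def drop_page_numbers(text):
    """Return text with lines that look like bare page numbers removed."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "It was a dark\n42\nand stormy night.\n- 43 -\nPage 44\nIn 1984 he left."
    print(drop_page_numbers(sample))
```

A line that is legitimately just a number (a chapter number, say) would be dropped too, so it is worth checking the result against the original before converting.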

Last edited by wayrad; 08-28-2009 at 03:32 PM.
Old 09-01-2009, 03:17 PM   #6
Kino
Enthusiast
Posts: 41
Karma: 10
Join Date: May 2009
Device: CyBook
I have also been looking for a similar "template" that would magically take away most of the effort involved... it's an elusive beast and is essentially a collection of Word macros which do most of the tidying up.

I've given up trying to find a single solution and my routine involves using find/replace to:

1. remove excess spaces
2. remove tabs
3. remove manual line breaks
4. remove optional hyphens
5. remove section breaks
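
For anyone working on a plain-text export rather than inside Word, the five steps above can be sketched roughly as follows. This is not Kino's routine, only an approximation under assumptions: that manual line breaks appear as vertical tabs, section breaks as form feeds, and optional hyphens as U+00AD in the exported text. The function name `tidy_ocr_text` is hypothetical.

```python
import re


def tidy_ocr_text(text):
    """Apply rough equivalents of the five find/replace steps to plain text."""
    # 5. Section breaks: assumed to be exported as form feeds.
    text = text.replace("\f", "\n")
    # 4. Optional (soft) hyphens: U+00AD, removed outright.
    text = text.replace("\u00ad", "")
    # 3. Manual line breaks: assumed to be exported as vertical tabs.
    text = text.replace("\v", " ")
    # 2. Tabs: replaced by a space so neighbouring words do not fuse.
    text = text.replace("\t", " ")
    # 1. Excess spaces: collapse runs, then trim spaces left at line ends.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" +\n", "\n", text)
    return text


if __name__ == "__main__":
    raw = "Hello   world\tfoo\vbar co\u00adoperate\fNext"
    print(repr(tidy_ocr_text(raw)))
```

Tabs and manual line breaks are turned into spaces rather than deleted, which is a design choice made here and not something the list specifies; deleting them outright can glue two words together.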

I then select the whole text and using font\format set the font spacing to normal.

Then I save as an rtf file, load into Wordpad and change the whole font in one go.

Re-loading back into Word it is then ready for a spell-check and any obvious page-break errors.

After scanning and OCR, the formatting doesn't take that long.

I then convert to Mobipocket and read it on my Cybook, making corrections in Word at regular intervals, until I end up with a fully proofread copy.
Old 09-01-2009, 04:35 PM   #7
Sparrow
Wizard
Posts: 4,400
Karma: 1358102
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
Depending on what the OCR'd text looks like, maybe load it into Book Designer and use the 'Short Paragraph' option in Element Browser to find the candidate lines. Select the ones you don't want and use the Delete option to get rid of them.
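
The same idea (list the short lines, then decide by hand which ones to delete) can be imitated on a plain-text file. The sketch below is not how Book Designer works internally; it is only a loose imitation, and the function name `short_line_candidates` and the 30-character default are arbitrary assumptions.

```python
def short_line_candidates(text, max_chars=30):
    """Return (line_number, line) pairs for non-empty lines of at most max_chars."""
    candidates = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Short, non-blank lines are often page numbers or running titles.
        if stripped and len(stripped) <= max_chars:
            candidates.append((number, stripped))
    return candidates


if __name__ == "__main__":
    sample = (
        "THE GREAT NOVEL\n"
        "It was the best of times, it was the worst of times, it was the age of wisdom.\n"
        "\n"
        "17\n"
        "Short."
    )
    for number, line in short_line_candidates(sample, max_chars=20):
        print(number, line)
```

The list is meant for review only: the last line of an ordinary paragraph is often short as well, so nothing should be deleted automatically.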
Old 09-04-2009, 06:53 AM   #8
jedavis1
Zealot
Posts: 103
Karma: 148
Join Date: Aug 2008
Location: Huntington, IN US
Device: Sony PRS-505
Thanks for all the help. Kino's scenario is about the closest to what I am manually doing right now. I am looking at a macro that can do all the above, but I am not holding out hope.
Old 09-05-2009, 02:04 PM   #9
Kino
Enthusiast
Posts: 41
Karma: 10
Join Date: May 2009
Device: CyBook
Actually, that's not all that difficult. You can record a macro in Word, which simply records whatever you do... the problem then is editing the macro so that it becomes a bit more generic.

Here's an example to remove excess spaces:

Sub RemoveWhiteSpaces()
    ' Replace every run of white space in the document with a single space.
    Selection.WholeStory
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = "^w"              ' ^w is Word's find code for white space
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindContinue    ' recorded as wdFindAsk, which can prompt the user
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchKashida = False
        .MatchDiacritics = False
        .MatchAlefHamza = False
        .MatchControl = False
        .MatchByte = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = False
        .MatchFuzzy = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Last edited by Kino; 09-05-2009 at 02:42 PM. Reason: add macro
Old 09-07-2009, 05:02 AM   #10
MikeB1972
Evangelist
Posts: 489
Karma: 2099219
Join Date: Jul 2009
Location: UK
Device: BeBook,JetBook Lite,PRS-300-350-505-650,+ran out of space to type
Hi All

I expanded the idea into a standalone program (it currently only deals with RTF and TXT files).

The file is attached (reflow.exe). It should be fairly straightforward; please have a play and let me know any feedback, questions or requests.

Regards

MikeB
Attached Files
File Type: zip Reflow.zip (22.8 KB, 99 views)
Old 10-01-2009, 10:09 PM   #11
ascherjim
Addict
Posts: 257
Karma: 274
Join Date: Apr 2006
Location: Seattle
Device: BeBook One, PocketBook 360. Nokia N800
Mike: I tried to open this (after unzipping it) but an error message comes up saying that some component is missing. Stupid of me, but I don't remember which. Does it still work for you as is? Thanks.