Old 04-02-2010, 09:31 AM   #1
71117c
Junior Member
71117c began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: PB 301
2 questions concerning scanning novels

hi,

i recently thought of buying a scanner and using destructive scanning (i.e. cutting the binding) to convert novels into epub. to get an idea of how tedious this process would be, i converted books that i have in pdf format on my pc to tiff and then ran some linux ocr software on them.
i encountered two issues that would slow down the conversion process tremendously if i can't solve them:

1) detection of italic fonts.
2) detection of paragraphs: the ocr software i was using detected paragraphs within a page fine. but since it operated on one page at a time, it couldn't recognise whether a last sentence that ended with a period at the bottom of a page was also the end of a paragraph.

is there any ocr software (windows or linux) that could reliably handle those two problems?
1) is "only" an ocr problem, but for 2) i would need something like: last sentence on a page ends with a period -> check if the first sentence on the next page is indented. if so -> new paragraph.
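
A minimal sketch of the rule described in 2), not a feature of any particular OCR program. It assumes each page is available as plain text with the original leading indentation preserved and with paragraphs inside a page separated by blank lines; the function names are hypothetical.

```python
import re

# A paragraph "ends a sentence" if its last non-space characters are
# . ! or ? optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?][\"')]*\s*$")


def split_paragraphs(page_text):
    """Split one page at blank lines, keeping each paragraph's indentation."""
    blocks = re.split(r"\n(?:[ \t]*\n)+", page_text.strip("\n"))
    return [block for block in blocks if block.strip()]


def continues_previous(prev_par, next_par):
    """Decide whether next_par (top of a page) continues prev_par."""
    if not SENTENCE_END.search(prev_par):
        return True  # the page ended mid-sentence, so it must continue
    # The page ended with a full sentence: treat the next page as a new
    # paragraph only if its first line is indented.
    return not next_par[:1].isspace()


def join_pages(pages):
    """Merge per-page paragraphs into one list, joining across page breaks."""
    paragraphs = []
    for page in pages:
        page_pars = split_paragraphs(page)
        if not page_pars:
            continue
        if paragraphs and continues_previous(paragraphs[-1], page_pars[0]):
            paragraphs[-1] = paragraphs[-1].rstrip() + " " + page_pars[0].strip()
            page_pars = page_pars[1:]
        paragraphs.extend(page_pars)
    # Collapse line breaks and indentation inside each finished paragraph.
    return [" ".join(par.split()) for par in paragraphs]
```

This only works if the indentation survives the OCR step, and it will guess wrong for books that do not indent the first paragraph after a heading or scene break, so the result still needs a quick check.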

cheers 71117c
Old 04-02-2010, 11:49 AM   #2
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Out of curiosity, what OCR program were you using?

I've been meaning to try out various OCR options for linux (Tesseract, Cuneiform, Ocropus, etc.), but just haven't found time for it. If I get around to it, I'd be happy to swap notes.

A lot of people around here rave about ABBYY FineReader (Windows) for OCR, and I'd imagine it can handle both these desiderata, though as an open source enthusiast, I always like to see what other options are available before moving to commercial tools.
Old 04-02-2010, 12:42 PM   #3
71117c
Junior Member
71117c began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: PB 301
Quote:
Originally Posted by frabjous
Out of curiosity, what OCR program were you using?

I've been meaning to try out various OCR options for linux (Tesseract, Cuneiform, Ocropus, etc.), but just haven't found time for it. If I get around to it, I'd be happy to swap notes.
i've been using cuneiform, since tesseract only produces plain text and i couldn't get ocropus to work; i had problems compiling it on archlinux.

my workflow was as follows:
[.) convert the pdf to tiff with ghostscript]
.) run cuneiform on the tiffs -> html output. one file per tiff.
.) merged the html files into a single xhtml file.
.) used vim to remove page numbers (if any), remove hyphenations,...
.) and finally pasted the file into sigil to insert a cover, add chapters, ...
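
The vim cleanup step above (removing page numbers and hyphenations) can be sketched in Python. This is an illustration under assumptions, not the workflow actually used: it works on plain text rather than cuneiform's html, treats any line holding only one to four digits as a page number, and rejoins a word only when a lowercase letter is followed by a hyphen at the end of a line. The function names are hypothetical.

```python
import re

# A line that contains nothing but a short number is assumed to be a page number.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")

# "con-" at the end of a line followed by "version" on the next line.
HYPHEN_BREAK = re.compile(r"([a-z])-\n[ \t]*([a-z])")


def drop_page_numbers(text):
    """Remove lines that consist only of a page number."""
    lines = text.split("\n")
    return "\n".join(line for line in lines if not PAGE_NUMBER.match(line))


def join_hyphenated(text):
    """Rejoin words that were split with a hyphen at a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


def clean(text):
    """Apply both cleanup steps in order."""
    return join_hyphenated(drop_page_numbers(text))
```

Both rules have false positives: a line of dialogue that is just a year would be dropped, and a real compound such as "well-known" loses its hyphen if it happened to break at the end of a line, so the output is best reviewed during proofreading.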

character recognition of cuneiform is pretty good. sometimes italic characters weren't detected as such (so no <i> tags around them in the html output), and there is also the problem mentioned in my first post concerning paragraphs that span two pages (but this is due to the way cuneiform operates ...)


Quote:
Originally Posted by frabjous
A lot of people around here rave about ABBYY FineReader (Windows) for OCR, and I'd imagine it can handle both these desiderata, though as an open source enthusiast, I always like to see what other options are available before moving to commercial tools.
there is even a commandline tool from abbyy for linux (http://www.ocr4linux.com/). a trial version can be downloaded, but i haven't tried it yet. i also prefer open source ...
Old 04-02-2010, 02:17 PM   #4
pepak
Fanatic
pepak has a spectacular aura about
 
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
Quote:
Originally Posted by 71117c
to get an idea of how tedious this process would be, i converted books that i have in pdf format on my pc to tiff and then ran some linux ocr software on them.
i encountered two issues that would slow down the conversion process tremendously if i can't solve them:
And here you are at the core of the matter: scanning a book is easy enough; even a non-destructive scan goes quite quickly. But getting usable text out of the scanned images is a lot of work, and not only because of the two issues you mention.

Quote:
1) detection of italic fonts.
That must be a problem with your OCR software. The FineReader family of OCR programs handles italics quite well (the error rate is higher on italics than on regular fonts, but the results are still good enough).

Quote:
2) detection of paragraphs: the ocr software i was using, detected paragraphs within a page fine. but since it operated on a single page, it couldn't recognise, if the last sentence on a page, that ended with a period there, was also the end of a paragraph or not.
I don't know of any software that would handle the described situation well. I consider splitting and rejoining of paragraphs a necessary part of the proofing process.

Quote:
i would need something like: last sentence on a page ends with a period. -> check if first sentence on the next page is indented. if so -> new paragraph.
Personally, I think the human way (you read it and decide whether a paragraph break should or shouldn't be there) is the easiest, most of the time anyway. You can't avoid proofreading the OCRed text, so you might as well do the paragraph thing at the same time.
Old 04-03-2010, 04:50 AM   #5
Jellby
frumious Bandersnatch
Jellby ought to be getting tired of karma fortunes by now.
 
Posts: 6,105
Karma: 4791309
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by pepak
Personally, I think the human way (you read it and decide whether a paragraph break should or shouldn't be there) is the easiest, most of the time anyway. You can't avoid proofreading the OCRed text, so you might as well do the paragraph thing at the same time.
I agree. You have to read the book anyway. But just checking for paragraph breaks at page breaks is rather fast: you can look at the beginning of every page and check whether the pages that start with an uppercase letter are new paragraphs or not (the OCR software will probably treat all of them the same way, so you only have to look for those that are not correct).
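
The check described above can be narrowed down with a short script that lists only the pages worth looking at. A rough sketch, assuming the OCR output is available as one plain-text string per page; `pages_to_review` is a hypothetical helper name, and it only flags candidates, it does not decide anything.

```python
OPENING_QUOTES = "\"'\u201c\u2018"


def pages_to_review(pages):
    """Return (page_number, first_words) for pages starting with an uppercase letter.

    Pages that start in lowercase almost certainly continue a paragraph,
    so only the uppercase starts need a human decision.
    """
    flagged = []
    for number, text in enumerate(pages, start=1):
        if number == 1:
            continue  # the first page has no previous page to join with
        stripped = text.lstrip()
        # Look past an opening quotation mark, as in dialogue.
        first_letter = stripped.lstrip(OPENING_QUOTES)[:1]
        if first_letter.isupper():
            flagged.append((number, " ".join(stripped.split()[:6])))
    return flagged


if __name__ == "__main__":
    demo = [
        "Chapter one begins.",
        "and continues here.",
        "A new paragraph, maybe.",
    ]
    for number, words in pages_to_review(demo):
        print(f"page {number}: {words} ...")
```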
Old 05-06-2010, 02:30 PM   #6
Blackguard
Enthusiast
Blackguard began at the beginning.
 
Posts: 29
Karma: 14
Join Date: Feb 2008
Device: Kindle 2
I mostly just scan with Tesseract these days, because the accuracy is so good and I can drive it from the command line. I don't really care about the italics; I add them manually as I proofread. I used to use FineReader, and I thought it did a really terrible job on italicized one-word or two- to three-word groups. It constantly missed them, so I decided to just forget it, since I mostly hated its HTML output anyway. It did fine on the paragraphs and italicized sentences, however.
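
For anyone wanting to try the command-line route, here is a small sketch that builds one Tesseract command per page image. It relies only on the basic invocation `tesseract IMAGE OUTPUTBASE -l LANG`, which writes `OUTPUTBASE.txt`; the file names and the helper `tesseract_command` are made up for illustration, and the script only prints the commands rather than running them.

```python
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build the argument list for OCRing one page image with Tesseract.

    The output base is the image path without its extension, so
    page-0001.tif produces page-0001.txt next to it.
    """
    image = Path(image)
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang]


if __name__ == "__main__":
    # Hypothetical page images; replace with the real scan files.
    for page in ["page-0001.tif", "page-0002.tif"]:
        print(" ".join(tesseract_command(page)))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once Tesseract is installed.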