04-02-2010, 09:31 AM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: PB 301
|
2 questions concerning scanning novels
hi,
i recently thought of buying a scanner and using destructive scanning (i.e. cutting the binding) to convert novels into epub. to get an idea of how tedious this process would be, i converted books that i have in pdf format on my pc to tiff and then ran some linux ocr software on them. i ran into two issues that would slow down the conversion process tremendously if i can't solve them:
1) detection of italic fonts.
2) detection of paragraphs: the ocr software i was using detected paragraphs within a page fine. but since it operates on one page at a time, it couldn't recognise whether the last sentence on a page, if it ended with a period, was also the end of a paragraph.
is there any ocr software (windows or linux) that can reliably handle those two problems? 1) is "only" an ocr problem, but for 2) i would need something like: last sentence on a page ends with a period -> check if the first sentence on the next page is indented. if so -> new paragraph.
cheers
71117c |
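The rule described for 2) can be tried as a post-processing step that is independent of the OCR engine. The following is only a minimal sketch under stated assumptions: each page is available as plain text, and the OCR output keeps a paragraph indent as leading whitespace on the first line of a paragraph. The names `starts_indented` and `join_pages` are made up for this illustration.

```python
import re

# Assumption: a page "ends with a period" if its last non-space characters are
# sentence punctuation, optionally followed by a closing quote or bracket.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*\s*$')


def starts_indented(page: str) -> bool:
    """True if the first non-empty line of the page begins with whitespace."""
    for line in page.splitlines():
        if line.strip():
            return line[0] in " \t"
    return False


def join_pages(pages: list[str]) -> str:
    """Join pages, starting a new paragraph only where the rule applies."""
    pages = [p for p in pages if p.strip()]
    if not pages:
        return ""
    out = pages[0].rstrip()
    for prev, nxt in zip(pages, pages[1:]):
        ends_sentence = bool(SENTENCE_END.search(prev))
        if ends_sentence and starts_indented(nxt):
            # Previous page ended a sentence and the next page is indented:
            # treat the page break as a paragraph break.
            out += "\n\n" + nxt.strip()
        else:
            # Otherwise the paragraph continues across the page break.
            out += " " + nxt.strip()
    return out
```

A word hyphenated across the page break is not handled here, and the result still needs proofreading, since an OCR engine may drop or invent leading whitespace.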
04-02-2010, 11:49 AM | #2 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Out of curiosity, what OCR program were you using?
I've been meaning to try out various OCR options for Linux (Tesseract, Cuneiform, OCRopus, etc.), but just haven't found time for it. If I get around to it, I'd be happy to swap notes. A lot of people around here rave about ABBYY FineReader (Windows) for OCR, and I'd imagine it can handle both these desiderata, though as an open source enthusiast, I always like to see what other options are available before moving to commercial tools. |
04-02-2010, 12:42 PM | #3 | ||
Junior Member
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: PB 301
|
Quote:
my workflow was as follows:
[.) convert the pdf to tiff with ghostscript]
.) run cuneiform on the tiffs -> html output, one file per tiff.
.) merge the html files into a single xhtml file.
.) use vim to remove page numbers (if any), remove hyphenations, ...
.) finally, paste the file into sigil to insert a cover, add chapters, ...
character recognition of cuneiform is pretty good. sometimes italic characters weren't detected as such (so no <i> tags around them in the html output), and there is also the problem mentioned in my first post concerning paragraphs that span two pages (but this is due to the way cuneiform operates ...)
Quote:
|
||
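The clean-up step in the workflow above (removing page numbers and hyphenations, done by hand in vim in the post) could also be scripted. This is a rough sketch, not the poster's method. It assumes plain text rather than HTML markup, that a page number is a line containing only digits, and that a hyphen at the end of a line between two lowercase letters is a line-break hyphen. `clean_page` is a hypothetical helper name.

```python
import re

# Assumption: a page number is a line that contains nothing but digits.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")
# Lowercase letter, hyphen at end of line, then a lowercase letter on the next line.
LINE_END_HYPHEN = re.compile(r"([a-z])-\n\s*([a-z])")


def clean_page(text: str) -> str:
    """Drop digit-only lines and rejoin words hyphenated across a line break."""
    lines = [ln for ln in text.splitlines() if not PAGE_NUMBER.match(ln)]
    joined = "\n".join(lines)
    return LINE_END_HYPHEN.sub(r"\1\2", joined)
```

A genuine compound that happens to break at the hyphen (for example "well-" at the end of one line and "known" at the start of the next) is joined as well, so these spots are worth checking while proofreading.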
04-02-2010, 02:17 PM | #4 | ||||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Quote:
Quote:
Quote:
|
||||
04-03-2010, 04:50 AM | #5 |
frumious Bandersnatch
Posts: 7,516
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I agree. You have to read the book anyway. But just detecting paragraph breaks at page breaks is rather fast: you can check the beginning of every page and see whether pages that start with an uppercase letter are new paragraphs or not (the OCR software will probably treat all of them the same way, so you only have to look for the ones that are wrong).
|
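The manual check suggested above can be narrowed down with a small helper that lists only the pages worth looking at. A minimal sketch, assuming each page is available as a plain-text string. The function name `pages_to_review` and the set of leading quote characters it skips are choices made for this illustration.

```python
def pages_to_review(pages):
    """Return (page_number, first_line) for pages that start with an uppercase letter."""
    flagged = []
    for number, page in enumerate(pages, start=1):
        # First non-empty line of the page, without surrounding whitespace.
        first_line = next((ln.strip() for ln in page.splitlines() if ln.strip()), "")
        # Skip an opening quote or bracket so that dialogue is also caught.
        text = first_line.lstrip("\"'([")
        if text and text[0].isupper():
            flagged.append((number, first_line))
    return flagged
```

Pages that start with a lowercase letter are almost certainly continuations, so only the flagged pages need a decision on whether they begin a new paragraph.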
05-06-2010, 02:30 PM | #6 |
Enthusiast
Posts: 29
Karma: 14
Join Date: Feb 2008
Device: Kindle 2
|
I mostly just scan with Tesseract these days, because the accuracy is so good and I can manipulate it from the command line. I don't really care about the italics; I add them manually as I proofread. I used to use FineReader, and I thought it did a really terrible job on italicized one-word or two- to three-word groups. It constantly missed them, so I said just forget it, since I mostly hated its HTML output anyway. It did fine on the paragraphs and italicized sentences, however.
|
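For anyone who wants to try the command-line route mentioned above, a batch run over a folder of page images might look like the sketch below. It relies on the basic `tesseract IMAGE OUTPUTBASE -l LANG` invocation, which writes `OUTPUTBASE.txt`. The folder layout, the `.tif` extension and the helper names are assumptions for illustration, not something stated in the post.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the command for one image; the text lands next to it as <name>.txt."""
    output_base = image.with_suffix("")  # tesseract appends ".txt" itself
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_directory(folder: str, lang: str = "eng") -> None:
    """Run tesseract on every .tif file in the folder, in sorted order."""
    for image in sorted(Path(folder).glob("*.tif")):
        # check=True raises an error if tesseract exits with a failure code.
        subprocess.run(tesseract_command(image, lang), check=True)
```

Calling `ocr_directory("scans")` would then produce one text file per page, which can be proofread or merged afterwards. Tesseract has to be installed and on the PATH for this to run.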
|