12-19-2009, 12:58 AM | #1 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
Need some PDF help please!
I posted this the other day in the Workshop section and got no answers. I am a little desperate, as this is a big project that I really need to start on, and I can't start until I know what to do. Can someone please help me?
Here is what I posted in Workshop:

I picked up a cheap scanner and I am disappointed. I tried scanning a paperback book in English, and it did a terrible job, with lots of weird symbols all over the place. Then I tried the teaching guides, which are the main reason I wanted the scanner. What a mess! It seems the problem is that the text is half in French and half in English (e.g. it has prompts in English telling you what to say in French to the kids, for example "say 'je suis ici' while pointing at yourself.") So when I set the scanner to OCR mode and the language was English, I got gibberish. When I set it to French, things improved a little and it got much of it, but the text still needed a lot of cleaning up.

I thought maybe it was just that the software which came with the scanner was not that great. So I downloaded a few utilities which claim to extract text from PDFs. They had great reviews. They totally choked on the French parts.

The PDF looks fine (I made a two-page sampler for testing purposes), but displays a bit too small for easy reading on the Sony. I uploaded it as a PDF, LRF and epub separately. The epub could not zoom at all (i.e. the page stayed looking the same no matter what). The LRF looked just like the PDF on lowest zoom, but when I tried to zoom in, the text got garbled as it had when I tried to extract it from the PDF.

So, there are three possibilities here:
1) The scanner is not that great
2) The scanner is fine and I just need better software
3) Dual-language files are too hard and I am stuck with PDF

What do you think? Is there anything I can do here, or will I go to all this work just to wind up with itty bitty text in a PDF file? If so, it may not be worth scanning them all... |
12-19-2009, 04:33 AM | #2 |
Connoisseur
Posts: 84
Karma: 1110
Join Date: Aug 2009
Location: Netherlands
Device: iRex iLiad v2
|
Whether it's 1, 2 or 3 (or a combo) I can't tell from just a vague description, but from the looks of it at least part of it is related to the OCR software.
I recently OCR-ed a screenshot with Dutch text on it using ABBYY set to English, and it made all sorts of weird errors. What that program does is make a decent guess and then run it through a sort of dictionary, so with mixed English and French text ABBYY isn't going to work well. In the early days of OCR, the software made a guess, and if it wasn't sure you had to teach it what the letter/symbol was. I assume that kind of software will work a lot better in your case. What software that would be I don't know, though. I haven't had to OCR anything in at least a decade... About the scans: high resolution, low/no compression, high contrast, and straight/horizontal lines all reduce OCR errors. Some of this you might need to fix depending on your scan and OCR results. And for text you don't need color ... Good luck. |
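To illustrate the "guess, then run it through a dictionary" behaviour described above, here is a minimal sketch in Python. It is not how ABBYY or any real OCR engine is implemented; the word lists, the `correct` helper and the `cutoff` value are all made up for illustration. It only shows why an English-only dictionary can mangle correctly recognised French words, while a combined dictionary leaves them alone.

```python
import difflib

# Tiny illustrative word lists (assumptions, not real OCR dictionaries).
ENGLISH = ["say", "while", "pointing", "at", "yourself", "here", "is", "his", "sue"]
FRENCH = ["je", "suis", "ici"]


def correct(word, dictionary, cutoff=0.5):
    """Keep a known word; otherwise snap it to the closest dictionary entry.

    If nothing in the dictionary is similar enough, the word is left as-is.
    """
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line, dictionary):
    """Apply the dictionary pass to every whitespace-separated word."""
    return " ".join(correct(w, dictionary) for w in line.split())


if __name__ == "__main__":
    recognised = "je suis ici"  # pretend the character recognition was perfect
    print("English only:    ", correct_line(recognised, ENGLISH))
    print("English + French:", correct_line(recognised, ENGLISH + FRENCH))
```

With the English-only list, at least one French word gets "corrected" into an English look-alike even though the characters were read correctly. With both lists, the line comes through unchanged. If that is what is happening, OCR software that lets you enable both languages at once would be worth trying.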
12-19-2009, 04:51 AM | #3 |
Wizard
Posts: 1,479
Karma: 3846231
Join Date: Apr 2009
Location: Edinburgh, Scotland
Device: Kindle 3, Samsung Galaxy
|
Ficbot,
I would've thought that OCR software worked in exactly the same way, regardless of the language. It looks at each character separately, and tries to determine which letter or symbol it represents. It doesn't know anything about words or sentences or meanings. It just converts shapes to letters, etc. So the fact the book was partly in French and partly in English is probably irrelevant. More likely, either the software is poor or the original printed pages are difficult to read for some reason. To determine which part of the system isn't working properly, try eliminating each variable in turn. Start by scanning an image. Does the result look like the original? If so, the scanner itself is probably OK. Next, try scanning a simple page of text, with a single clear font. If the OCR fails to convert it, then it's the software that's at fault. Finally, if you can get access to a different type of scanner, test it with the English / French book that was causing the problem. If the results are still bad, that suggests that the problem lies in the quality of the printed page, or perhaps in the fonts. I hope you manage to find a solution. |
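One way to make the elimination tests above less subjective is to put a number on each result. The sketch below is only a suggestion, and it assumes you type out a short passage by hand as the reference text. The function names are made up. It computes a character error rate (edit distance divided by the length of the reference), so you can compare scanner A against scanner B, or the English setting against the French setting, on the same page.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def char_error_rate(expected, ocr_output):
    """Edit distance relative to the length of the hand-typed reference."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(expected, ocr_output) / len(expected)


if __name__ == "__main__":
    reference = "say 'je suis ici' while pointing at yourself."
    # Hypothetical OCR result, invented for this example.
    ocr_result = "say 'je sius ici' whi1e pointing at yourse1f."
    print(f"character error rate: {char_error_rate(reference, ocr_result):.1%}")
```

A lower rate on the simple single-font page than on the teaching guide would point at the printed source. A high rate everywhere would point at the scanner or the software.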
12-19-2009, 04:54 AM | #4 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Please continue this discussion in the original thread:
https://www.mobileread.com/forums/showthread.php?t=65993 This really does not belong in "News and Commentary". We will close this thread. Thank you. |