Need some PDF help please!

ficbot · 12-19-2009, 12:58 AM

I posted this the other day in the Workshop section and got no answers. I am a little desperate: this is a big project, and I can't start on it until I know what to do. Can someone please help me?

Here is what I posted in Workshop:

I picked up a cheap scanner and I am disappointed. I tried scanning a paperback book in English, and it did a terrible job, with lots of weird symbols all over the place. Then I tried the teaching guides, which are the main reason I wanted the scanner. What a mess! It seems the problem is that the text is half in French and half in English (it has prompts in English telling you what to say in French to the kids, for example "say 'je suis ici' while pointing at yourself"). When I set the scanner to OCR mode with the language set to English, I got gibberish. When I set it to French, things improved a little and it got much of the text, but the result still needed a lot of cleaning up.

I thought maybe it was just that the software which came with the scanner was not that great. So I downloaded a few utilities which claim to extract text from PDFs. They had great reviews. They totally choked on the French parts.

The PDF looks fine (I made a two-page sampler for testing purposes), but displays a bit too small for easy reading on the Sony. I uploaded it as a PDF, LRF and epub separately. The epub could not zoom at all (i.e. the page stayed looking the same no matter what). The LRF looked just like the PDF on lowest zoom but when I tried to zoom in, the text got garbled as it had when I tried to extract it from the PDF.

So, there are three possibilities here:

1) The scanner is not that great
2) The scanner is fine and I just need better software
3) Dual-language files are too hard and I am stuck with PDF

What do you think? Is there anything I can do here, or will I go to all this work just to wind up with itty bitty text in a PDF file? If so, it may not be worth scanning them all...

A4- · 12-19-2009, 04:33 AM

Whether it's 1, 2 or 3 (or a combo) I can't tell from just a vague description, but from the looks of it at least part of it is related to the OCR software.

I recently OCR-ed a screenshot with Dutch text on it using ABBYY set to English, and it made all sorts of weird mistakes. What that program does is make a decent guess and then run it through a sort of dictionary, so with mixed English and French text ABBYY isn't gonna work well.
In the early days of OCR, the software made a guess and, if it wasn't sure, you had to teach it what the letter/symbol was. I assume that kind of software would work a lot better in your case. What software that would be I don't know, though. I haven't had to OCR anything in at least a decade...
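
To make the dictionary point concrete, here is a toy Python sketch. It is not how ABBYY or any real OCR engine works internally; the word lists and the `snap_to_dictionary` helper are made up for illustration, and only the standard-library `difflib` module is used. It shows how a dictionary pass that knows only English can "correct" perfectly good French words into wrong English ones, while a combined word list leaves them alone.

```python
import difflib

# Hypothetical, tiny word lists; a real engine would ship far larger ones.
ENGLISH_WORDS = ["say", "jet", "suit", "icy", "while", "pointing"]
FRENCH_WORDS = ["je", "suis", "ici"]


def snap_to_dictionary(words, vocabulary, cutoff=0.6):
    """Replace each word with its closest vocabulary entry, if one is close enough."""
    fixed = []
    for word in words:
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)
    return fixed


# Pretend the character recognition step read these words correctly.
recognized = ["say", "je", "suis", "ici", "while", "pointing"]

english_only = snap_to_dictionary(recognized, ENGLISH_WORDS)
both_languages = snap_to_dictionary(recognized, ENGLISH_WORDS + FRENCH_WORDS)

print(" ".join(english_only))
print(" ".join(both_languages))
```

With the English-only list the French words get pulled toward English look-alikes; with both lists an exact match always wins, so the text comes through unchanged.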

About the scans:
- High resolution, low/no compression, high contrast, and straight/horizontal lines all reduce OCR errors. You might need to fix some of this depending on your scan and OCR results. And for text you don't need color ...
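
As a rough illustration of the "high contrast, no color" advice, here is a minimal Python sketch of thresholding a grayscale page to pure black and white. The `to_black_and_white` function, the fixed threshold of 128 and the sample pixel values are all assumptions for the example; real scanning software usually offers this as a "black and white" or "line art" mode and picks the threshold more cleverly.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map each 0-255 gray value to pure black (0) or pure white (255)."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_rows
    ]


# Hypothetical 2x3 patch of a scanned page: light paper with some darker ink.
page_patch = [
    [250, 240, 90],
    [30, 200, 120],
]

clean_patch = to_black_and_white(page_patch)
print(clean_patch)
```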

gl

Mike L · 12-19-2009, 04:51 AM

Ficbot,

I would've thought that OCR software worked in exactly the same way, regardless of the language. It looks at each character separately, and tries to determine which letter or symbol it represents. It doesn't know anything about words or sentences or meanings. It just converts shapes to letters, etc.

So the fact the book was partly in French and partly in English is probably irrelevant. More likely, either the software is poor or the original printed pages are difficult to read for some reason.

To determine which part of the system isn't working properly, try eliminating each variable in turn. Start by scanning an image. Does the result look like the original? If so, the scanner itself is probably OK. Next, try scanning a simple page of text, with a single clear font. If the OCR fails to convert it, then it's the software that's at fault.

Finally, if you can get access to a different type of scanner, test it with the English / French book that was causing the problem. If the results are still bad, that suggests that the problem lies in the quality of the printed page, or perhaps in the fonts.
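
When running tests like these, it can help to put a number on "how bad" each OCR result is rather than judging by eye. One common way is a character error rate: type out a short passage by hand, then count how many single-character edits separate it from the OCR output. The Python sketch below is only an illustration; the function names and the sample strings are made up, and it uses a plain Levenshtein edit distance with no external libraries.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def char_error_rate(expected, ocr_output):
    """Edit distance divided by the length of the hand-typed reference text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(expected, ocr_output) / len(expected)


# Hypothetical example: the same phrase as read by two scanner/software setups.
reference = "je suis ici"
setup_a = "je suis ici"
setup_b = "je su1s lci"

print(char_error_rate(reference, setup_a))
print(char_error_rate(reference, setup_b))
```

Comparing the rate for the same page across scanners, software and language settings shows which change actually makes a difference.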

I hope you manage to find a solution.

HarryT · 12-19-2009, 04:54 AM

Please continue this discussion in the original thread:

https://www.mobileread.com/forums/showthread.php?t=65993

This really does not belong in "News and Commentary".

We will close this thread.

Thank you.
