Old 12-19-2009, 01:58 AM   #1
ficbot
Wizard
Posts: 2,390
Karma: 4115574
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
Need some PDF help please!

I posted this the other day in the Workshop section and got no answers. I am a little desperate, as this is a big project: I really need to start on it, and I can't start until I know what to do. Can someone please help me?

Here is what I posted in Workshop:

I picked up a cheap scanner and I am disappointed. I tried scanning a paperback book in English, and it did a terrible job, with lots of weird symbols all over the place. Then I tried the teaching guides, which are the main reason I wanted the scanner. What a mess! It seems the problem is that the text is half in French and half in English (it has prompts in English telling you what to say in French to the kids, for example "say 'je suis ici' while pointing at yourself."). So when I set the scanner to OCR mode with the language set to English, I got gibberish. When I set it to French, things improved a little and it got much of the text, but the result still needed a lot of cleaning up.

I thought maybe it was just that the software which came with the scanner was not that great. So I downloaded a few utilities which claim to extract text from PDFs. They had great reviews. They totally choked on the French parts.

The PDF looks fine (I made a two-page sampler for testing purposes), but displays a bit too small for easy reading on the Sony. I uploaded it as a PDF, LRF and epub separately. The epub could not zoom at all (i.e. the page stayed looking the same no matter what). The LRF looked just like the PDF on lowest zoom but when I tried to zoom in, the text got garbled as it had when I tried to extract it from the PDF.

So, there are three possibilities here:

1) The scanner is not that great
2) The scanner is fine and I just need better software
3) Dual-language files are too hard and I am stuck with PDF

What do you think? Is there anything I can do here, or will I go to all this work just to wind up with itty bitty text in a PDF file? If so, it may not be worth scanning them all...
Old 12-19-2009, 05:33 AM   #2
A4-
Connoisseur
Posts: 84
Karma: 1110
Join Date: Aug 2009
Location: Netherlands
Device: iRex iLiad v2
Whether it's 1, 2 or 3 (or a combo), I can't tell from just a vague description, but from the looks of it, at least part of the problem is related to the OCR software.

I recently OCR-ed a screenshot with Dutch text on it using ABBYY set to English, and it made all sorts of weird mistakes. What that program does is make a decent guess and then run it through a sort of dictionary, so ABBYY isn't going to work well on mixed English and French text.
In the early days of OCR, the software made a guess, and if it wasn't sure, you had to teach it what the letter or symbol was. I assume that kind of software will work a lot better in your case. What software that would be, I don't know though. I haven't had to OCR anything in at least a decade...
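To illustrate the "guess, then dictionary" step described above, here is a minimal Python sketch. It is not how ABBYY or any particular OCR product works internally; the word lists, the `correct` and `correct_line` helpers, and the similarity cutoff are all assumptions made up for the example. It only shows why snapping every word to a single-language dictionary can damage text in another language.

```python
import difflib

# Tiny illustrative word lists (assumptions; a real OCR dictionary is far larger).
ENGLISH = ["say", "while", "pointing", "at", "yourself", "is", "here", "sun", "ice"]
FRENCH = ["je", "suis", "ici"]


def correct(word, dictionary, cutoff=0.5):
    """Return the closest dictionary word, or the word unchanged if nothing is close."""
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line, dictionary):
    """Apply the dictionary correction to every whitespace-separated word."""
    return " ".join(correct(w, dictionary) for w in line.split())


if __name__ == "__main__":
    phrase = "je suis ici"  # correctly recognised French text
    # English-only dictionary: French words get pulled toward English look-alikes.
    print(correct_line(phrase, ENGLISH))
    # Combined dictionary: exact French matches win, so the phrase is left alone.
    print(correct_line(phrase, ENGLISH + FRENCH))
```

With the English-only list, correctly read French words are replaced by their nearest English neighbours, while the combined list leaves them alone. That suggests looking for OCR software that lets you enable both languages at once, if such an option exists in the tool you are using.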

About the scans:
- High resolution, low/no compression, high contrast, and straight, horizontal lines all reduce OCR errors. Some of this you might need to fix depending on your scan and OCR results. And for text you don't need color...
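As a rough sketch of the "no color, high contrast" advice, the Python below converts color pixels to gray and then forces each one to pure black or pure white. The function names, the flat list-of-tuples pixel format, and the fixed threshold of 128 are assumptions for illustration only; in practice the scanner driver or an image editor would do this step, and a good threshold depends on the page.

```python
def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples (0-255) to gray values using ITU-R BT.601 luma weights."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def binarize(gray_pixels, threshold=128):
    """Map each gray value to pure black (0) or pure white (255)."""
    return [255 if v >= threshold else 0 for v in gray_pixels]


if __name__ == "__main__":
    # A yellowish paper pixel and a faded-ink pixel (made-up sample values).
    page = [(240, 230, 200), (70, 60, 65)]
    print(binarize(to_grayscale(page)))
```

A fixed threshold is the simplest choice; unevenly lit or yellowed pages may need a different value.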

gl
Old 12-19-2009, 05:51 AM   #3
Mike L
Wizard
Posts: 1,441
Karma: 3818575
Join Date: Apr 2009
Location: Edinburgh, Scotland
Device: Kindle 3, Samsung Galaxy
Ficbot,

I would've thought that OCR software worked in exactly the same way, regardless of the language. It looks at each character separately, and tries to determine which letter or symbol it represents. It doesn't know anything about words or sentences or meanings. It just converts shapes to letters, etc.

So the fact the book was partly in French and partly in English is probably irrelevant. More likely, either the software is poor or the original printed pages are difficult to read for some reason.

To determine which part of the system isn't working properly, try eliminating each variable in turn. Start by scanning an image. Does the result look like the original? If so, the scanner itself is probably OK. Next, try scanning a simple page of text, with a single clear font. If the OCR fails to convert it, then it's the software that's at fault.

Finally, if you can get access to a different type of scanner, test it with the English / French book that was causing the problem. If the results are still bad, that suggests that the problem lies in the quality of the printed page, or perhaps in the fonts.
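One way to put numbers on the comparisons above is a character error rate: type out one short page by hand as the reference, then measure how far each scanner and software combination strays from it. The Python sketch below is only an illustration; `edit_distance` and `char_error_rate` are names chosen for this example, and the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character from b
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the hand-typed reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "say 'je suis ici' while pointing at yourself"
    ocr_output = "say 'je sui5 ici' whi1e pointing at yourself"  # made-up OCR result
    print(f"{char_error_rate(reference, ocr_output):.1%}")
```

Running the same page through each setup and comparing the rates should show whether changing the scanner or changing the software makes the bigger difference.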

I hope you manage to find a solution.
Old 12-19-2009, 05:54 AM   #4
HarryT
eBook Enthusiast
Posts: 65,448
Karma: 43770933
Join Date: Nov 2006
Location: UK
Device: Kindle Voyage, iPad Mini, iPhone 4, MS Surface Pro, N7
Please continue this discussion in the original thread:

http://www.mobileread.com/forums/showthread.php?t=65993

This really does not belong in "News and Commentary".

We will close this thread.

Thank you.

All times are GMT -4.