02-17-2013, 12:19 PM   #46
Turtle91
Fanatic
Posts: 535
Karma: 2178910
Join Date: Dec 2012
Location: Bangkok, Thailand today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
Quote:
Originally Posted by BeccaPrice
With 1dollarscan, the PDF is OCR'd. What I did was save the PDF as a TXT file (I've got Acrobat Pro, so I can do that), and then had a file I could edit for scanning errors.
I was under the impression that Acrobat - even Pro - doesn't keep the formatting when you save to text. In which case you will not have any of the italics, bold, superscript, etc.

Assuming the PDF they give you is a perfect OCR of the original - you would still need to go back and manually format the entire book to make it like the original.

I did an experiment by creating a test page in Word with different formatting of sections of text. I then saved that document as a PDF. This provides a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could perform a find on any of the words in there. I then saved the PDF as text. Acrobat gives 2 options, Plain text and Accessible text - I did both. In both cases the text was correct but without ANY formatting.

If there is a different way of saving a PDF to text, I would be very interested to know how.

Sample OCR text.pdf
Sample OCR text - plain.txt
Sample OCR text - accessible.txt
__________________
Dion
"Gnihcnip" - the act of "reverse pinching" to expand/zoom. Pronounced "Niknip" (the "g" and "h" are silent).

"Live long and prosper." ~ Spock
"What's that goat doing up in the clouds?" ~ Pilot
02-17-2013, 12:33 PM   #47
rkomar
Wizard
Posts: 1,242
Karma: 2818722
Join Date: Oct 2010
Location: Vancouver, BC, Canada
Device: PRS-505, PB 902, PRS-T1
Quote:
Originally Posted by Turtle91
No. It will scan the text and OCR, but computers aren't smart enough to know what it MEANS...at least that I've seen.

If you wanted active hyperlinks in an ePub/html you would need to manually insert the links to the proper locations. If all you want is a digital copy you could save it in PDF.
I scanned my old textbooks and references to free up shelf space. Because they were so technical (equations, tables, code snippets,...), I couldn't really OCR them and turn them into epubs. So, I saved the cleaned up images as PDFs, and added a table of contents to each using pdfmarks (they create those bookmarks you see on the left in most PDF readers). It's a fair bit of work, but I think it's necessary for a reference to be usable. I didn't do it myself, but you can also OCR the text and add it as an invisible layer in the document to make it searchable. Of course, you can only search the text that OCRs perfectly.
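
For anyone who wants to try the bookmark approach, here is a minimal sketch of generating the pdfmark lines from a list of chapter titles and page numbers. The function names and the sample table of contents are made up for illustration, and it only handles flat (non-nested) bookmarks with plain ASCII titles.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PostScript string literal."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def make_pdfmarks(entries):
    """Turn (title, page_number) pairs into flat pdfmark bookmark lines."""
    lines = []
    for title, page in entries:
        lines.append(f"[/Title ({escape_pdf_string(title)}) /Page {page} /OUT pdfmark")
    return "\n".join(lines) + "\n"


# Hypothetical table of contents for a scanned textbook.
toc = [("Preface", 5), ("1. Vectors (review)", 11), ("2. Matrices", 47)]
print(make_pdfmarks(toc), end="")
```

The output can be saved to a file (say `toc.pdfmark`, a hypothetical name) and merged into the scanned PDF with Ghostscript, along the lines of `gs -o with-toc.pdf -sDEVICE=pdfwrite scanned.pdf toc.pdfmark`. Treat that command as a starting point and check it against your Ghostscript version.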
02-17-2013, 01:20 PM   #48
AnemicOak
Bookaholic
Posts: 7,398
Karma: 18771935
Join Date: Oct 2007
Location: Minnesota
Device: AuraHD, Nook HD+, Kindle 2,3,T , Opus, TF101, Nexus7, iPT, iPhone5
Quote:
Originally Posted by Turtle91
If there is a different way of saving a PDF to text, I would be very interested to know how.
With Acrobat Pro I usually save as HTML or RTF, which usually allows formatting like italics to be kept.
__________________
~Brian

"The test of any good fiction is that you should care something for the characters; the good to succeed, the bad to fail. The trouble with most fiction is that you want them all to land in hell together, as quickly as possible."

— Mark Twain
02-17-2013, 01:58 PM   #49
Turtle91
Fanatic
Posts: 535
Karma: 2178910
Join Date: Dec 2012
Location: Bangkok, Thailand today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
Quote:
Originally Posted by AnemicOak
With Acrobat Pro I usually save as HTML or RTF, which usually allows formatting like italics to be kept.
I hadn't used Acrobat's save to HTML in a while. Previous versions weren't very good - thus part of the "nightmare of converting". But I have a new(er) version (10) and tried it out.

It is fairly clean...better than before...and you are right it saves bold and italics...but it still has some issues. On this very simple test page there are several formatting discrepancies that would need to be fixed...not impossible with search and replace, but very time consuming. I would be hesitant to try anything more complex or longer than a simple page or two.
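
As a rough idea of what that search-and-replace cleanup can look like, here is a small sketch using regular expressions. The specific problems it fixes (italic or bold runs split into several adjacent tags, and tags wrapped around nothing but whitespace) are assumptions about typical export clutter, not a description of what Acrobat produced on the test page.

```python
import re


def tidy_inline_tags(html: str) -> str:
    """Merge runs like <i>foo</i> <i>bar</i> into <i>foo bar</i>."""
    for tag in ("i", "b"):
        # Drop a close/open pair of the same tag separated only by whitespace,
        # keeping that whitespace so words do not run together.
        html = re.sub(rf"</{tag}>(\s*)<{tag}>", r"\1", html)
        # Remove tags that wrap nothing but whitespace.
        html = re.sub(rf"<{tag}>(\s*)</{tag}>", r"\1", html)
    return html


# Hypothetical fragment of exported HTML.
sample = "<p>A <i>very</i> <i>simple</i> test<b> </b>page.</p>"
print(tidy_inline_tags(sample))
```

Regex passes like this only suit simple, predictable patterns. Anything involving nested or attribute-laden tags is safer with a real HTML parser.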

Thanks!

Sample OCR text.html
02-17-2013, 02:45 PM   #50
AnemicOak
Bookaholic
Posts: 7,398
Karma: 18771935
Join Date: Oct 2007
Location: Minnesota
Device: AuraHD, Nook HD+, Kindle 2,3,T , Opus, TF101, Nexus7, iPT, iPhone5
The output from Acrobat Pro can vary greatly depending on the original source and what tools were used to make the PDF (and possibly the PDF version). I hate converting PDF, but sometimes it's the only source and I usually find the HTML export to be the lesser of evils so to speak. On a few occasions I've gotten better HTML by importing the PDF into Mobipocket Creator.
02-17-2013, 08:47 PM   #51
Iznogood
Guru
Posts: 909
Karma: 15697153
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
I actually built myself one of these scanners - as a Christmas present for myself - and it works fine. As a student at the technical university in Norway I have access to a CNC router and could make the parts quite cheaply. With a hi-res camera, I can scan ~600dpi in color in about 4 seconds. It takes that long because I store both the captured JPG image and the raw data directly from the camera, plus I process the images in real time instead of post-processing. If I had taken only the compressed JPG image, it would have taken about a second (max two seconds) per dual-page.
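
To put those timings in perspective, here is a back-of-the-envelope sketch. It assumes the 4-second figure is per dual-page capture like the 1-second figure, uses a made-up 300-page book, and ignores the time spent turning pages.

```python
import math


def scan_minutes(pages: int, seconds_per_capture: float) -> float:
    """Minutes of capture time when each capture covers two facing pages."""
    captures = math.ceil(pages / 2)
    return captures * seconds_per_capture / 60


book_pages = 300  # hypothetical book length
for label, secs in [("JPG + raw, processed live", 4), ("JPG only", 1)]:
    print(f"{label}: {scan_minutes(book_pages, secs):.1f} min")
```

For that hypothetical book the capture time comes to 10.0 minutes at 4 seconds per capture and 2.5 minutes at 1 second.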

Right now, I'm getting acceptable results. I have some problems with reflections in the glass, but that seems to be unavoidable with this design. I am experimenting with techniques for removing reflections and glare, but have not been 100% successful.

Rotating, deskewing and OCRing can be done quite efficiently as post-processing by quite simple scripts. All in all, this is a very fine scanner for books. It can scan my old, valuable books without destroying them, and I can now read them (i.e. the digital copies) in bed while the originals slumber safely on my shelves.
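
The rotation step really is simple. Here is a toy sketch that treats a page as a nested list of pixel values. A real script would use an imaging library, and deskewing and OCR are left out because they need one. Which direction each page has to be turned depends on how the camera is mounted, so that part is an assumption.

```python
def rotate_cw(pixels):
    """Rotate a row-major grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def rotate_ccw(pixels):
    """Rotate a row-major grid of pixels 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*pixels)][::-1]


# Tiny 2x3 stand-in for a captured page image.
page = [[1, 2, 3],
        [4, 5, 6]]
print(rotate_cw(page))
print(rotate_ccw(page))
```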
__________________
Just to avoid confusion: I have changed my user name/screen name from Norway1456 to Iznogood. New name - same guy.

PS: The name should not be taken seriously
02-17-2013, 08:49 PM   #52
kevinp
Evangelist
Posts: 415
Karma: 3394968
Join Date: Jul 2004
Location: Orlando
Device: Nook HD+; Nook Tablet; Kindle 3; Kindle for Android on HTC EVO
Quote:
Originally Posted by Turtle91
I was under the impression that Acrobat - even Pro - doesn't keep the formatting when you save to text. In which case you will not have any of the italics, bold, superscript, etc.

Assuming the PDF they give you is a perfect OCR of the original - you would still need to go back and manually format the entire book to make it like the original.

I did an experiment by creating a test page in Word with different formatting of sections of text. I then saved that document as a PDF. This provides a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could perform a find on any of the words in there. I then saved the PDF as text. Acrobat gives 2 options, Plain text and Accessible text - I did both. In both cases the text was correct but without ANY formatting.

If there is a different way of saving a PDF to text, I would be very interested to know how.
The (so-called) OCR in Acrobat is mainly just so you can search for text in the PDF. It's not made to do what you are thinking.

That's why you need to use an actual OCR program. I use ABBYY. It will open a PDF and extract the pages as TIF files, then do its thing. And it works fairly well on stuff like paperbacks. It will capture bold, italics, etc.

If you want to OCR stuff like textbooks that contain lots of illustrations and such, I don't know of anything that works 100%.
02-18-2013, 07:47 AM   #53
Kirtai
Addict
Posts: 253
Karma: 1292106
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (ICS)
Quote:
Originally Posted by HarryT
Libraries? It's one thing to scan your own books - quite another to scan library books. That's just blatant copyright infringement.
Actually, you are allowed to copy parts of many kinds of reference documents at a library. For instance you're allowed to copy up to 10% of an official British Standard.
__________________
What with ebooks slowly murdering the print industry in its sleep, why not take the opportunity to mess around with the format in ways we never could before?
— Yahtzee in Extra Punctuation
02-19-2013, 12:18 AM   #54
danielreetz
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
Quote:
Originally Posted by Turtle91
Hey Daniel!
I was wondering when your ears would start twitching and you would notice this thread!

It's good to hear some actual numbers instead of relying on my old memory.

I hope I got most of the info right.

Cheers,
You sure did, thanks Turtle. I thought I recognized your username.
02-19-2013, 12:20 AM   #55
danielreetz
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
Quote:
Originally Posted by Iznogood
Right now, I'm getting acceptable results. I have some problems with reflections in the glass, but that seems to be unavoidable with this design. I am experimenting with techniques for removing reflections and glare, but have not been 100% successful.
I think we're talking about this on the DIY Book Scanner forums, but have you tried moving your lights further up from the glass? Glad to hear you were able to get your machine together!
All times are GMT -4.