Old 02-17-2013, 01:19 PM   #46
Turtle91
Guru
Posts: 669
Karma: 3807234
Join Date: Dec 2012
Location: Shannon, Ireland today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
Quote:
Originally Posted by BeccaPrice
With 1dollarscan, the PDF is OCR'd. What I did was save the PDF as a TXT file (I've got Acrobat Pro, so I can do that), and then had a file I could edit for scanning errors.

I was under the impression that Acrobat - even Pro - doesn't keep the formatting when you save to text. In which case you will not have any of the italics, bold, superscript, etc.

Assuming the PDF they give you is a perfect OCR of the original - you would still need to go back and manually format the entire book to make it like the original.

I did an experiment by creating a test page in Word with different formatting of sections of text. I then saved that document as a PDF. This provides a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could perform a find on any of the words in there. I then saved the PDF as text. Acrobat gives 2 options, Plain text and Accessible text - I did both. In both cases the text was correct but without ANY formatting.

If there is a different way of saving a PDF to text, I would be very interested to know how.

Sample OCR text.pdf
Sample OCR text - plain.txt
Sample OCR text - accessible.txt
Old 02-17-2013, 01:33 PM   #47
rkomar
Wizard
Posts: 1,743
Karma: 4368476
Join Date: Oct 2010
Location: Vancouver, BC, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623
Quote:
Originally Posted by Turtle91
No. It will scan the text and OCR, but computers aren't smart enough to know what it MEANS...at least that I've seen.

If you wanted active hyperlinks in an ePub/html you would need to manually insert the links to the proper locations. If all you want is a digital copy you could save it in PDF.

I scanned my old textbooks and references to free up shelf space. Because they were so technical (equations, tables, code snippets, ...), I couldn't really OCR them and turn them into epubs. So I saved the cleaned-up images as PDFs and added a table of contents to each using pdfmarks (they produce the bookmarks you see on the left in most PDF readers). It's a fair bit of work, but I think it's necessary for a reference to be usable. I didn't do it myself, but you can also OCR the text and add it as an invisible layer in the document to make it searchable. Of course, a search will only find the text that was OCR'd correctly.
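
The post does not say which tool was used to add the bookmarks, so the following is only a minimal sketch of one way to produce a pdfmark file from a table of contents. The `toc` entries, titles, and page numbers are made up for illustration, and the helper names (`ps_string`, `pdfmark_lines`) are hypothetical. It assumes ASCII titles; non-ASCII titles would need a different string encoding.

```python
def ps_string(text):
    """Wrap an ASCII title in a PostScript (...) string, escaping \\ ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def pdfmark_lines(entries):
    """Turn nested (title, page, children) tuples into /OUT pdfmark lines.

    A parent bookmark carries /Count with the number of its direct children
    and is immediately followed by those children.
    """
    lines = []
    for title, page, children in entries:
        count = f"/Count {len(children)} " if children else ""
        lines.append(f"[{count}/Title {ps_string(title)} /Page {page} /OUT pdfmark")
        lines.extend(pdfmark_lines(children))
    return lines


# Hypothetical table of contents: (title, page number in the PDF, sub-entries)
toc = [
    ("Preface", 1, []),
    ("Chapter 1 (Basics)", 5, [
        ("1.1 Units", 6, []),
        ("1.2 Notation", 9, []),
    ]),
]

pdfmark_text = "\n".join(pdfmark_lines(toc)) + "\n"
print(pdfmark_text)
```

Saved to a file such as `toc.pdfmark`, the output can be merged into a scanned PDF with Ghostscript, for example `gs -o out.pdf -sDEVICE=pdfwrite in.pdf toc.pdfmark`. Treat that command as a starting point and check the documentation for your Ghostscript version.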
Old 02-17-2013, 02:20 PM   #48
AnemicOak
Bookaholic
Posts: 10,429
Karma: 28936355
Join Date: Oct 2007
Location: Minnesota
Device: HDX 8.9, AuraHD, Nook HD+, Kindle 2,3,T , Opus, Nexus7, iPhone5, etc
Quote:
Originally Posted by Turtle91
If there is a different way of saving a PDF to text, I would be very interested to know how.

With Acrobat Pro I usually save as HTML or RTF, which usually allows formatting like italics to be kept.
Old 02-17-2013, 02:58 PM   #49
Turtle91
Guru
Posts: 669
Karma: 3807234
Join Date: Dec 2012
Location: Shannon, Ireland today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
Quote:
Originally Posted by AnemicOak
With Acrobat Pro I usually save as HTML or RTF, which usually allows formatting like italics to be kept.

I hadn't used Acrobat's save to HTML in a while. Previous versions weren't very good - thus part of the "nightmare of converting". But I have a newer version (10) and tried it out.

It is fairly clean, better than before, and you are right that it saves bold and italics, but it still has some issues. Even on this very simple test page there are several formatting discrepancies that would need to be fixed. That's not impossible with search and replace, but it is very time-consuming. I would be hesitant to try anything more complex or longer than a simple page or two.

Thanks!

Sample OCR text.html
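
The post does not list the actual discrepancies in the exported file, so this is only a sketch of the kind of search-and-replace cleanup it mentions, applied to two assumed problems: a formatted run split into adjacent identical tags, and empty tag pairs. The function name `tidy_inline_tags` is hypothetical, and it only handles bare tags without attributes.

```python
import re


def tidy_inline_tags(html):
    """Merge split inline runs and drop empty inline elements.

    Assumes plain tags such as <i> and <b> with no attributes; anything
    else (e.g. styled <span> elements) is left untouched.
    """
    for tag in ("i", "b", "em", "strong"):
        # "<i>The</i> <i>Hobbit</i>" -> "<i>The Hobbit</i>"
        html = re.sub(rf"</{tag}>(\s*)<{tag}>", r"\1", html)
        # "<b> </b>" -> " " (keep the whitespace, drop the empty element)
        html = re.sub(rf"<{tag}>(\s*)</{tag}>", r"\1", html)
    return html


print(tidy_inline_tags("<p><i>The</i> <i>Hobbit</i> is <b></b>old.</p>"))
```

Running a few targeted patterns like these over the whole file is usually quicker than fixing each occurrence by hand, but the result still needs proofreading against the original pages.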
Old 02-17-2013, 03:45 PM   #50
AnemicOak
Bookaholic
Posts: 10,429
Karma: 28936355
Join Date: Oct 2007
Location: Minnesota
Device: HDX 8.9, AuraHD, Nook HD+, Kindle 2,3,T , Opus, Nexus7, iPhone5, etc
The output from Acrobat Pro can vary greatly depending on the original source and what tools were used to make the PDF (and possibly the PDF version). I hate converting PDF, but sometimes it's the only source, and I usually find the HTML export to be the lesser evil, so to speak. On a few occasions I've gotten better HTML by importing the PDF into Mobipocket Creator.
Old 02-17-2013, 09:47 PM   #51
Iznogood
Guru
Posts: 929
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
I actually built one of these scanners myself, as a Christmas present to myself, and it works fine. As a student at the technical university in Norway I have access to a CNC router and could make the parts quite cheaply. With a hi-res camera, I can scan at ~600 dpi in color in about 4 seconds. It takes that long because I store both the captured JPG image and the raw data directly from the camera, plus I process the images in real time instead of post-processing. If I had captured only the compressed JPG image, it would have taken about a second (two at most) per dual-page.
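
To put those capture times in perspective, here is a small back-of-the-envelope calculation. It assumes each capture covers two facing pages (the "dual-page" above) and ignores the time needed to turn pages, which the post does not give. The function names and the 300-page book are illustrative only.

```python
import math


def pages_per_hour(seconds_per_capture, pages_per_capture=2):
    """Upper bound on throughput, ignoring page-turning time."""
    return 3600 / seconds_per_capture * pages_per_capture


def capture_minutes(page_count, seconds_per_capture, pages_per_capture=2):
    """Camera time alone for a whole book, in minutes."""
    captures = math.ceil(page_count / pages_per_capture)
    return captures * seconds_per_capture / 60


print(pages_per_hour(4))        # raw + JPG, about 4 s per capture
print(pages_per_hour(1))        # JPG only, about 1 s per capture
print(capture_minutes(300, 4))  # hypothetical 300-page book
```

Under these assumptions the 4-second setup tops out at 1800 pages per hour, and the camera time for a 300-page book is 10 minutes, so page turning is likely to matter at least as much as capture speed.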

Right now, I'm getting acceptable results. I have some problems with reflections in the glass, but that seems to be unavoidable with this design. I am experimenting with techniques for removing reflections and glare, but have not been 100% successful.

Rotating, deskewing and OCRing can be done quite efficiently as post-processing with quite simple scripts. All in all, this is a very fine scanner for books. It can scan my old, valuable books without destroying them, and I can now read them (i.e. the digital copies) in bed while the originals slumber safely on my shelves.
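
The post does not show the scripts themselves. As an illustration of why deskewing can be simple, here is a sketch of one common approach, the projection-profile method: try a range of angles and keep the one that packs the dark pixels into the sharpest horizontal rows. It is pure Python and runs on synthetic points rather than a real scan; `estimate_skew` and `synthetic_page` are hypothetical names, and a real script would work on pixels read with an imaging library.

```python
import math


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) that gives the sharpest row profile.

    points: (x, y) coordinates of dark pixels. For each candidate angle the
    points are rotated back by that angle and counted per row; text lines
    that become horizontal concentrate into few rows and maximise the score.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        s = math.sin(math.radians(angle))
        c = math.cos(math.radians(angle))
        profile = {}
        for x, y in points:
            row = round(y * c - x * s)  # y coordinate after rotating by -angle
            profile[row] = profile.get(row, 0) + 1
        score = sum(n * n for n in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_page(skew_degrees, lines=10, width=200, spacing=20):
    """Fake 'text': horizontal rows of points, rotated by skew_degrees."""
    s = math.sin(math.radians(skew_degrees))
    c = math.cos(math.radians(skew_degrees))
    return [(x * c - y * s, x * s + y * c)
            for y in range(0, lines * spacing, spacing)
            for x in range(width)]


print(estimate_skew(synthetic_page(3.0)))
```

Once the angle is known, the page image is rotated by its negative before OCR. The search range and step size here are arbitrary choices for the example.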
Old 02-17-2013, 09:49 PM   #52
kevinp
Fanatic
Posts: 553
Karma: 3549018
Join Date: Jul 2004
Location: Orlando
Device: Nook HD+; Kindle for Android on HTC One SV
Quote:
Originally Posted by Turtle91
I was under the impression that Acrobat - even Pro - doesn't keep the formatting when you save to text. In which case you will not have any of the italics, bold, superscript, etc.

Assuming the PDF they give you is a perfect OCR of the original - you would still need to go back and manually format the entire book to make it like the original.

I did an experiment by creating a test page in Word with different formatting of sections of text. I then saved that document as a PDF. This provides a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could perform a find on any of the words in there. I then saved the PDF as text. Acrobat gives 2 options, Plain text and Accessible text - I did both. In both cases the text was correct but without ANY formatting.

If there is a different way of saving a PDF to text, I would be very interested to know how.

The (so-called) OCR in Acrobat is mainly just so you can search for text in the PDF. It's not made to do what you are thinking.

That's why you need to use an actual OCR program. I use ABBYY. It will open a PDF and extract the pages as TIF files, then do its thing. It works fairly well on stuff like paperbacks, and it will capture bold, italics, etc.

If you want to OCR stuff like textbooks that contain lots of illustrations and such, I don't know of anything that works 100%.
Old 02-18-2013, 08:47 AM   #53
Kirtai
Addict
Posts: 301
Karma: 2454436
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (JB), Google Nexus 7 (2013)
Quote:
Originally Posted by HarryT
Libraries? It's one thing to scan your own books - quite another to scan library books. That's just blatant copyright infringement.

Actually, you are allowed to copy parts of many kinds of reference documents at a library. For instance, you're allowed to copy up to 10% of an official British Standard.
Old 02-19-2013, 01:18 AM   #54
danielreetz
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
Quote:
Originally Posted by Turtle91
Hey Daniel!
I was wondering when your ears would start twitching and you would notice this thread!

It's good to hear some actual numbers instead of relying on my old memory.

I hope I got most of the info right.

Cheers,

You sure did, thanks Turtle. I thought I recognized your username.
Old 02-19-2013, 01:20 AM   #55
danielreetz
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
Quote:
Originally Posted by Iznogood
Right now, I'm getting acceptable results. I have some problems with reflections in the glass, but that seems to be unavoidable with this design. I am experimenting with techniques for removing reflections and glare, but have not been 100% successful.

I think we're talking about this on the DIY Book Scanner forums, but have you tried moving your lights further up from the glass? Glad to hear you were able to get your machine together!