02-17-2013, 12:19 PM | #46 | |
A Hairy Wizard
Posts: 3,069
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Quote:
Assuming the PDF they give you is a perfect OCR of the original, you would still need to go back and manually format the entire book to make it look like the original. I did an experiment: I created a test page in Word with differently formatted sections of text, then saved that document as a PDF. This gives a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could perform a find on any of the words in it. I then saved the PDF as text. Acrobat gives 2 options, Plain text and Accessible text, and I tried both. In both cases the text was correct but without ANY formatting. If there is a different way of saving a PDF to text, I would be very interested to know how.
Attachments: Sample OCR text.pdf, Sample OCR text - plain.txt, Sample OCR text - accessible.txt
|
02-17-2013, 12:33 PM | #47 | |
Wizard
Posts: 2,977
Karma: 18343081
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
|
Quote:
|
|
02-17-2013, 01:20 PM | #48 |
Bookaholic
Posts: 14,391
Karma: 54969924
Join Date: Oct 2007
Location: Minnesota
Device: iPad Mini 4, AuraHD, iPhone XR +
|
|
02-17-2013, 01:58 PM | #49 | |
A Hairy Wizard
Posts: 3,069
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Quote:
It is fairly clean...better than before...and you are right, it saves bold and italics...but it still has some issues. On this very simple test page there are several formatting discrepancies that would need to be fixed...not impossible with search and replace, but very time-consuming. I would be hesitant to try anything more complex or longer than a simple page or two. Thanks!
Attachment: Sample OCR text.html
|
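For readers wondering what a search-and-replace cleanup of exported HTML might look like, here is a minimal sketch in Python. The post does not say which discrepancies showed up, so this assumes one kind that is common in PDF-to-HTML exports: a bold or italic run that gets closed and immediately reopened between words. The function name `merge_split_runs` is made up for illustration and is not part of Acrobat or any other tool.

```python
import re

# Assumption: the export splits one formatted run into several, e.g.
# "<b>Chapter</b> <b>One</b>". This pattern finds a closing tag followed
# (after optional whitespace) by the same tag opening again.
SPLIT_RUN = re.compile(r"</(b|i|em|strong)>(\s*)<\1>", re.IGNORECASE)

def merge_split_runs(html):
    """Join adjacent runs of the same inline tag, keeping the whitespace."""
    return SPLIT_RUN.sub(r"\2", html)

sample = "<p><b>Chapter</b> <b>One</b> begins <i>right</i><i>here</i>.</p>"
print(merge_split_runs(sample))
```

Tags with attributes, and any discrepancy of a different kind, are left untouched, which is part of why this sort of cleanup gets time-consuming on longer documents.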
02-17-2013, 02:45 PM | #50 |
Bookaholic
Posts: 14,391
Karma: 54969924
Join Date: Oct 2007
Location: Minnesota
Device: iPad Mini 4, AuraHD, iPhone XR +
|
The output from Acrobat Pro can vary greatly depending on the original source and what tools were used to make the PDF (and possibly the PDF version). I hate converting PDFs, but sometimes a PDF is the only source, and I usually find the HTML export to be the lesser of the evils, so to speak. On a few occasions I've gotten better HTML by importing the PDF into Mobipocket Creator.
|
02-17-2013, 08:47 PM | #51 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
I actually have built myself one of these scanners, as a Christmas present for myself, and it works fine. As a student at the technical university in Norway I have access to a CNC router and could make the parts quite cheaply. With a hi-res camera, I can scan at ~600 dpi in color in about 4 seconds. It takes that long because I store both the captured JPG image and the raw data directly from the camera, plus I process the images in real time instead of post-processing. If I had taken only the compressed JPG image, it would have taken about a second (max two seconds) per dual-page.
Right now, I'm getting acceptable results. I have some problems with reflections in the glass, but that seems to be unavoidable with this design. I am experimenting with techniques for removing reflections and glare, but have not been 100% successful. Rotating, deskewing and OCRing can be done quite efficiently as post-processing by quite simple scripts. All in all, this is a very fine scanner for books. It can scan my old, valuable books without destroying them, and I can now read them (i.e. the digital copies) in bed while the originals slumber safely on my shelves.
|
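To give an idea of how simple a deskewing script can be, here is a rough Python sketch of one common approach, the projection-profile method. This is not the poster's script. It assumes the scan has already been thresholded to a grid of 0/1 pixels, and it approximates a small rotation by a vertical shear. The names `alignment_score`, `estimate_correction_angle` and `synthetic_page` are invented for this example.

```python
import math

def alignment_score(pixels, angle_deg):
    """Shear dark pixels vertically by angle_deg and score how sharply
    they pile up into rows (sum of squared row totals)."""
    height = len(pixels)
    slope = math.tan(math.radians(angle_deg))
    totals = [0] * height
    for y, row in enumerate(pixels):
        for x, dark in enumerate(row):
            if dark:
                target = y + round(x * slope)
                if 0 <= target < height:
                    totals[target] += 1
    return sum(t * t for t in totals)

def estimate_correction_angle(pixels, max_angle=5.0, step=0.5):
    """Try angles from -max_angle to +max_angle and return the shear
    angle (in degrees) that lines the text rows up best."""
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda a: alignment_score(pixels, a))

def synthetic_page(width=200, height=100, skew_deg=2.0):
    """Test data only: four one-pixel 'text lines' drawn with a known skew."""
    slope = math.tan(math.radians(skew_deg))
    page = [[0] * width for _ in range(height)]
    for baseline in (20, 40, 60, 80):
        for x in range(width):
            page[baseline + round(x * slope)][x] = 1
    return page

page = synthetic_page(skew_deg=2.0)
print(estimate_correction_angle(page))
```

The returned value is the correction to apply, so it has the opposite sign of the skew drawn into the test page. On a real 600 dpi capture you would want to downscale first, since the plain loops here are slow on full-size images.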
02-17-2013, 08:49 PM | #52 | |
Fanatic
Posts: 579
Karma: 3549018
Join Date: Jul 2004
Location: Michigan
Device: Kindle Scribe, Kindle PW (10th & 11th gen); Fire HD 10
|
Quote:
That's why you need to use an actual OCR program. I use Abbyy. It will open a PDF and extract the pages as TIF files, then do its thing. It works fairly well on stuff like paperbacks, and it will capture bold, italics, etc. If you want to OCR stuff like textbooks that contain lots of illustrations and such, I don't know of anything that works 100%.
|
02-18-2013, 07:47 AM | #53 |
Addict
Posts: 304
Karma: 2454436
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (JB), Google Nexus 7 (2013)
|
Actually, you are allowed to copy parts of many kinds of reference documents at a library. For instance, you're allowed to copy up to 10% of an official British Standard.
|
02-19-2013, 12:18 AM | #54 |
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
|
You sure did, thanks Turtle. I thought I recognized your username.
|
02-19-2013, 12:20 AM | #55 |
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
|
I think we're talking about this on the DIY Book Scanner forums, but have you tried moving your lights further up from the glass? Glad to hear you were able to get your machine together!
|