|
|
#46 | |
|
Fanatic
Posts: 535
Karma: 2178910
Join Date: Dec 2012
Location: Bangkok, Thailand today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
|
Quote:
Assuming the PDF they give you is a perfect OCR of the original - you would still need to go back and manually format the entire book to make it like the original.

I did an experiment: I created a test page in Word with differently formatted sections of text, then saved that document as a PDF. This provides a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could run a find on any of the words in it. I then saved the PDF as text. Acrobat gives two options, Plain text and Accessible text - I tried both. In both cases the text was correct but without ANY formatting.

If there is a different way of saving a PDF to text, I would be very interested to know how.

Attachments: Sample OCR text.pdf, Sample OCR text - plain.txt, Sample OCR text - accessible.txt
__________________
Dion "Gnihcnip" - the act of "reverse pinching" to expand/zoom. Pronounced "Niknip" (the "g" and "h" are silent). "Live long and prosper." ~ Spock "What's that goat doing up in the clouds? " ~ Pilot
|
|
|
|
|
|
|
#47 | |
|
Wizard
Posts: 1,242
Karma: 2818722
Join Date: Oct 2010
Location: Vancouver, BC, Canada
Device: PRS-505, PB 902, PRS-T1
|
Quote:
|
|
|
|
|
|
Enthusiast
|
|
|
|
#48 |
|
Bookaholic
Posts: 7,398
Karma: 18771935
Join Date: Oct 2007
Location: Minnesota
Device: AuraHD, Nook HD+, Kindle 2,3,T , Opus, TF101, Nexus7, iPT, iPhone5
|
With Acrobat Pro I usually save as HTML or RTF, which generally keeps formatting like italics.
__________________
~Brian "The test of any good fiction is that you should care something for the characters; the good to succeed, the bad to fail. The trouble with most fiction is that you want them all to land in hell together, as quickly as possible." — Mark Twain |
|
|
|
|
|
#49 | |
|
Fanatic
Posts: 535
Karma: 2178910
Join Date: Dec 2012
Location: Bangkok, Thailand today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
|
Quote:
It is fairly clean...better than before...and you are right, it saves bold and italics...but it still has some issues. On this very simple test page there are several formatting discrepancies that would need to be fixed...not impossible with search and replace, but very time-consuming. I would be hesitant to try anything more complex or longer than a simple page or two. Thanks!

Attachment: Sample OCR text.html
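For anyone wondering what "search and replace" might look like in practice, here is a minimal sketch in Python. The post does not say which discrepancies showed up in the attached file, so the three patterns below (emphasis split into adjacent tags, empty tags, doubled spaces) are assumptions about typical export noise, and `tidy_exported_html` is a made-up name, not part of Acrobat or any library.

```python
import re


def tidy_exported_html(html):
    """Apply a few search-and-replace passes to exported HTML.

    The patterns are illustrative assumptions, not a list of what
    Acrobat actually produces for a given PDF.
    """
    # Merge emphasis the exporter split in two:
    # "<i>foo</i> <i>bar</i>" becomes "<i>foo bar</i>".
    html = re.sub(r"</(i|b)>(\s*)<\1>", r"\2", html)
    # Drop emphasis tags with nothing (or only whitespace) inside.
    html = re.sub(r"<(i|b)>(\s*)</\1>", r"\2", html)
    # Collapse runs of spaces and tabs into a single space.
    html = re.sub(r"[ \t]{2,}", " ", html)
    return html


if __name__ == "__main__":
    sample = "<p><i>Moby</i> <i>Dick</i>  was <b></b>long.</p>"
    print(tidy_exported_html(sample))
```

Each pass is cheap on its own; the time goes into finding out which patterns a particular export needs, which is why this gets tedious on anything longer than a few pages.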
|
|
|
|
|
|
|
#50 |
|
Bookaholic
Posts: 7,398
Karma: 18771935
Join Date: Oct 2007
Location: Minnesota
Device: AuraHD, Nook HD+, Kindle 2,3,T , Opus, TF101, Nexus7, iPT, iPhone5
|
The output from Acrobat Pro can vary greatly depending on the original source and what tools were used to make the PDF (and possibly the PDF version). I hate converting PDFs, but sometimes a PDF is the only source, and I usually find the HTML export to be the lesser of the evils, so to speak. On a few occasions I've gotten better HTML by importing the PDF into Mobipocket Creator.
|
|
|
|
|
#51 |
|
Guru
Posts: 909
Karma: 15697153
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
I have actually built myself one of these scanners - as a Christmas present for myself - and it works fine. As a student at the technical university in Norway I have access to a CNC router and could make the parts quite cheaply. With a hi-res camera, I can scan at ~600 dpi in color in about 4 seconds. It takes that long because I store both the captured JPG image and the raw data directly from the camera, plus I process the images in real time instead of post-processing. If I had taken only the compressed JPG image, it would have taken about a second (max two seconds) per dual-page.
Right now, I'm getting acceptable results. I have some problems with reflections in the glass, but that seems to be unavoidable with this design. I am experimenting with techniques for removing reflections and glare, but have not been 100% successful. Rotating, deskewing and OCRing can be done quite efficiently as post-processing with fairly simple scripts. All in all, this is a very fine scanner for books. It can scan my old, valuable books without destroying them, and I can now read them (i.e. the digital copies) in bed while the originals slumber safely on my shelves.
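To give an idea of how small a deskewing script can be, here is a sketch of one common approach, the projection-profile method, in plain Python. This is not the script used on the scanner described above; the function names, the angle range and the synthetic test page are all assumptions made for illustration. A real pipeline would read pixels with an imaging library and hand the straightened page to an OCR engine.

```python
import math


def rotate_points(points, angle_deg):
    """Rotate (x, y) points about the origin by angle_deg."""
    t = math.radians(angle_deg)
    s, c = math.sin(t), math.cos(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def estimate_skew_degrees(dark_pixels, max_angle=5.0, step=0.25):
    """Estimate page skew from the coordinates of dark pixels.

    For each candidate angle the points are rotated back by that
    angle and counted per row. Level text lines pile up in few rows,
    so the angle with the largest sum of squared row counts wins.
    """
    best_angle, best_score = 0.0, -1
    for i in range(int(round(2 * max_angle / step)) + 1):
        angle = -max_angle + i * step
        counts = {}
        for _, y in rotate_points(dark_pixels, -angle):
            row = round(y)
            counts[row] = counts.get(row, 0) + 1
        score = sum(n * n for n in counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_page(skew_deg, lines=10, line_gap=40, width=400):
    """Hypothetical test data: solid horizontal lines, then skewed."""
    level = [(x, k * line_gap) for k in range(lines) for x in range(width)]
    return rotate_points(level, skew_deg)


if __name__ == "__main__":
    page = synthetic_page(2.0)
    skew = estimate_skew_degrees(page)
    straightened = rotate_points(page, -skew)  # undo the estimated skew
    print(skew)
```

The search is brute force over a small angle range, which is usually enough for pages that sit nearly square under the glass; larger rotations (90 or 180 degrees) would need a separate check.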
__________________
Just to avoid confusion: I have changed my user name/screen name from Norway1456 to Iznogood. New name - same guy. PS: The name should not be taken seriously |
|
|
|
|
|
#52 | |
|
Evangelist
Posts: 415
Karma: 3394968
Join Date: Jul 2004
Location: Orlando
Device: Nook HD+; Nook Tablet; Kindle 3; Kindle for Android on HTC EVO
|
Quote:
That's why you need to use an actual OCR program. I use Abbyy. It will open a PDF and extract the pages as TIF files, then do its thing. And it works fairly well on stuff like paperbacks. It will capture bold, italics, etc. If you want to OCR stuff like textbooks that contain lots of illustrations and such, I don't know of anything that works 100%.
|
|
|
|
|
|
#53 |
|
Addict
Posts: 253
Karma: 1292106
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (ICS)
|
Actually, you are allowed to copy parts of many kinds of reference documents at a library. For instance you're allowed to copy up to 10% of an official British Standard.
__________________
What with ebooks slowly murdering the print industry in its sleep, why not take the opportunity to mess around with the format in ways we never could before? — Yahtzee in Extra Punctuation
|
|
|
|
|
#54 | |
|
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
|
Quote:
|
|
|
|
|
|
|
#55 |
|
Junior Member
Posts: 8
Karma: 5034
Join Date: Apr 2009
Device: none
|
I think we're talking about this on the DIY Book Scanner forums, but have you tried moving your lights further up from the glass? Glad to hear you were able to get your machine together!
|
|
|
|