|04-30-2012, 04:08 AM||#1|
Join Date: Feb 2011
Device: Sony PRS-505
Further thoughts on scanning
It was Bob Russell's fine piece on the Optic Book 3600+ which got me started. Now I am trying to carry the flag.
Forgive me, but I don't have Bob's computing skills to present this in the same style as he.
I have a collection of pbooks which I decided to digitise (or is that digitalise?) as an aid to exercise my brain on the path to old age. One gets fed up with crosswords and Sudoku.
I bought my 3600+ last year and to date have produced 38 ebooks, as ePub.
I have become unhappy with my output, as their appearance on my reader (PRS505) is not yet as good as the store-bought ebooks I have. (You have to have standards)
For some reason I wanted to justify my expenditure and thought I should produce 100 books each year and set about it. Now I have revised that aim to 60 quality books per year.
To achieve that quality I must learn more skills in the programs I am using to edit and finish my books. As I have no experience in the word-smithing professions, WordPad is as much as I ever needed. I will be producing a plea for help after finishing this.
I unpacked and set up the 3600 (why does that remind me of a Harry Potter movie?) according to Bob's instructions. The only difference, in my experience, was that the carriage lock/unlock was glaringly obvious. A spring-loaded peg within the base of the device IS the lock.
Peg in - unlocked, Peg out - locked. A slide on the base of the device can fix the lock on or fix the lock off.
Leave the peg to move as it may, the device will be unlocked when placed on a flat surface and the moving parts locked when it is lifted.
Sitting, staring at the scanner, I thought of all the information I had gathered while dredging the Forums to develop a plan of work.
My first thought was:- WHAT IS A BOOK?
My books are novels only! No pdfs, no images, no tables: the sort of book one picks up when passing through an airport and which is good enough to keep and read again and again. No text books, I am beyond studying. (If I want to know something important, I can ask my wife - she knows everything!)
What will be the end product?:- I decided on ePub, with Calibre one size fits all.
What approach to use?:- K.I.S.S., R.T.F.M., Practice, Practice, Practice.
What text editing program to use?:- Open Office is the best I can afford.
What shall I do next?:- Get stuck in!!
Phase 2 Scanning:-
Set up DigiBook according to Bob. Selected grey scale, page image, rotate on even numbers, all of that, booktitle.
I chose a large hardback novel of 450+ pages (for an auspicious beginning). Present page 1, press button, lift and turn book, present page 2 (it is upside down on the platen), press button, zip, it appears on screen right way up. Great.
Carry on doing this for two and a half hours - easy? Not really.
The blurb says 'No Spine Shadow, just lay it on the platen'. Not exactly true! One must hold the spine quite firmly, and it becomes very tiring. I should have chosen a much smaller book to start with.
The worst type of book to scan is the omnibus type edition, very thick, tightly bound, narrow spine side margins. This needs a lot of push and shove. Relax and it will spring out!
Nevertheless I achieved the desired result.
Next step:- Click Transfer button.
The default Page Image is BMP; on anecdotal evidence I have chosen TIF as the means to carry on with the process.
DigiBook now converts BMP to TIF so that the OCR can take place using SprintExpress.
A small window opens, flashing to show progress of transfer. Halfway through, a Windows declaration indicates that there is trouble and DigiBook must close, which it does.
No intermediate 'save' steps!!!
All of this 2 1/2 hours of effort is held in RAM. Pfft, it is gone.
Good Heavens I say, Heck I say, or words to that effect. Disappointment reigns. I find out, the hard way, that DigiBook is very good but a bit flaky.
One thought I had was that my computer was not robust enough. It is quite old, has only 1Gig of RAM which is cluttered up with - well - clutter. (Task Manager shows 380MB usage at idle)
Can 1Gig RAM hold 450+ Page Images in BMP and convert them to TIF Page Images?
As a result of this I now scan no more than 50 pages at a time and if I have been scanning photos prior to book scans, I start those with 20, 30, 50, 50... etc and get good results from this.
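A rough sum suggests why the big batch may have failed. The sketch below is a back-of-envelope estimate only: the 300 DPI resolution, the 6 x 9 inch page and the 8-bit grey scale depth are all assumptions, not figures taken from DigiBook, and it only counts the raw page images.

```python
# Rough estimate of the memory needed to hold uncompressed grey scale page scans.
# ASSUMPTIONS (not from the post): 300 DPI, a 6 x 9 inch page, 1 byte per pixel.

def page_bytes(width_in, height_in, dpi, bytes_per_pixel=1):
    """Size of one uncompressed page image in bytes."""
    return int(width_in * dpi) * int(height_in * dpi) * bytes_per_pixel

def batch_megabytes(pages, width_in=6.0, height_in=9.0, dpi=300):
    """Total size of a batch of page images in MB (1 MB = 1024 * 1024 bytes)."""
    return pages * page_bytes(width_in, height_in, dpi) / (1024 * 1024)

if __name__ == "__main__":
    for pages in (50, 450):
        print(f"{pages} pages: about {batch_megabytes(pages):.0f} MB")
```

Under those assumptions, 450 uncompressed pages come to roughly 2 GB, well beyond 1 Gig of RAM, while a 50-page batch is a little over 230 MB.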
I find that I can average 200 pages per hour easily, I don't have to rush. I don't have to make it a chore.
The chores come later!
As that conversion is completed, another window opens to show the OCR progress.
This has a countdown, with which I can check that I have scanned all of the required pages.
I have a tendency to start reading the pages as they are shown on screen and often miss a turn or scan twice (I haven't read some of these books in years)
The OCR converts to WordPad RTF only - as a file BookTitle 0001, 0002, 0003 etc. 50 pages per file. In a folder one has previously chosen.
I always check each 50 in WordPad (it only takes a minute) so that I can correct before I carry on.
Some thoughts on this phase:-
DigiBook is the management program for scanning. One ends up with an RTF WordPad file.
Page Management Is By DigiBook
Abbyy Fine Reader is used as part of the process by DigiBook
It is a fait accompli.
One does text editing with a word processor of one's own choosing - NOT Abbyy Fine Reader.
All in all this is a very good and simple program for producing ebooks.
When I have completed all my books I can go back to store bought or rummage through second hand book stores to find 'out of print' stuff.
Phase 3 Compile the book
I always start with WordPad - select 0001, change title to the book title, numbering has been checked - check again (measure twice, cut once).
Reduce screen to half width. Alongside, open WordPad again with 0002. 'Select all', 'copy' and 'paste' to the bottom of Book Title, save and repeat with 0003, 0004 etc until complete.
I then gather all loose 000x's into a folder which can be dumped when the ebook is complete and back-up copies made.
Open Book title and tidy up. I use WordPad for this because of its simplicity, there is nothing extraneous.
Scroll through the book removing headers and footers, usually just page numbers, join top of page to bottom of previous page.
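For anyone who would rather script this tidy-up than scroll, here is a minimal sketch in Python. It assumes each page has already been saved as plain text (not RTF) and that the only header or footer is a bare page number on the first or last line of the page; the function names are made up for illustration, and the rule for where paragraphs break is a crude guess.

```python
import re

# A line holding nothing but a page number, optionally wrapped in dashes ("- 12 -").
PAGE_NUMBER = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")

def strip_blank_edges(lines):
    """Remove blank lines from the start and end of a list of lines."""
    while lines and not lines[0].strip():
        lines.pop(0)
    while lines and not lines[-1].strip():
        lines.pop()
    return lines

def clean_page(text):
    """Drop a page number from the top and/or bottom of one page of text."""
    lines = strip_blank_edges(text.splitlines())
    if lines and PAGE_NUMBER.match(lines[0]):
        lines.pop(0)
    if lines and PAGE_NUMBER.match(lines[-1]):
        lines.pop()
    return "\n".join(strip_blank_edges(lines))

def join_pages(pages):
    """Join pages: run a broken sentence on, otherwise start a new line."""
    book = ""
    for page in pages:
        body = clean_page(page)
        if not body:
            continue
        if not book:
            book = body
        elif book.endswith((".", "!", "?", '"', "'")):
            book += "\n" + body   # previous page ended a sentence
        else:
            book += " " + body    # sentence continues across the page break
    return book

if __name__ == "__main__":
    pages = ["12\n\nHe opened the door and\n", "stepped outside.\n\n13"]
    print(join_pages(pages))
```

A page that happens to end on a full stop in mid-paragraph will still get a line break here, so the result needs the same read-through as the manual method.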
I have read on Forum that people use macros to do all of this. Well I know nothing of macros, I wouldn't recognise a macro if it bit me on the backside, but that is my cross to bear. As I said earlier I just get on with it.
Many of my books are from the 1940s and 1950s and so have very poor quality paper and ink, together with odd and crudely sized fonts, which, plus age give OCR a very hard time.
To correct these problems it is necessary to fire up Open Office to change the font size and set the line spacing to 'single' to suit that font.
Phase 4 The Editing
Now comes the chore!
I don't think that I can present a straightforward timeline for the editing.
This is where my lack of experience of editing shows up.
I edited and learned how to do it at the same time. A mishmash of trial and error.
I fired up Open Office and opened Book Title, in ODT. Good grief look at all those wriggly red lines. Phew!
A careful perusal of the problems will show which one must correct and which one can disregard.
Many errors are those which OCR has mis-spelt, such as di for th. Thus die/the, dian/than.
Most of these will be marked and obvious (unless the error produces a properly spelt word!!).
Others will be neither marked nor obvious, such as Mr Home/Mr Horne, and will only be found during proof reading.
There will be many, maybe dozens or hundreds, of the same error. These can be corrected by using 'Find and replace'. One must ensure that this is used in conjunction with 'Whole words only' and 'Match case'.
Otherwise you will produce an equal number of errors which will not be high-lighted and must be searched for individually.
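The effect of 'Whole words only' and 'Match case' can be shown with a small Python sketch. This is not how Open Office does it internally; it is only an illustration using Python's standard `re` module, and the function name is made up.

```python
import re

def replace_whole_word(text, wrong, right):
    """Replace `wrong` only where it stands alone as a word, matching case exactly."""
    # \b marks a word boundary, so "dian" inside "Indian" is left alone.
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")
    return pattern.sub(lambda match: right, text)

if __name__ == "__main__":
    line = "The Indian was older dian me."
    print(replace_whole_word(line, "dian", "than"))
    # A plain replace shows the damage done without whole-word matching:
    print(line.replace("dian", "than"))
```

The first call corrects only the stray 'dian'; the plain replace also turns 'Indian' into 'Inthan', which no spell checker setting will hand back to you. Note that an error like die/the still needs a human eye, because 'die' is a proper word in its own right.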
I believe that many errors which are blamed on the OCR program are really due to the quality of the source book. The difference of result between my old books and my newer books is huge, which simplifies the editing processes.
I hope no one will take the idea that I am 'teaching the Vicar to suck eggs', but there appear to be many others who, like me, are starting out on this road to ebooks. I wouldn't want them to make the same errors as I.
Open Office has an American/English dictionary as default, I have English/English books and live in an Australian/English world. The dictionary is the minimum of staple words and all other words not in the dictionary are marked as wrong.
I call up English/UK dictionary but it flips back to default.
A few of my books are reprints of American authors of the 30s, such as Damon Runyon, James Thurber, Ogden Nash, Anita Loos and many more. Many of the words used in the 30s don't appear in modern American dictionary nor do those in an English/English book.
A short story by Milt Gross written in the dialect of New York's East Side (Noo Yoik aw'reddy) would have only about 40% of its content appear in any dictionary.
But I digress.
If you are satisfied that the 'errors' are actually proper words, they can be ignored. They are of no consequence in the conversion to ePub, Mobi etc. (Check! Is this statement correct?)
Personally I have been loading my dictionaries with everything I can, this has proved to speed my work later on, especially when scanning a series of books. Fewer high-lighted words to linger on!
In spite of my desire to achieve high quality, I cannot stop the occasional spelling error creeping into the finished item. So what!
In any event I am the only one going to read them. They are not for publication or dissemination.
I have found that proof reading for long periods spoils my enjoyment of leisure reading.
Too critical an eye picks up spelling and grammatical errors which I have missed previously.
eg "A lone sentry standing virgil at the graveside"
"The coding is secure, we have a new logarithm"
I smell Spell Checker! (and these by a very respected publisher, well, big! )
So, editing is a chore, I find that I go through a book many times to edit and correct and also try to maintain the flow, the look and the feel of the original work which makes for such enjoyable reading and is the reason I have kept them all these years to re-read them again and again.
Phase 5 The Layout
I really do not know where to start with this. I have tried many layouts and the difference between the ebook on computer screen and the reader is quite large.
The main difference, after all of these trials, is this: I want a justified page throughout but my reader (PRS 505) presents left alignment!
I may start rambling here. This is where my lack of experience is beginning to show!
It would seem that it does not matter what other attributes one requires, font, size etc. What 505 wants, 505 gives! (Is this a valid statement?)
I have finalised on A4, default style, Arial or Times New Roman 12 font size. Everything else, justification etc, is applied as an attribute, not incorporated as a defined style.
Therefore, I believe, it is not incorporated in the final construct data carried to the reader.
If this is correct then I need to learn how to apply styles to my layouts and form a template to use as the basis of my layouts.
Thus the construct data of the template will carry through to the reader!
I am treating 'construct data' as I believe 'meta data' works.
Does this sound real to anyone?
If so, then I need to learn how to determine my layout and save it as a book template proper!
Try as I might I cannot do this properly with 'File', 'Templates', 'Save' buttons. (probably doing it all wrong)
Nevertheless, I have my ebooks and am carrying on with the remainder.
My next problem is :- many of them are collections of short stories and will need TOC's.
Try as I might, I cannot fathom out how to do those either.
I have tried many instruction sets to do this but they are too involved, assume that I know much about word processing or the program 'Word'. I need step by step (baby steps) instructions.
Experts on the Forum, you know who you are, please look critically at your postings. Most of you post welcome positive knowledge but it is not all there. You know what you are saying but gloss over many of the minor details because those details are so obvious to you, but it is those details which are needed to fill the voids in my knowledge.
Just a thought! (This is probably my cry for help)
I am trying very hard to complete this project and do it correctly (as is my wont) but am starting to believe I have become an old fart.
I was discussing prostate problems with my GP and said to him, “I have become a classic grandad.” “Define 'classic grandad',” said he.
A classic Grandad is stooped, portly, jolly, silver haired, smells of pi**.
I was going to discuss copyright but I thought - Don't get me started!!!
By the way, I can recommend the Optic Book 3600 (I believe now 3800) for any home. If you have 100+ pbooks to keep and carry into retirement, this is a reasonable investment and a worthwhile hobby.
|04-30-2012, 05:36 AM||#2|
Join Date: Nov 2009
Device: iPod touch 2G (16 GB)
Why are you saving them as BMP and then TIFF? For OCR-ing purposes, JPG (85-90% quality) works just fine and takes up A LOT less space in the initial, scanning phase. I scan at 300 DPI for the pages, and 600 DPI for the covers or other graphics in the book (charts, photos, etc). Anything lower or higher than that could mess with the OCR. For instance, at 600 DPI, small imperfections are detected as commas, dots, accents, etc., and since scanning at 150 DPI takes almost as long as 300 DPI, I use 300.
Speaking of proofreading, you say that it "spoils your enjoyment of leisure reading" but that you also read them "again and again". Then why not make an effort to read them in FineReader, at least once? You'll enjoy reading them the second time (on the e-reader) much more, because then you won't have to stop for misspellings or words that sound funny. Make good use of the dictionary when proofreading. Don't just "load it with everything you can" because the words sound right. I usually look them up on dictionary.com first. If the printed book contains misspellings, I'm correcting them bee-hatches. In the (not so distant) future I may use text-to-speech software.
The Impotence of Proofreading
LayoutPrep – a custom Word macro that preps your OCR content for styles
Resources for identifying fonts
Last edited by DSpider; 04-30-2012 at 05:44 AM.
|05-03-2012, 10:26 AM||#4|
Join Date: Feb 2011
Device: Sony PRS-505
Hi guys, thank you for your comments. I expected a little more support and encouragement but I can live with what you are saying.
You seem to have missed the gist of what I said, and instead are extolling your expertise without any offer of description or explanation!
I don't need a dictionary for myself. I am happy with my own spelling skills.
I am plumping up the on board dictionary for use by the OCR (or rather the word processor, in this case Open Office).
Suppose you are editing a series of books: as you progress to later books, the editing becomes faster by being more correct. (Fasterer and correcter?)
Many plots of my books are set in foreign climes and have quite a large number of words of other languages or vernacular. (Vernaculae?)
I stick 'em in the dictionary.
The book I am working on at present contains many names in Turkish, Kurdish and Lebanese; also, the characters' spoken words are presented as immigrant English. In the vernacular! One of a series of four!!
By sticking everything into the dictionary, the third and fourth books become much easier.
Previous books have plots using French, Spanish, Hungarian and many more languages.
I am advised to buy ABBYY Fine Reader 9 (I think it is at iteration 10 now).
I don't think that will happen. Do you know that would cost me twice what I paid for the complete device and software?
I am not doing this for a living! I am doing this for 'fun!' It is a retirement project!
Should I buy a 'good' camera, also ABBYY Fine Reader, and build a quadpod arrangement to do the job? For three times the cost? Nah!
By the way, is it acceptable, or advisable to trash the good name of a product, without offering evidence of fault? (It is called Blackguarding) Especially in these litigious times. I am sure this forum executive will be looking askance at such a claim and will be scurrying to disclaim such words.
What else? Oh yes. Use JPEG instead of BMP or TIFF.
I thought that an enthusiast might read Bob Russell's document first, but that was not to be.
I was advised by that document and chose the default BMP.
Why should I care whether the page image is lossless or not? It is temporary, and after conversion and OCR processing it is lost and you receive a page in RTF.
I should advise that Fine Reader proper is not supplied. It is a stripped down 'Express' version which is quite adequate for the job. It contains no word processor or dictionary. You can't even fire it up to OCR a page of text. It will not recognise it!
It only recognises a photographic page image such as BMP, etc.(TIFF is supposed to be best.)
I got it going and it worked! I am happy.
I have learned not to blame the scanner for all the faults in the reading!
I also scan in 300DPI for covers and get good results but I don't include covers in the ePub.
If the printed book contains misspellings I usually can correct them without using a dictionary and load my onboard dictionary with that corrected word.
I don't know what bee-hatches are, and in any future I will never use text-to-speech software.
|05-03-2012, 10:38 AM||#5|
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon
At the end you'd have every possible word in every possible language in the dictionary. Would that help? I don't think so.
|05-03-2012, 12:42 PM||#6|
Join Date: Nov 2006
Device: Kindle Touch, PW, Fire HD, iPad 3, iPhone 4, Samsung Tab 2 7 + More
I think you're doing very well to produce 60 high-quality books in a year, by the way. I spend 1-2h a day proof-reading public domain books in order to create nice e-books, and I'm happy to do 15-20 books in a year.
Currently proofreading The Poison Belt, by Sir Arthur Conan Doyle.
Last edited by HarryT; 05-03-2012 at 12:48 PM.