Go Back   MobileRead Forums > E-Book Formats > Workshop

Old 08-05-2010, 01:23 AM   #46
Lady Fitzgerald
Wizard
Lady Fitzgerald ought to be getting tired of karma fortunes by now.
Posts: 2,013
Karma: 251649
Join Date: Apr 2010
Location: Tempe, AZ, USA, Earth
Device: JetBook Lite (away from home) + 1 spare, 32" TV (at home)
Quote:
Originally Posted by Mr. Dalliard View Post
If you are prepared to rip the spine off the book, your task will be a lot easier, otherwise it is a lot of work.

It is far from being impossible though.
Not sure who you are addressing this to. On paperbacks, I just cut the spine off. Takes 15-30 seconds. On hardbacks, I have to cut the covers off before cutting the spines off. Cutting the covers off is easy, just run a knife over the "hinges" formed by the endpapers. If there is a corded ribbon (I forget the technical name and I'm too lazy to look it up right now) at each end of the spine, I rip those off. If the spine has been rounded, I can usually "break" it in several places by bending it back sharply which will usually let me flatten it enough to cut it off in the guillotine. If it is too stubborn to flatten or the book is too thick for the guillotine (it will handle only 1 1/2"), then I "split" the book into sections by "breaking" the spine and scoring it with a knife enough to let me snap it apart (same goes for really thick paperbacks). At worst, it only takes a minute or two to prepare a hardback for the guillotine. It takes about 15-30 seconds to actually cut the spine off.
Old 08-05-2010, 10:58 AM   #47
nyrath
Addict
nyrath reads XML... blindfolded
Posts: 281
Karma: 52007
Join Date: Jun 2010
Device: nook
Quote:
Originally Posted by Lady Fitzgerald View Post
Huge. Off the top of my head, I would say 15MB. Granted, that is much larger than typical e-books
But you do not have to proof-read. Which is no small consideration.

For the record, my OCR eBooks seem to average at about 0.5MB.
Old 08-05-2010, 11:59 AM   #48
Lady Fitzgerald
Wizard
Lady Fitzgerald ought to be getting tired of karma fortunes by now.
Posts: 2,013
Karma: 251649
Join Date: Apr 2010
Location: Tempe, AZ, USA, Earth
Device: JetBook Lite (away from home) + 1 spare, 32" TV (at home)
Quote:
Originally Posted by nyrath View Post
But you do not have to proof-read. Which is no small consideration...

True that!

Quote:
Originally Posted by nyrath View Post
For the record, my OCR eBooks seem to average at about 0.5MB.
That seems to be about average for the few ePubs I do have.

Keep in mind that reading the PDFs is a bit of an awkward compromise. A larger reader, like a tablet, would be better but portability wins out in my case. A tablet won't fit in my purse but a reader will.
Old 08-29-2010, 02:24 PM   #49
Iain
Enthusiast
Iain began at the beginning.
 
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
Converting my books - finally

Firstly, thanks for the comments I've read on this forum and to the people who've answered my questions.


I've finally completed starting my digitising task! This whole thing has turned from a task into a fairly complex project, with a good deal of custom-written software. And that's before I've digitised more than a few books!

I've blogged about this (horrid word and this is one of my first attempts at blogging) in some detail here (Iain's blog) but the short form goes like this.

I start off by cutting the spines off with a guillotine and counting the pages.

I've written a scanning program which talks to my Fujitsu fi-6130. It captures the ISBN (bar code scanner or human entry) and finds the publication details (isbndb.com). I enter the subject and the number of pages and start the scan.
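
Since the ISBN can be typed by hand, it is worth checking it before doing a lookup. The following is a minimal sketch of the standard ISBN-10 and ISBN-13 check-digit tests, not the code from the post; the function names are made up for illustration, and the isbndb.com lookup itself is left out because its API is not described here.

```python
def clean_isbn(raw: str) -> str:
    """Drop hyphens and spaces and upper-case a trailing 'x' check digit."""
    return raw.replace("-", "").replace(" ", "").upper()


def _all_ascii_digits(text: str) -> bool:
    return bool(text) and all(c in "0123456789" for c in text)


def is_valid_isbn10(isbn: str) -> bool:
    if len(isbn) != 10 or not _all_ascii_digits(isbn[:9]):
        return False
    if isbn[9] != "X" and isbn[9] not in "0123456789":
        return False
    digits = [int(c) for c in isbn[:9]]
    digits.append(10 if isbn[9] == "X" else int(isbn[9]))
    # Weights run from 10 down to 1; a valid ISBN-10 sums to a multiple of 11.
    total = sum(w * d for w, d in zip(range(10, 0, -1), digits))
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    if len(isbn) != 13 or not _all_ascii_digits(isbn):
        return False
    # Weights alternate 1, 3, 1, 3, ...; a valid ISBN-13 sums to a multiple of 10.
    total = sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(raw: str) -> bool:
    """Accept either form, with or without hyphens."""
    isbn = clean_isbn(raw)
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)
```

A barcode scan of a book normally yields the 13-digit form, while older paperbacks often print only the 10-digit form, so accepting both seems the safer default.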

The program scans the first pages (the cover pages) in colour and the rest in monochrome. I do, of course, have to reload the hopper every minute or so, but that's quick and not too distracting. On completion, the TIFF file (500MB - 2GB!) is queued for OCR and so on. If there are problems, then you can edit the TIFF and delete pages or add new scans.

The OCR processing side uses FineReader 10. I'm controlling FineReader through AutoHotkey so I don't have to interact with it. FineReader processes the document and saves it in Word, HTML and text formats.

The Word document is then processed (again by a program of my own devising) to generate an ePub file which actually looks pretty good (though I say so myself).
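
For readers wondering what the ePub end of such a conversion involves, here is a rough sketch of a one-chapter EPUB 2 container built with Python's standard `zipfile` module. It is not the program described in the post: the file names (`content.opf`, `toc.ncx`, `text.xhtml`) and the `write_epub` function are assumptions chosen for illustration, the language is hard-coded to English, and the output has not been run through a validator.

```python
import html
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{uid}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="text" href="text.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="text"/>
  </spine>
</package>
"""

NCX_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="{uid}"/></head>
  <docTitle><text>{title}</text></docTitle>
  <navMap>
    <navPoint id="np1" playOrder="1">
      <navLabel><text>{title}</text></navLabel>
      <content src="text.xhtml"/>
    </navPoint>
  </navMap>
</ncx>
"""

XHTML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body>
{body}
  </body>
</html>
"""


def write_epub(dest, title, uid, paragraphs):
    """Write a single-chapter EPUB to dest (a path or a binary file object)."""
    safe_title = html.escape(title)
    safe_uid = html.escape(uid)
    body = "\n".join(f"    <p>{html.escape(p)}</p>" for p in paragraphs)
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as epub:
        # The container format expects 'mimetype' first and stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
        epub.writestr("content.opf",
                      OPF_TEMPLATE.format(title=safe_title, uid=safe_uid))
        epub.writestr("toc.ncx",
                      NCX_TEMPLATE.format(title=safe_title, uid=safe_uid))
        epub.writestr("text.xhtml",
                      XHTML_TEMPLATE.format(title=safe_title, body=body))
```

A real converter would split the text into chapters, carry over italics and bold, and add the cover scan; this only shows the skeleton that every ePub needs around the text.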

Finally all the book details and the text are put in a database so that I can find books in a variety of ways.
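
The post does not say which database is used or how it is laid out. As one possible shape for it, the sketch below uses SQLite from Python's standard library; the `books` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3


def open_library(path=":memory:"):
    """Open (or create) the catalogue; the schema here is only an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS books (
               isbn    TEXT PRIMARY KEY,
               title   TEXT NOT NULL,
               author  TEXT,
               subject TEXT,
               pages   INTEGER,
               body    TEXT      -- full OCR text of the book
           )"""
    )
    return conn


def add_book(conn, isbn, title, author, subject, pages, body):
    """Insert a book, replacing any earlier scan with the same ISBN."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?, ?, ?)",
            (isbn, title, author, subject, pages, body),
        )


def find_books(conn, phrase):
    """Return (isbn, title) for books whose details or text contain phrase."""
    pattern = f"%{phrase}%"
    rows = conn.execute(
        """SELECT isbn, title FROM books
           WHERE title LIKE ? OR author LIKE ? OR subject LIKE ? OR body LIKE ?
           ORDER BY title""",
        (pattern, pattern, pattern, pattern),
    )
    return rows.fetchall()
```

A plain `LIKE` scan is slow on hundreds of full-length books; a full-text index would be the natural next step, but whether one is available depends on how the local SQLite was built.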

That's the short form! The blog has a good deal more detail and I would welcome comments!

In particular, having spent a good deal of time writing code for this, I'm wondering if there is an opportunity to commercialise this.

Do you think people would be interested in a book digitisation service? (I think I would have to charge about $2 a book and the book would be destroyed.)

Do you think people would be interested in a more or less off the shelf system which could efficiently turn their mouldering paperbacks into pristine eBooks?

Let me know here or privately at iain AT idcl DOT co DOT uk
Old 08-30-2010, 05:22 AM   #50
HarryT
eBook Enthusiast
HarryT ought to be getting tired of karma fortunes by now.
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Iain View Post
Do you think people would be interested in a book digitisation service? (I think I would have to charge about $2 a book and the book would be destroyed.)
I'm sure they would be, but I'm not sure about the legality of it in the UK. Format shifting is NOT legal here.
Old 08-30-2010, 09:17 AM   #51
Lady Fitzgerald
Wizard
Lady Fitzgerald ought to be getting tired of karma fortunes by now.
Posts: 2,013
Karma: 251649
Join Date: Apr 2010
Location: Tempe, AZ, USA, Earth
Device: JetBook Lite (away from home) + 1 spare, 32" TV (at home)
You do not edit after OCR?

On average, how much time did you spend on each book?
Old 08-31-2010, 06:10 AM   #52
Iain
Enthusiast
Iain began at the beginning.
 
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
Format Shifting in the UK

Thanks for your comment, Harry. I'd not started this off as a commercial venture, so I had not researched the legal side. I see you are quite right and the whole thing is a complete mess.

It would appear, however, that I could manufacture and sell hardware and software which 'format-shifted' books without infringing any law. The user of the equipment would be in breach (if they cared!) but not I.

There does seem to be some indication that the EU are moving, Leviathan-like, to some resolution of this and I may still be alive when they manage to get there!
Old 08-31-2010, 06:36 AM   #53
Iain
Enthusiast
Iain began at the beginning.
 
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
Flaws and time

Quote:
Originally Posted by Lady Fitzgerald View Post
You do not edit after OCR?

On average, how much time did you spend on each book?

I do not edit after OCR. It's still early days and I'm refining the Word->ePub transformation. Also, it takes a good deal longer to READ the book than the whole rest of the process.

I'll report when I've read a dozen or so books, but so far I seem to have almost no character mis-recognitions. I'm talking of a handful in a book.

The other flaws I'm encountering may be artefacts of my Word->ePub translation or of the OCR. I'm not sure which, yet. I'm expecting to be able to fix many of these either by fixing my code or by applying a bit of intelligence to the process.

So far (and this is NOT statistically reliable), I'm seeing a missing space about every 4 pages, a space added after a correctly- hyphenated (sic!) term about as often and a line break in a paragraph every 10 pages or so (I think I know what's causing this and *may* be able to fix it).
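
The three flaws listed above can each be attacked with a small text clean-up pass. The sketch below is a heuristic guess at what such a pass might look like, not the fix actually used; `tidy_ocr_text` is a made-up name, and each rule will occasionally change text that was correct (for example "pre- and post-war" would lose its space).

```python
import re


def tidy_ocr_text(text: str) -> str:
    """Apply three cheap repairs to OCR output; all of them are heuristics."""
    # 1. Space added after a hyphen inside a word: "correctly- hyphenated".
    text = re.sub(r"(?<=\w)- (?=\w)", "-", text)
    # 2. Missing space between sentences: "ended.Then" becomes "ended. Then".
    #    Only fires when a lowercase letter and . ! or ? directly precede a capital.
    text = re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", " ", text)
    # 3. Line break inside a paragraph: the line ends mid-sentence (lowercase
    #    letter or comma) and the next line starts with a lowercase letter.
    text = re.sub(r"(?<=[a-z,])\n(?=[a-z])", " ", text)
    return text
```

Rule 2 only catches missing spaces at sentence boundaries. A missing space between two ordinary words needs a dictionary check, which is beyond this sketch.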

Actually, I'm delighted with the quality, though as I mentioned in my post I'm not the best person to proofread things.

As far as time is concerned, I've been doing some Hammond Innes this morning. It took me about 13 minutes to trim a dozen books. They are almost consistently sized and quite thin (280 pages or so) so they are about the easiest of all books to slice.

I've scanned about two whilst I've been writing this. One of my main objectives is to be able to scan whilst I work. If there are no issues with the scan, then it takes probably a minute of my time for a book this size to scan (bar code) the ISBN, enter the pages (and subject) and feed the hopper.

Issues (I seem to be fumble fingered this morning! - I've been putting the covers in the wrong way round) add some minutes.

I bought a Thomas Hardy (for 5 pence!) at a car boot sale yesterday and plan to scan that and compare it to a Gutenberg version to get a more formal comparison. At some point!
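
One way to make that comparison more formal is to count how many characters differ between the two texts. The sketch below uses `difflib.SequenceMatcher` from Python's standard library; it approximates a character error rate rather than computing a true edit distance, and `char_error_rate` is a hypothetical helper, not something from the post.

```python
import difflib


def normalise(text: str) -> str:
    """Collapse all runs of whitespace so line wrapping does not count as an error."""
    return " ".join(text.split())


def char_error_rate(reference: str, ocr_text: str) -> float:
    """Approximate share of reference characters that the OCR text gets wrong."""
    ref = normalise(reference)
    hyp = normalise(ocr_text)
    if not ref:
        return 0.0
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side of each differing stretch.
            errors += max(i2 - i1, j2 - j1)
    return errors / len(ref)
```

Two cautions apply. `SequenceMatcher` gets slow on whole novels, so comparing chapter by chapter is more practical. The reference text may also come from a different edition, so some of the differences will be editorial rather than OCR mistakes.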

Hope this is interesting...
Old 08-31-2010, 09:12 AM   #54
Lady Fitzgerald
Wizard
Lady Fitzgerald ought to be getting tired of karma fortunes by now.
Posts: 2,013
Karma: 251649
Join Date: Apr 2010
Location: Tempe, AZ, USA, Earth
Device: JetBook Lite (away from home) + 1 spare, 32" TV (at home)
It is interesting for me since I'm in the process of digitizing my book collection.

No matter how good an OCR program may be, it will still take a fair amount of time to run. I have the version of ABBYY FineReader that came with my Fujitsu ScanSnap S1500. I've only used it to give me searchable PDFs of tech magazines I have (obviously, no editing is required since there is no visible text generated other than the image of each page taken by the scanner). It takes around 30 minutes to an hour (I don't remember exactly) for the OCR to run on a 100 page magazine in addition to cutting and scanning the magazine (fortunately, I do not have very many magazines). Without OCR, I can scan, save, and catalogue 3-4 books per hour if I'm paying attention (usually I'm not; having ADD doesn't help). Since I have over 1500 books to do and want to finish before the end of the year, OCR just isn't an option, even without editing. I could always run my PDFs through OCR later but I don't plan on it. I'm able to easily read all but the largest books with the smaller print on a JetBook Lite. Even the large page, small print books can be read without eyestrain on the JBL but it's a bit more awkward to scroll and good lighting becomes more critical. Using the JBL instead of a larger reader is a tradeoff to gain portability (it fits in my purse).

You said that your OCR process has few errors. How well does it deal with page headers and footers and page numbers? How about drop caps at the beginning of a sentence? Some of those use pretty intricate, decorative fonts. How about when fonts change within a book, such as bold text or italics? Is your OCR process able to replicate or accurately read those? Often, certain passages in a book have increased margins to denote a quoted passage, such as a paragraph from a letter. How does that get handled? Many fonts used in books have characters that are similar or identical to others, such as the upper and lower case j being identical or the letters l and I being similar to each other and the number 1 (sometimes even identical). How well is that handled? How do images get handled? You said you can tolerate some mistakes. How many is some? Unfortunately, I would find any mistakes very distracting and annoying. For me, editing would take about as long as it would take to read the book. I can't spare even 30-60 minutes just to run the OCR because of the large number of books I have and limited time available, even considering I'm retired now.

I wish getting an occasional cover wrong way around was my only operator error. I have been known to insert a set of pages in the ADF the wrong way. If the pages were merely upside down, it would be easy to correct in Adobe Acrobat 9 but if I get the order reversed, it's much faster to rescan those pages, then replace the incorrect pages with the newly scanned ones, again using Acrobat.

How many cuts have you made with your guillotine? Mine broke after only 250 books. Although I'm currently doing battle with Amazon over it since the guillotine they sold me apparently is an inferior knock off, I would consider spending the extra money to get a more reliable one.

My guillotine has a different clamping mechanism than yours but the fence is the same as yours. I also had problems trying to figure out where to set it because there is no easy way to see where the cut will occur. I found the easiest way to align the fence (which also kept my fingers away from that vicious cutter blade) was to leave the blade dropped after the previous cut (I also store it that way), slip the book into place with the spine against the blade, lower the clamp until it lightly touches the book (but still allows free movement), push the fence tightly against the book until the pages are flush with the fence face, then tighten the clamp on the fence. I then raise the blade and lock it, push the book away from the fence slightly, slip a shim or two (thin pieces of cardboard; the number and thickness based on previous experience) between the fence and the book, then pull the book back against the fence. I then tighten the clamp a bit more, use a thin tool to gently bump the book snug against the fence (the idea of the tool is to avoid getting my fingers near the blade; I almost lost the tip of a thumb to it when I first got it), then finish tightening the clamp and make the cut. I found this procedure goes quickly, is safe, and is more accurate than trying to eyeball where the cut is going to take place.

If a book has a very curved spine and the gutter margin is too small to comfortably accommodate the curvature when cutting the spine off, on hardbacks (I strip the cover off hardbacks before cutting to avoid excessively stressing the guillotine), I try "breaking" the spine by folding it sharply back in several places to try and make it easier to flatten the spine. If that doesn't work (and on paperbacks), I cut the book apart into several smaller pieces, which minimizes the curvature of each section of book, then cut each piece one at a time.

Last edited by Lady Fitzgerald; 08-31-2010 at 10:56 AM.
Old 09-01-2010, 04:10 AM   #55
Iain
Enthusiast
Iain began at the beginning.
 
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
OK. Lots of questions there. I'll try and get answers to all in.

Firstly, my books are mainly of the 'pulp fiction' variety so tend to be light on posh formatting. I'm also still tuning the whole process so there's the what is being done and what can be done.

For a paperback book the OCR process takes roughly the same time as the scanning process: somewhere between 4 and 10 minutes. That is with the latest FineReader running on a quad-core machine, so I can see how it could get to be 30 mins on an older machine with an older version.

The system I've written makes the processing automatic so I can do it on another machine or even overnight.

The OCR does a good job of italic and bold changes. It should do well for margin changes (the information is there in the Word doc), though I've not yet processed (or at least proofed) a book which uses this.

I think there are around half a dozen character misreads in the 300 page book I've just 'proofed' (though my disclaimer about my proofing skills remains!).

The more complex stuff which happens before and after the book (with decorative fonts and mixed up with graphics) can be a mess, so I would imagine anything complex in the middle will also be a mess. I'll look at dealing with the messes as I come across them!

I actually deliberately discard headers and footers. If you want pages to reflow as font sizes change then they aren't helpful. Having said that, you've just made me realise I can use them to enhance chapter detection.
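
How the headers and footers are recognised is not described, so the following is only one plausible approach: treat the first or last line of a page as a running line if, ignoring page numbers, the same line turns up on a good share of the pages. The function name, the input format (a list of pages, each a list of non-blank text lines) and the 30% threshold are all assumptions.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Replace digit runs so "THE WRECK 10" and "THE WRECK 11" count as one header.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_running_lines(pages, min_share=0.3):
    """Remove first/last lines that recur on at least min_share of the pages."""
    counts = Counter()
    for lines in pages:
        if lines:
            # A set avoids counting a one-line page twice.
            counts.update({_key(lines[0]), _key(lines[-1])})
    threshold = max(2, int(len(pages) * min_share))

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and counts[_key(body[0])] >= threshold:
            body.pop(0)
        if body and counts[_key(body[-1])] >= threshold:
            body.pop()
        cleaned.append(body)
    return cleaned
```

The same counts could feed chapter detection: in books that print the chapter title as the running header, the page where the header text changes is a strong hint that a new chapter has started.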

I suspect that I've been lucky with the books I've proofed so far and I also suspect I have a higher level of tolerance for errors!

Thanks for the advice on the guillotine. That all sounds like a good deal of sense - I too have lightly touched the blade (I had to remove the guard to see what is going on) and found it astonishingly sharp! I wish my kitchen knives were that sharp.

I suppose I have it in mind that if there are serious problems in a book I can go back to the original and tweak the OCR. I've also thought about writing an editing eBook reader for the iPad to tweak the minor errors. However, I doubt I will ever have the time or energy to do this.

In a couple of weeks I'll have a much better idea of the quality and will keep you posted on what I discover!

Iain
Old 09-01-2010, 05:20 AM   #56
Lady Fitzgerald
Wizard
Lady Fitzgerald ought to be getting tired of karma fortunes by now.
Posts: 2,013
Karma: 251649
Join Date: Apr 2010
Location: Tempe, AZ, USA, Earth
Device: JetBook Lite (away from home) + 1 spare, 32" TV (at home)
Thanks, Iain.
Old 09-02-2010, 03:49 AM   #57
Iain
Enthusiast
Iain began at the beginning.
 
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
Just an update on that.

What I seem to be seeing is that character recognition is very accurate and most of the errors with spaces and line feeds I'm seeing are bugs in my conversion to ePub.

It also handles italics and bold and font size changes well.

However, at the moment it does not spot section indentation or justification changes. So some 'poems' are not inset and chapter headings not centered.

I may be able to get round this by using the more formatted output as a source, but haven't tried yet.

I'll keep you posted.

Iain