MobileRead Forums - View Single Post

nabsltd · 01-10-2017, 09:33 PM

Quote:

Originally Posted by pwalker8

First off, you grossly underestimate the time and effort to produce a professional quality ebook. It's not just a case of scan, it's also a case of editing the result to catch all the scan errors and formatting.

Did you miss the part where I said I've already done this for crappy scans found on the Internet, and that it really doesn't take very long? If I get to start with a good scan that I have created, and train the OCR, I end up with very few errors, and it's quite fast, because the OCR only stops to ask you the first time it hits a letter (or letter grouping) it doesn't understand.

Basic formatting is then pretty easy, with a little programming that understands dialog (a quote at the end of a line usually ends a paragraph) and understands that the text is justified (so a line that stops short of the right margin usually ends a paragraph too). This allows you to catch many paragraphs, and if you add the HTML wrapping (<p>) and temporarily format so that there is extra space between paragraphs, you get a good visual separator to help find the rest of the paragraphs.
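To make that concrete, here is a rough Python sketch of the kind of heuristic I mean. It is not my actual script; the function names, the `slack` threshold, and the idea of measuring "justified" by character count are all assumptions for illustration. It ends a paragraph when a line ends with a closing quote or falls noticeably short of the full line width, then wraps each paragraph in <p> with a blank line between them.

```python
# Hypothetical sketch: names and thresholds are illustrative assumptions.
CLOSING_QUOTES = ('"', '\u201d')  # straight and curly closing double quotes


def split_paragraphs(lines, full_width=None, slack=5):
    """Group OCR'd lines of justified text into paragraphs.

    A paragraph is assumed to end when a line ends with a closing quote
    (dialog) or is clearly shorter than a full justified line.
    """
    lines = [ln.rstrip() for ln in lines if ln.strip()]
    if not lines:
        return []
    if full_width is None:
        # Assume the longest line is a full justified line.
        full_width = max(len(ln) for ln in lines)

    paragraphs, current = [], []
    for ln in lines:
        current.append(ln.strip())
        ends_with_quote = ln.endswith(CLOSING_QUOTES)
        short_line = len(ln) < full_width - slack
        if ends_with_quote or short_line:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Text left over at the end of the input.
        paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in <p>, separated by blank lines for easy review."""
    return "\n\n".join(f"<p>{p}</p>" for p in paragraphs)
```

This will miss some breaks and invent others (a full-width line of dialog that merely pauses at a quote, for instance), which is why the visual pass afterward still matters.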

You solve italics by cheating on the OCR training, and having it identify something like " Q" as " <em>Q".

I have done complete re-formats on enough eBooks (both professional source and the results of OCR) that I know I can do it in about 2 hours per book. If I have the original hardcover as a reference, when I do this kind of reformat, the final product is identical (with the exception of headers and footers, which I of course omit for the eBook), with the correct fonts, correct ratios of white space to page size for chapter start, scene breaks, etc. Ornaments on scene breaks are as original, unless they were only used for breaks that are widows/orphans. Chapter numbers/names and ornaments are correct. And yeah, this really only takes me about two hours per book.

Quote:

Second, whatever makes you think that publishers keep an electronic copy of every book they have published since 1995?

I said "have access to", in that they can ask the author, who very likely has an electronic copy. In addition, an electronic copy is created by the publisher to feed to the computerized typesetter that prints the book. Until the publisher loses rights, they (or the typesetting provider, if a different company) will have that electronic version.

Quote:

Originally Posted by ZodWallop

In the U.S. at least, all of his major books have been released. The missing Asimov books are typically ones that are long OOP even in the paper world.

Again, that works out to 26 out of 49 of his fiction books, for 42%. Compare that to Harry Harrison with 42 out of 59 (71%), or Jack Higgins at 79% (61 of 77). Both of those are in about the same "out of print" state as Asimov.

For Asimov, I'm not counting the "hundreds" of books he "wrote" (even though I've read most of them). Remember that many in that count were just anthologies where he wrote glue for other authors, and the largest group was his collections of science essays. I don't expect to see any of those anytime soon, but the novels and story collections should almost all be available.

Quote:

Originally Posted by pwalker8

It takes me a lot longer, but then again, I don't destroy the book to scan it (a lot of people will cut the pages out of the binding to be able to feed it into a fast scanner, I use a camera rig that doesn't damage the book) and I'm not a professional editor.

Buy a cheap paperback and destroy it. It will give you a far better scan because the pages will be perfectly flat and aligned. Even using a flatbed scanner (which would take about the same amount of time per page as your rig), you'd still save time by having fewer OCR errors to fix. In particular, you'd be able to trivially crop the scan to get only the text and not headers/footers.

If you collect each chapter together, you don't even care about scanning the chapter name/number...just add that manually. But I'd wait till later: with each chapter saved as something like "01.html", run a script to add chapter numbers just after the "<body>" tag.
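A minimal sketch of such a script in Python, assuming two-digit file names like "01.html" in a single directory. The <h2> markup, the "Chapter N" wording, and the function names are my own placeholders; use whatever heading the book actually has.

```python
import re
from pathlib import Path

# Matches the opening <body> tag, with or without attributes.
BODY_TAG = re.compile(r"<body[^>]*>", re.IGNORECASE)


def add_chapter_heading(html_text, number):
    """Insert a chapter heading right after the first <body> tag."""
    heading = f"\n<h2>Chapter {number}</h2>"  # placeholder markup
    return BODY_TAG.sub(lambda m: m.group(0) + heading, html_text, count=1)


def number_chapters(directory):
    """Add headings to every NN.html file, taking the number from the name."""
    for path in sorted(Path(directory).glob("[0-9][0-9].html")):
        number = int(path.stem)  # "01" -> 1
        text = path.read_text(encoding="utf-8")
        path.write_text(add_chapter_heading(text, number), encoding="utf-8")
```

Run it once over the chapter directory, e.g. number_chapters("book"). It doesn't check for an existing heading, so running it twice will add the heading twice.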