Old 02-20-2013, 04:33 PM   #1
Pumpkin Soup
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2012
Device: iPad
OCR to EPUB Best Workflow

Hello! I've never done OCR before, and I wanted to hear what people's thoughts were on the best workflow. I have an idea of my workflow I'd pursue + some questions along the way. Please let me know if you have any thoughts to improve my process (or correct me if I got something horribly wrong or missed a step). Assuming cost was not an issue, do you...

Either
1a) Build a DIY book scanner where you can flip the book and snap photos of pages with a camera (D-SLR, Output as TIFF).
or
1b) Physically remove the spine of the book (ideally with a stack paper cutter for the most accurate cut), then scan the cut-out pages in a stack scanner. What should the DPI be? Is it problematic to scan the pages into a combined PDF, or are individual TIFF pages preferable?

2) Take the scanned image/PDF files and run them through ABBYY FineReader (does any version of the software have better features than others?). If you have a complicated book (graphic elements you would like to remove) that you scanned into PDF, would ABBYY PDF Transformer be a worthwhile investment? (ABBYY products have been recommended to me the most, but please let me know if you prefer something else.)

Time to export... do you...

3a) Export to EPUB, start working directly inside the ABBYY-created EPUB
3b) Export as HTML, manually take the HTML and create an EPUB from there
3c) Export as a Word document or PDF (assuming you have scripts to make this process easier), and take those files into InDesign to begin making an EPUB, then export a built EPUB from InDesign and continue editing from there.
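
For route 3b, the "create an EPUB" step is mostly a matter of zipping the files in the right order. The sketch below assumes you have already prepared a folder containing an OEBPS/content.opf plus your XHTML and images; the function name, the folder layout, and the OEBPS path are illustrative assumptions, not something ABBYY or Sigil produces for you. It only handles the container packaging and does not write or validate the OPF itself.

```python
import zipfile
from pathlib import Path

# Points the reading system at the package file (path is an assumed layout).
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def package_epub(source_dir: Path, epub_path: Path) -> None:
    """Zip a prepared folder into an EPUB container.

    Assumes source_dir holds only the book content (e.g. OEBPS/...) and that
    epub_path lies outside source_dir.
    """
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype entry must be the first file and must be stored uncompressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip")
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        # Everything else can be compressed normally.
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source_dir).as_posix(),
                         compress_type=zipfile.ZIP_DEFLATED)
```

An editor such as Sigil does this packaging for you, so a script like this is mainly useful if you want to stay in a plain HTML workflow. Either way, run the result through an EPUB validator before calling it finished.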

Then, finished EPUB!

What are your thoughts? What is your preferred route? Thanks.
Old 02-20-2013, 05:32 PM   #2
Turtle91
Guru
Posts: 669
Karma: 3807234
Join Date: Dec 2012
Location: Shannon, Ireland today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
Great question! You would probably find several different answers because people would suggest the process that they are most comfortable with - not necessarily the best. Take those recommendations and go with what works for you. I'll let you know what I do and recommend...

Build a DIY scanner!! Heck yeah...it's fun...and even if you don't scan a single book with it, it was still fun building it! That's what I did (the fun building it part, not the no scanning part). BUT...it does cost a bit. If you only have a few books to scan, it's not worth the time/money - unless you just love tinkering. If you have a lot of books to scan, then your own scanner is indeed the best way to go.

If you are going to destroy a book by cutting the spine, then cost probably isn't an issue. Just save the time and buy the ebook. Personally I could never desecrate one of my babies that way.

No need to convert to TIFF....just leave them as .jpg. 300dpi is fine for text OCR and simple images/maps. If you want a high-quality image of the cover or artwork, then scan those separately and insert them in the ePub later.

www.diybookscanner.org has some great software that will automatically collate images from two different cameras, rotate them, deskew them, and even OCR them...all for free. Getting the software set up takes a little effort, but it works well. I haven't used it other than for test runs; I use commercial OCR software.

I personally like ABBYY.... You can get it to do the image corrections, OCR, spell check, etc. and then output to ePub (version 10 or later). There is no reason to export to all those other formats, just to change it back to an ePub. Every time you perform an auto conversion, you will get more code that needs to be cleaned up later.

I also recommend using Sigil to do the final edits on your ePub. It lets you check/fix the html code, and do all the other stuff you need for making a clean/valid ePub automatically....and it's free. (insert prompt for donating to the Sigil developers fund here)

Cheers!
Old 02-21-2013, 02:43 AM   #3
DSpider
Evangelist
Posts: 433
Karma: 326969
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
I suggest a cheap flatbed scanner, because today's scanners bought from your local brick and mortar store were considered professional-grade only a few years ago. They're great for OCR (even saved as JPG), and good enough for graphics, too. No point in dishing out $1k for a DSLR. Especially if you know how to work in Photoshop and Illustrator, and vectorize graphics such as charts, graphs, line art, chapter decorations, etc.

Focus on quality, rather than quantity. Scanning takes a lot less time (~1h-4h, depending on the book) than actually processing it.

My workflow can be boiled down to this: Scan pages with text as JPG (300 dpi, grayscale) and pages with images as TIFF (600 dpi, colour), OCR in ABBYY FineReader Pro, proofread it, export as RTF, run my own macro that removes the styles and everything else, track down the fonts, process/vectorize the graphics, import the RTF into InDesign and redo the layout, then proofread the final product again (basically I read the entire book twice in the process).

The result is usually of a very high quality and it's always a pleasure to read such a book. But not many people are willing to put in the time and effort - it can take up to a month if I work on it 2-3h each night. Lastly, make sure that the content is worth it, and that it's not already available as an e-book.

Last edited by DSpider; 02-21-2013 at 02:48 AM.
Old 02-21-2013, 03:11 AM   #4
najgori
Member
Posts: 20
Karma: 79590
Join Date: Sep 2011
Location: Belgrade, Serbia
Device: Kindle 4NT
check out plustek opticbook scanners. they are a little more expensive than average flatbed, but they are "nicer" to a book - not so destructive.
Old 02-21-2013, 09:31 AM   #5
Turtle91
Guru
Quote:
Originally Posted by DSpider View Post
No point in dishing out $1k for a DSLR.
Holy cow! Why would you want to pay so much for a scanner camera?? Mid to low end cameras these days are more than capable of doing the job...300 dpi for a 10 in x 12 in image (more than large enough for 99% of books) only needs 3000 x 3600 pixels, i.e. roughly an 11 megapixel camera.
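
As a quick sanity check of that figure, the required resolution is just (width in inches x dpi) times (height in inches x dpi). The helper below is a sketch with a made-up function name, and it assumes the page fills the camera frame exactly, which it rarely does in practice.

```python
def megapixels_needed(width_in: float, height_in: float, dpi: int) -> float:
    """Megapixels needed to capture a page of the given size at the given dpi,
    assuming the page fills the whole frame (an idealisation)."""
    width_px = width_in * dpi
    height_px = height_in * dpi
    return width_px * height_px / 1_000_000


if __name__ == "__main__":
    # 10 in x 12 in at 300 dpi -> 3000 x 3600 pixels
    print(megapixels_needed(10, 12, 300))  # 10.8
```

Since some of the frame is always lost to margins and the cradle, it is worth leaving headroom above the computed number when picking a camera.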

When I built my scanner, I got 2 cheap cameras for $60 - total.

Quote:
Originally Posted by DSpider View Post
Especially if you know how to work in Photoshop and Illustrator, and vectorize graphics such as charts, graphs, line art, chapter decorations, etc.
That is something I haven't done yet, but it sounds interesting. I get tired of mini graphics that don't scale very well. Do you know of a site with a tutorial on vectorization?? (I'm competent in Photoshop - but no expert)

Quote:
Originally Posted by DSpider View Post
export as RTF, run my own macro that removes the styles and everything else
I'm not sure why you would want to export to RTF if you are going to remove all the styles anyway - why not just export to text?

Until Sigil came along I would export to HTML then use a text editor with a few regex find/replace actions that would clean up all the extraneous styling, but keep the bold/italics/scene changes etc. Then I would manually add a specific style to headers and special sections of text (letters, poetry, etc.). That ended up taking about 30 minutes.
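
To give an idea of what such find/replace passes can look like, here is a minimal sketch in Python. The two patterns are assumptions chosen for illustration, not the actual expressions used above: one unwraps span/font tags while keeping their text, the other strips style and class attributes from whatever tags remain, so plain b/i/em/strong tags pass through untouched.

```python
import re

# Tags whose markup is dropped while their inner text is kept (assumed list).
UNWRAP_TAGS = re.compile(r"</?(?:span|font)\b[^>]*>", re.IGNORECASE)
# Inline presentation attributes removed from any remaining tag.
STYLE_ATTRS = re.compile(
    r"""\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)


def clean_ocr_html(html: str) -> str:
    """Strip extraneous styling from exported HTML, keeping <b>/<i> markup."""
    html = UNWRAP_TAGS.sub("", html)
    html = STYLE_ATTRS.sub("", html)
    return html


if __name__ == "__main__":
    sample = ('<p class="s1" style="margin:0"><span class="f2">'
              'He said <i>no</i> and <b>left</b>.</span></p>')
    print(clean_ocr_html(sample))  # <p>He said <i>no</i> and <b>left</b>.</p>
```

One caveat: if the OCR export encodes bold or italics as styled spans rather than as b/i tags, unwrapping the spans would throw that formatting away, so check a sample of the exported HTML before running anything like this over a whole book.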

It's about the same amount of time with Sigil, but when I'm done I just hit save, and it's a well-formed, clean ePub. Then I use Book View in Sigil to proofread the book. I can correct any errors in the document as I go - very simple. Total time to scan and clean: 1.5 - 3 hrs. Time to proofread...depends on the length of my honey-do list...

Cheers!

Last edited by Turtle91; 02-21-2013 at 09:43 AM.
Old 02-21-2013, 09:40 AM   #6
Turtle91
Guru
Quote:
Originally Posted by najgori View Post
check out plustek opticbook scanners. they are a little more expensive than average flatbed, but they are "nicer" to a book - not so destructive.
Yes, if you only have a few books to scan, a flatbed is the way to go. I have about 2000 books, so the amount of time saved with my DIYer is enormous.

One thing to be aware of when scanning on a flatbed - if you are holding the book flat to scan both pages at the same time the words get distorted when the page bends at the spine. Abbyy is pretty good at splitting the pages and running a correction algorithm for the page warp, but most OCR errors come from that problem. If you get a flatbed, get one with as thin a border around the scan surface as possible so that you can hold the book flat and scan one page at a time. (the thin border is to accommodate the small distance between the spine and where the print starts on a page) This will increase the time it takes to scan, but decrease the time in OCR correction.

Cheers!
Old 02-21-2013, 10:46 AM   #7
najgori
Member
plustek scans one page per scan without warp. here's video
http://www.youtube.com/watch?v=DWCUdBbfddY

however, 2000 books is many scans
Old 02-21-2013, 12:40 PM   #8
DSpider
Evangelist
Turtle91, even a cheap flatbed scanner will "see" well between the pages. Never had a problem with OCR-ing even with thick books and average-to-light pressure on the spine. If you're a complete beginner with vector graphics, Vector Magic will do the job nicely, but you'll still need to correct the output, using Adobe Illustrator or Inkscape. There are lots of tutorials on Youtube, on how to use, say, Inkscape to vectorize a bitmap image.

Quote:
Originally Posted by Turtle91 View Post
I'm not sure why you would want to export to RTF if you are going to remove all the styles anyway - why not just export to text?
It actually does that, initially. It breaks the text down into plain text, with each character surrounded by a tag (so that the bolds and italics do not get lost), then it puts it back as RTF. This results in a "squeaky clean" file, just right for processing.

http://www.mobileread.com/forums/sho...d.php?t=198297
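
The idea described here (tag every character with its formatting, then put the text back together) can be sketched roughly as follows. This is not the macro from the linked thread; the data model of (text, bold, italic) runs and the function name are hypothetical and only illustrate why the round trip yields a clean file: fragmented runs with identical formatting collapse into one.

```python
from itertools import groupby
from typing import List, Tuple

Run = Tuple[str, bool, bool]  # (text, bold, italic) - an assumed, simplified model


def normalize_runs(runs: List[Run]) -> List[Run]:
    """Explode runs into flagged characters, then merge equal neighbours."""
    # Step 1: every character carries its own bold/italic flags.
    chars = [(ch, bold, italic) for text, bold, italic in runs for ch in text]
    # Step 2: regroup consecutive characters that share the same flags.
    merged: List[Run] = []
    for (bold, italic), group in groupby(chars, key=lambda c: (c[1], c[2])):
        merged.append(("".join(ch for ch, _, _ in group), bold, italic))
    return merged


if __name__ == "__main__":
    messy = [("Hel", False, False), ("lo ", False, False),
             ("wor", True, False), ("ld", True, False)]
    print(normalize_runs(messy))
    # [('Hello ', False, False), ('world', True, False)]
```

Everything that is not one of the tracked flags (fonts, sizes, colours, named styles) is simply never carried across, which is what leaves only bold and italics in the output.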

PS: I hear that PlusTek scanners are not that good. They break down after a year or so (but then again, cheap cameras aren't that good either). I would recommend a flatbed scanner for somewhere around $60-90. They're absolutely fine for all intents and purposes. You would've paid over $800 for the same scan quality just a few years ago.

Last edited by DSpider; 02-21-2013 at 12:51 PM.
Old 02-21-2013, 01:00 PM   #9
Jellby
frumious Bandersnatch
Posts: 6,325
Karma: 4964183
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Spine warp is usually not a problem with text, but it can be quite annoying if you are trying to get good scans from illustrations.
Old 02-21-2013, 02:34 PM   #10
Toxaris
Wizard
Posts: 3,190
Karma: 7422141
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
My process is simple.
1. Cut off the spine
2. Put it through the sheetfeeder at my work and scan at 400dpi
3. Put it through ABBYY and save as DOCX, HTML and PDF/A (PDF/A for error hunting, HTML to see if all went well and for the images)
4. Run all my macros minus 1
5. Run spellcheck
6. Run HTML export macro
7. Create book in Sigil including layout
Old 02-22-2013, 07:15 AM   #11
MrPjax
Junior Member
Posts: 7
Karma: 10
Join Date: Feb 2013
Device: Samsung Galaxy Tab 7 Plus
Quote:
Originally Posted by Toxaris View Post
2. Put it through the sheetfeeder at my work and scan at 400dpi
Once you get to this step, what format is your scan currently in? Is it jpg or pdf?
Old 02-22-2013, 07:30 AM   #12
DSpider
Evangelist
Don't output straight to PDF. Never, ever do that with a scanner. Scan them as images, and once you have the images, process them (I recommend Scan Tailor), then wrap them up in a PDF using Adobe Acrobat or something similar.

OCR-ing text with FineReader works just as well with JPG, PNG or TIFF. I recommend JPG (80-85% compression), since they take up less space, and the error rate isn't any different than TIFF or PNG. In fact, if the image is too clean it can have a negative impact - such as detecting smudges, dust, printing defects as commas, dots or accents. JPG will "smooth" them out a bit and the OCR is actually a bit better. I only scan as TIFF (600 dpi) the pages that contain graphics.
Old 02-22-2013, 03:08 PM   #13
Toxaris
Wizard
They are standard JPG encapsulated in a PDF. If I have a source that is not that good, I unpack it with pdfimages and run Scan Tailor. But most times I just load the PDF into ABBYY.
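
For anyone who wants to script the unpacking step, a minimal sketch follows. It assumes the pdfimages tool from Poppler/Xpdf is installed and on the PATH; the function names, the "page" prefix, and the output folder are illustrative choices. The -j option asks pdfimages to write embedded JPEG data out as .jpg files rather than converting it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pdfimages_command(pdf: Path, out_prefix: Path) -> List[str]:
    """Build the pdfimages call; -j writes embedded JPEGs as .jpg files."""
    return ["pdfimages", "-j", str(pdf), str(out_prefix)]


def extract_page_images(pdf: Path, out_dir: Path) -> List[Path]:
    """Extract the scanned page images from a PDF into out_dir."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, out_dir / "page"), check=True)
    # pdfimages names its output <prefix>-NNN.<ext>
    return sorted(out_dir.glob("page-*"))
```

The extracted files can then be loaded into Scan Tailor or straight into the OCR program as individual page images.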
Old 02-22-2013, 06:44 PM   #14
DSpider
Evangelist
I think OCR-ing a PDF with FineReader is a bad idea, because it basically takes a snapshot (or "screenshot") of that PDF. So if the PDF only contains compressed images (JPG), you'd basically be importing a snapshot of a compressed image, instead of just importing the compressed image. Get it?

If you find it way too convenient to import a single (PDF) file into FineReader instead of multiple files (one for each page), then scan them as a multi-page TIFF file.
Old 02-23-2013, 10:34 AM   #15
Toxaris
Wizard
I don't think it works like that. As far as I can tell it does use the JPGs in the PDF. I see no difference in quality and it works fine.


All times are GMT -4.

