Old 03-09-2012, 03:11 PM   #1
crackhammer
Enthusiast
Posts: 38
Karma: 10
Join Date: Jun 2009
Device: Nook touch, iPad, Xoom
Any suggestions for workflow improvement?

Hello folks,

My current workflow includes the following steps:
- Scan on an Epson Perfection V300 Photo (as 300 dpi color TIFF images)
- ScanTailor to trim and perfect the scanned images (export as 300 dpi TIFF images)
- Acrobat to assemble into a PDF and OCR (as searchable image)

I understand that Acrobat OCR is not the best but my primary goal is to be able to highlight the text. I have hardly ever copied and pasted scanned book text anywhere for any purpose so Acrobat does the job for me.

The issue I sometimes face is the large size of the output file. Does anyone have a recommendation for changing the workflow so that I get an optimum file size with good image quality? I searched up and down the Acrobat forums and tweaked options, but that didn't yield much, so I thought maybe I should play with the input files. I don't have much of a clue about images, though, so I am asking here.

(P.S. I installed tesseract-ocr from Google Code but couldn't figure out how to use it. Any ideas? I hope it doesn't need knowledge of programming.)
Old 03-09-2012, 05:35 PM   #2
DSpider
Evangelist
Posts: 427
Karma: 326969
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
You couldn't figure out tesseract because you need a GUI for it.
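
For what it's worth, tesseract on its own is a command-line program, so the basic call needs a console window rather than programming. A minimal sketch in Python that only builds and runs that basic command for each page; the file name is hypothetical, and it assumes a `tesseract` executable on the PATH with English language data installed. The plain call shown here writes a text file per page, not a highlightable PDF.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the basic command line: tesseract <image> <output base> -l <lang>.

    Tesseract appends the extension itself, so 'page_001.tif' produces
    'page_001.txt' next to the image.
    """
    image = Path(image_path)
    output_base = image.with_suffix("")  # strip '.tif'; tesseract adds '.txt'
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(image_paths, lang="eng"):
    """Run tesseract on every page and return the expected text files."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    outputs = []
    for path in image_paths:
        subprocess.run(tesseract_command(path, lang), check=True)
        outputs.append(Path(path).with_suffix(".txt"))
    return outputs


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(" ".join(tesseract_command("page_001.tif")))
```

The printed line is what you would type into the console yourself for a single page.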

About ScanTailor: it outputs uncompressed TIFF images by default, at 600 dpi. If you had chosen black and white instead of grayscale, Acrobat should have taken care of the compression. But whatever you do, don't resample them, because resampling adds unnecessary antialiasing (blurriness), which doesn't compress very well and effectively makes the result NOT a 1-bit image anymore.
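
A rough way to see the compression point for yourself, as a sketch under stated assumptions: the "page" below is synthetic (random black bars standing in for text), the 3x3 box blur is only a crude stand-in for what resampling does, and zlib stands in for the compressors a PDF would actually use (CCITT G4 or JBIG2 for 1-bit images, Flate or JPEG for grayscale). None of these names or numbers come from Scan Tailor or Acrobat.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256


def make_page(seed=42):
    """Synthetic black-and-white page: 0 = black ink, 255 = white paper."""
    rng = random.Random(seed)
    page = [[255] * WIDTH for _ in range(HEIGHT)]
    # Rows of short black bars of varying width stand in for lines of text.
    for top in range(16, HEIGHT - 16, 16):
        x = 16
        while x < WIDTH - 16:
            bar = rng.randint(2, 9)
            for y in range(top, top + 8):
                for xx in range(x, min(x + bar, WIDTH - 16)):
                    page[y][xx] = 0
            x += bar + rng.randint(2, 6)
    return page


def pack_1bit(page):
    """Pack eight pixels per byte, the way a 1-bit image stores them."""
    data = bytearray()
    for row in page:
        for x in range(0, WIDTH, 8):
            byte = 0
            for pixel in row[x:x + 8]:
                byte = (byte << 1) | (1 if pixel == 0 else 0)
            data.append(byte)
    return bytes(data)


def antialias(page):
    """3x3 box blur: turns hard edges into gray gradients (8 bits per pixel)."""
    out = []
    for y in range(HEIGHT):
        row = []
        for x in range(WIDTH):
            total = count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < HEIGHT and 0 <= xx < WIDTH:
                        total += page[yy][xx]
                        count += 1
            row.append(total // count)
        out.append(row)
    return out


page = make_page()
bilevel_size = len(zlib.compress(pack_1bit(page), 9))
gray_bytes = bytes(value for row in antialias(page) for value in row)
gray_size = len(zlib.compress(gray_bytes, 9))

print("1-bit, compressed:      ", bilevel_size, "bytes")
print("antialiased, compressed:", gray_size, "bytes")
```

The antialiased version starts out eight times larger and its gray edge pixels give the compressor far less repetition to work with, so it stays larger after compression.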
Old 03-09-2012, 05:45 PM   #3
crackhammer
So you are saying that exporting images from ScanTailor at 300 dpi is making them a little blurry? I was doing that because my input images are 300 dpi, so I thought that by exporting at 600 dpi I would just be adding in pixels that are not there.
As I said, my understanding of images is zero. I tried to look into it, but it was a rather tough nut for my little brain.

And do you have a link for a GUI?

Thanks
Old 03-09-2012, 05:49 PM   #4
osnova
Kindler of the Flame
Posts: 583
Karma: 646016
Join Date: Oct 2009
Location: US of A
Device: K DX,3,KT,KP,KF, KFHD; Nook C, PRS600, iPad, Xoom, N900, N810, Zaurus
You could improve the OCR results with the following steps:

1. scan 300 dpi grayscale tiff (jpg loses too much data)
2. process the scans with ScanKromsator and upscale to 600 dpi B/W. ScanKromsator is hard to learn, but it's worth it: it's the best (and free!) scan processing app out there. Look for a tutorial called Scan and Share. Used correctly, the results out of SK exceed all my expectations and look much better than the original.
3. OCR with FineReader
4. Save to pdf in FineReader with the text layer. Or just use the OCR text to create an epub or mobi.

Quote from the Internet:
Quote:
I would say ScanKromsator is the best tool I have ever dealt with for the process of creating ebooks from scans. It just processes images, splitting dual pages, autodetecting borders, adjusting page width and height, and a bunch of other basic operations that can be batch-processed and which are not usually found in any other software. The interface is quite complex and with many settings, while good documentation on program use is lacking, but anyway, it is worth the pain of learning by trial and error.
The best scanner for these purposes is OpticBook 3600 (or even more expensive models from Plustek).

Last edited by osnova; 03-09-2012 at 06:45 PM.
Old 03-10-2012, 11:58 AM   #5
DSpider
I meant you can scan at 300 dpi, import the scans into Scan Tailor, and let Scan Tailor output them at 600 dpi (its default). Don't worry about "adding in pixels that are not there". Pages will look just fine, because the antialiasing I was telling you about is done on the fly by the e-reader when it displays the page at a smaller scale.



Can you guess which one compresses better?

Again, do not resample what you get from Scan Tailor using Acrobat. The 600 dpi images you get from Scan Tailor are just fine, as long as they're 1-bit images (black and white), with none of that gradient stuff when you zoom in.
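
To put rough numbers on why 600 dpi black and white is not a size problem, here is a back-of-the-envelope sketch of the uncompressed size of one page at different settings. The 6 x 9 inch page size is an assumption picked for illustration, not something from this thread, and real files end up smaller once compressed.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size in inches (hypothetical, for illustration only).
PAGE = (6, 9)

for label, dpi, bpp in [
    ("300 dpi, 24-bit color", 300, 24),
    ("300 dpi, 8-bit grayscale", 300, 8),
    ("600 dpi, 1-bit black and white", 600, 1),
]:
    size = raw_page_bytes(*PAGE, dpi, bpp)
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

For that page size this works out to about 14.6 MB, 4.9 MB and 2.4 MB per page: even with four times as many pixels, the 600 dpi 1-bit page is half the raw size of the 300 dpi grayscale one and a sixth of the color one.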

Last edited by DSpider; 03-10-2012 at 12:01 PM.
Old 04-13-2012, 10:01 AM   #6
ProDigit
Karmaniac
Posts: 2,157
Karma: 9023682
Join Date: Oct 2008
Location: Miami FL
Device: PRS-505, Jetbook, Jetbook Mini, Jetbook Color, Astak Ez Reader Pro
I would suggest finding the book (or at least part of it) online in text form and ripping it from there!
OCR takes a lot of time.

If your OCR often makes the same errors (e.g. it reads the letter 'l' in the word 'all' as the digit '1' and thus outputs 'a11'), just use Notepad++ and replace every 'a11' with 'all'.
Creating an advanced set of rules will save you a lot of correction and proofreading time.
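
If the OCR text ends up in a plain text file, the same kind of rule set can also be scripted. A minimal sketch in Python; the rules below are hypothetical examples of the idea, not a tested list, and each one should be checked against your own text, since a rule that is too greedy will damage real numbers.

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs, applied in order.
RULES = [
    # Letters followed by '11' at the end of a word: a11 -> all, we11 -> well
    (re.compile(r"\b([A-Za-z]+)11\b"), r"\g<1>ll"),
    # A lone '1' between lowercase letters: wor1d -> world
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
    # A lone '0' between lowercase letters: b0ok -> book
    (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),
]


def fix_ocr(text):
    """Apply every correction rule in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "a11 is we11 in the wor1d, page 11 of b0ok 2011"
    print(fix_ocr(sample))
```

The rules only fire when the digit sits next to letters, which is why "page 11" and "2011" in the sample are left alone.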
All times are GMT -4.