Any suggestions for workflow improvement?

crackhammer · 03-09-2012, 02:11 PM

Hello folks,

My current workflow includes the following steps:
- Scan on an Epson Perfection V300 Photo (as 300 dpi color TIFF images)
- ScanTailor to trim and perfect the scanned images (export as 300 dpi TIFF images)
- Acrobat to assemble into a PDF and OCR (as searchable image)

I understand that Acrobat OCR is not the best, but my primary goal is to be able to highlight the text. I have hardly ever copied and pasted scanned book text anywhere for any purpose, so Acrobat does the job for me.

The issue I sometimes face is the large size of the output file. Does anyone have a recommendation for changing the workflow so that I get an optimum file size with good image quality? I searched up and down the Acrobat forums and tweaked options, but that didn't yield much, so I thought maybe I should play with the input files. I don't have much of a clue about images, so I am asking the question here.

(P.S. - I installed tesseract-ocr from Google Code but couldn't figure out how to use it. Any idea? I hope it doesn't need knowledge of programming.)

DSpider · 03-09-2012, 04:35 PM

That's because tesseract is a command-line program; you need a GUI front end for it.
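
For reference, Tesseract can also be run without a GUI, since it is a command-line program. Here is a minimal sketch of driving it from Python, assuming the `tesseract` executable is on the PATH and is a build that ships the `pdf` output config (older builds only write plain text). The file names and the helper names `build_tesseract_command` and `run_ocr` are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", searchable_pdf=True):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG [pdf]."""
    command = ["tesseract", image_path, output_base, "-l", lang]
    if searchable_pdf:
        # Assumption: this build includes the "pdf" config, which writes
        # OUTPUTBASE.pdf containing the page image plus a hidden text layer.
        command.append("pdf")
    return command


def run_ocr(image_path, output_base, lang="eng", searchable_pdf=True):
    """Run Tesseract on one page image; raise if the program is not installed."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, searchable_pdf)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
```

Calling `run_ocr("page_001.tif", "page_001")` would write `page_001.pdf`; with `searchable_pdf=False` Tesseract writes `page_001.txt` instead.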

As for ScanTailor, it outputs uncompressed TIFF images at 600 dpi by default. If you had chosen black and white instead of grayscale, Acrobat should have taken care of the compression. But whatever you do, don't resample the images, because resampling adds unnecessary antialiasing (blurriness) that doesn't compress very well, and the result is effectively NOT a 1-bit image anymore.
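
To make the resampling point concrete, here is a small sketch in plain Python. It is not what Acrobat does internally; it uses simple block averaging as a stand-in for a resampling filter, on a made-up 8x8 test page, to show that a pure black-and-white image picks up intermediate gray values once it is resampled.

```python
def box_downsample(pixels, factor):
    """Shrink a grayscale image by averaging each factor-by-factor block."""
    height, width = len(pixels), len(pixels[0])
    result = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [pixels[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) // len(block))
        result.append(row)
    return result


def gray_levels(pixels):
    """Return the sorted list of distinct pixel values in the image."""
    return sorted({value for row in pixels for value in row})


def make_test_page(size=8):
    """White page (255) with a black square (0) placed at an odd offset."""
    return [[0 if 1 <= y <= 4 and 1 <= x <= 4 else 255 for x in range(size)]
            for y in range(size)]


page = make_test_page()
print(gray_levels(page))                     # [0, 255]
print(gray_levels(box_downsample(page, 2)))  # [0, 127, 191, 255]
```

The original has only two values, so it fits in 1 bit per pixel. The resampled copy has four, so it has to be stored as grayscale, and the soft edges are what bilevel compression schemes cannot represent.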

crackhammer · 03-09-2012, 04:45 PM

So you are saying that exporting images from ScanTailor at 300 dpi is what makes them a little blurry? I was doing so because my input image is 300 dpi, so I was thinking that by exporting at 600 dpi I would just be adding in pixels that are not there.
As I said, my understanding of images is zero. I tried to look into it, but it was a rather tough nut for my little brain.

And is there a link for a GUI?

Thanks

osnova · 03-09-2012, 04:49 PM

You could improve the OCR results with the following steps:

1. scan 300 dpi grayscale tiff (jpg loses too much data)
2. process scans with ScanKromsator and upscale to 600 dpi B/W. ScanKromsator is hard to learn, but it's worth it: it's the best (and free!) scan processing app out there. Look for a tutorial called Scan and Share. If used correctly, the results out of SK exceed all my expectations and look much better than the original.
3. OCR with FineReader
4. Save to pdf in FineReader with the text layer. Or just use the OCR text to create an epub or mobi.

Quote from the Internet:

I would say ScanKromsator is the best tool I have ever dealt with for the process of creating ebooks from scans. It just processes images, splitting dual pages, autodetecting borders, adjusting page width and height, and a bunch of other basic operations that can be batch-processed and which are not usually found in any other software. The interface is quite complex and with many settings, while good documentation on program use is lacking, but anyway, it is worth the pain of learning by trial and error.

The best scanner for these purposes is OpticBook 3600 (or even more expensive models from Plustek).

DSpider · 03-10-2012, 10:58 AM

I meant that you can scan at 300 dpi, import the scans into Scan Tailor, and let Scan Tailor output at 600 dpi (its default). Don't worry about "adding in pixels that are not there". Pages will look just fine, because the antialiasing I was telling you about is done "on-the-fly" by the e-reader when it displays the page at a smaller scale.

Can you guess which one compresses better?

Again, do not use Acrobat to resample what you get from Scan Tailor. The 600 dpi images you get from Scan Tailor are just fine, as long as they're 1-bit images (black and white), with none of that gradient stuff when you zoom in.
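
A rough back-of-the-envelope calculation shows why 600 dpi black and white is not a size problem. The sketch below computes uncompressed sizes only (before any compression Acrobat applies) and assumes a hypothetical 6 x 9 inch page; the function name is made up for illustration.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size in bytes, with each pixel row padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


PAGE_W, PAGE_H = 6, 9  # hypothetical page size in inches

color_300 = uncompressed_bytes(PAGE_W, PAGE_H, 300, 24)  # 24-bit color scan
gray_600 = uncompressed_bytes(PAGE_W, PAGE_H, 600, 8)    # 8-bit grayscale
bw_600 = uncompressed_bytes(PAGE_W, PAGE_H, 600, 1)      # 1-bit black and white

print(color_300)  # 14580000
print(gray_600)   # 19440000
print(bw_600)     # 2430000
```

So under these assumptions a 600 dpi 1-bit page starts out about six times smaller than a 300 dpi color page, and eight times smaller than the same page saved as 600 dpi grayscale.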

ProDigit · 04-13-2012, 09:01 AM

I would suggest finding (at least part of) the book online in text form and ripping it from there!
OCR takes a lot of time.

If your OCR often makes the same errors (e.g. it reads the letter 'l' in the word 'all' as the digit '1' and thus outputs 'a11'), just use Notepad++ and replace every 'a11' with 'all'.
Creating an advanced set of rules will save you a lot of correction/proofreading time.
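
The same idea can be scripted instead of done by hand in an editor. Here is a minimal sketch using Python regular expressions. Only the 'a11' rule comes from the post above; the other two rules and all names are hypothetical examples, and a real rule set would need to be built from the errors your own OCR actually makes.

```python
import re

# Each rule is (compiled pattern, replacement). \b keeps a rule from firing
# inside longer tokens, so "a11" is fixed but "a110" is left alone.
CORRECTION_RULES = [
    (re.compile(r"\ba11\b"), "all"),    # from the post above
    (re.compile(r"\bwi11\b"), "will"),  # hypothetical extra rule
    # Hypothetical, more aggressive rule: a digit 1 squeezed between two
    # lowercase letters becomes "l". It can damage real tokens such as
    # "h1n1", so check what it matches before applying it to a whole book.
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
]


def fix_ocr_text(text, rules=CORRECTION_RULES):
    """Apply every correction rule in order and return the cleaned text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(fix_ocr_text("We a11 wi11 go, but a110 stays."))
# We all will go, but a110 stays.
```

Keeping the rules in a list makes it easy to add one each time a new recurring error shows up during proofreading.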
