03-09-2012, 02:11 PM | #1 |
Enthusiast
Posts: 47
Karma: 10
Join Date: Jun 2009
Device: Nook touch, iPad, Xoom
|
Any suggestions for workflow improvement?
Hello folks,
My current workflow includes the following steps:
- Scan on an Epson Perfection V300 Photo (as 300 dpi color TIFF images)
- ScanTailor to trim and perfect the scanned images (export as 300 dpi TIFF images)
- Acrobat to assemble into a PDF and OCR (as searchable image)
I understand that Acrobat OCR is not the best, but my primary goal is to be able to highlight the text. I have hardly ever copied and pasted scanned book text anywhere for any purpose, so Acrobat does the job for me. The issue I sometimes face is the large size of the output file. Does anyone have a recommendation for changing the workflow so that I get an optimum file size with good image quality? I searched up and down the Acrobat forums and tweaked options, but that didn't yield much, so I thought maybe I should play with the input files. I don't know much about images, so I am asking here. (P.S. - I installed tesseract-ocr from Google Code but couldn't figure out how to use it, any idea? I hope it doesn't need knowledge of programming.) |
03-09-2012, 04:35 PM | #2 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Because you need a GUI for tesseract.
About ScanTailor, it outputs uncompressed TIFF images by default, at 600 dpi. If you chose black and white instead of grayscale, Acrobat should've taken care of the compression. But whatever you do, don't resample them because it adds unnecessary antialiasing (blurriness), which doesn't compress very well, effectively making it NOT a 1 bit image anymore. |
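For what it's worth, Tesseract itself is a command-line program, so a GUI is a convenience rather than a hard requirement. Below is a minimal sketch of driving it from a small Python script, assuming the `tesseract` executable is on your PATH. The file name `page_001.tif` and the helper names are only illustrative, and the trailing `pdf` argument (which asks for a searchable PDF) is available in newer Tesseract releases but may be missing from older builds, in which case drop it to get plain text.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng", searchable_pdf=True):
    """Build the argument list for: tesseract <image> <output base> -l <lang> [pdf]."""
    image = Path(image_path)
    output_base = image.with_suffix("")  # tesseract appends .txt or .pdf itself
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if searchable_pdf:
        cmd.append("pdf")  # assumption: a release that ships the "pdf" config
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract on one image; raises if the executable cannot be found."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    subprocess.run(tesseract_command(image_path, **kwargs), check=True)


# Show the command that would be run for one (hypothetical) page.
print(" ".join(tesseract_command("page_001.tif")))
```

Calling `run_ocr("page_001.tif")` would then write `page_001.pdf` next to the image, provided the installed version supports PDF output.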
03-09-2012, 04:45 PM | #3 |
Enthusiast
Posts: 47
Karma: 10
Join Date: Jun 2009
Device: Nook touch, iPad, Xoom
|
So you are saying that exporting images from ScanTailor at 300 dpi is what makes them a little blurry? I was doing so because my input image is 300 dpi, so I was thinking that by exporting at 600 dpi I would just be adding in pixels that are not there.
As I said, my understanding of images is zero. I tried to look into it, but it was a rather tough nut for my little brain. And is there any link for a GUI? Thanks |
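To put rough numbers on the 300 vs 600 dpi question, here is a small arithmetic sketch of uncompressed image sizes. The 6 x 9 inch page is an assumed example size, not something from this thread, and real TIFF files add a little header overhead on top.

```python
import math


def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed pixel data size for a page scanned at the given dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded up to a whole number of bytes.
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


# Assumed page size: 6 x 9 inches.
color_300 = raw_size_bytes(6, 9, 300, 24)  # 24-bit color
gray_300 = raw_size_bytes(6, 9, 300, 8)    # 8-bit grayscale
bw_600 = raw_size_bytes(6, 9, 600, 1)      # 1-bit black and white

print(color_300, gray_300, bw_600)  # 14580000 4860000 2430000
```

So even before any compression, a 600 dpi 1-bit page carries about a sixth of the data of a 300 dpi color page, despite having four times as many pixels.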
03-09-2012, 04:49 PM | #4 | |
Kindler of the Flame
Posts: 582
Karma: 646016
Join Date: Oct 2009
Location: US of A
Device: K DX,3,KT,KP,KF, KFHD; Nook C, PRS600, iPad, Xoom, N900, N810, Zaurus
|
You could improve the OCR results with the following steps:
1. Scan 300 dpi grayscale TIFF (JPG loses too much data).
2. Process the scans with ScanKromsator and upscale to 600 dpi B/W. It's hard to learn ScanKromsator, but it's worth it: it's the best (and free!) scan processing app out there. Look for a tutorial called Scan and Share. If used correctly, the results out of SK exceed all my expectations and look much better than the original.
3. OCR with FineReader.
4. Save to PDF in FineReader with the text layer. Or just use the OCR text to create an EPUB or MOBI.
Quote from the Internet:
Last edited by osnova; 03-09-2012 at 05:45 PM. |
|
03-10-2012, 10:58 AM | #5 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
I meant you can scan at 300 dpi, import the scans into Scan Tailor, and Scan Tailor outputs 600 dpi (by default). Don't worry about "adding in pixels that are not there". Pages will look just fine, because the antialiasing that I was telling you about is done "on the fly" by the e-reader when it displays the page at a smaller scale.
Can you guess which one compresses better? Again, do not resample what you get from Scan Tailor using Acrobat. The 600 dpi images you get from Scan Tailor are just fine, as long as they're 1-bit images (black and white), with none of that gradient stuff when you zoom in. Last edited by DSpider; 03-10-2012 at 11:01 AM. |
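A rough way to see the compression point for yourself is the sketch below. It builds a synthetic "page" of random black/white runs, stores it once as packed 1-bit data and once as 8-bit data with a gray pixel at every edge (a crude stand-in for what resampling does), and compresses both with zlib. The assumptions are loud here: the page is synthetic, the edge grays are random, and zlib is only a stand-in for the codecs a PDF would really use on bilevel images, so read the numbers as a trend rather than a measurement.

```python
import random
import zlib


def make_bilevel_rows(width=512, height=64, seed=0):
    """Synthetic page: rows of random black/white runs (0 = black, 1 = white)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(height):
        row, value = [], 1
        while len(row) < width:
            row.extend([value] * rng.randint(2, 9))
            value ^= 1  # alternate between white and black runs
        rows.append(row[:width])
    return rows


def pack_1bit(rows):
    """Pack 8 pixels per byte, the way a 1-bit image stores them."""
    out = bytearray()
    for row in rows:
        for i in range(0, len(row), 8):
            byte = 0
            for bit in row[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
    return bytes(out)


def antialias_8bit(rows, seed=1):
    """Store one byte per pixel and replace every edge pixel with a gray value."""
    rng = random.Random(seed)
    out = bytearray()
    for row in rows:
        for i, px in enumerate(row):
            if i > 0 and row[i - 1] != px:
                out.append(rng.randint(1, 254))  # gray edge pixel
            else:
                out.append(255 * px)
    return bytes(out)


rows = make_bilevel_rows()
bilevel_size = len(zlib.compress(pack_1bit(rows), 9))
gray_size = len(zlib.compress(antialias_8bit(rows), 9))
print("1-bit:", bilevel_size, "bytes; antialiased 8-bit:", gray_size, "bytes")
```

The 1-bit version comes out smaller, since the gray edge pixels add information that the compressor has to store.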
04-13-2012, 09:01 AM | #6 |
Karmaniac
Posts: 2,553
Karma: 11499146
Join Date: Oct 2008
Location: Miami FL
Device: PRS-505, Jetbook, + Mini, +Color, Astak Ez Reader Pro, PPW1, Aura H2O
|
I would suggest finding (at least part of) the book online in text form and ripping it from there!
OCR takes a lot of time. If your OCR often makes the same errors (e.g. it reads the letter 'l' in the word 'all' as the digit '1' and thus outputs 'a11'), just use Notepad++ and replace every 'a11' with 'all'. Creating an advanced set of such rules will save you a lot of correction/proofreading time. |
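The same rule-based cleanup can be scripted instead of done by hand in an editor. This is a minimal sketch, assuming a small correction table that you build up from your own OCR output; the entries other than 'a11' are hypothetical examples. It matches whole words only, so something like a room number 'a112' is left alone.

```python
import re

# Correction table: misread word -> intended word.
# Only "a11" comes from the post above; the rest are made-up examples.
CORRECTIONS = {"a11": "all", "wi11": "will", "te11": "tell"}


def fix_ocr_errors(text, corrections=CORRECTIONS):
    """Replace known OCR misreads, matching whole words only."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in corrections) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)


print(fix_ocr_errors("We a11 know room a112 is closed."))
# prints: We all know room a112 is closed.
```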
|