Old 09-01-2014, 02:22 AM   #1
tsolignani
Zealot
Posts: 117
Karma: 38608
Join Date: May 2012
Location: Vignola, Modena, Italy
Device: iPad
Best way to scan to PDF with a ScanSnap

I scan many paper books because I prefer to have them in electronic format.

After several tries, I ended up using PDF with OCR.

I own a ScanSnap iX500 and Acrobat Pro.

How would you suggest I scan the books? I mean DPI, image parameters (B&W, grayscale, color), etc.

Thank you.

Tiziano
Old 09-01-2014, 07:04 AM   #2
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
I always scan at 400 dpi in grayscale. I strongly advise using ABBYY for the OCR part. Be prepared for quite a lot of post-processing anyway.
Old 09-02-2014, 05:00 AM   #3
tsolignani
Zealot
Posts: 117
Karma: 38608
Join Date: May 2012
Location: Vignola, Modena, Italy
Device: iPad
Thank you.

Does choosing 400 DPI mean that using, say, 600 DPI, or any «better» resolution, would lead to worse results? Or are you choosing 400 DPI as a good compromise between file size and quality?

Why do you suggest ABBYY over Acrobat? For OCR performance alone, or for other reasons?

And by post-processing, do you mean fixing OCR problems, or something else?

Please forgive me for asking so many questions, I just would like to get it right before doing a great batch of books.

Thank you and have a nice day.
Old 09-02-2014, 06:02 AM   #4
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by tsolignani View Post
Does choosing 400 DPI mean that using, say, 600 DPI, or any «better» resolution, would lead to worse results? Or are you choosing 400 DPI as a good compromise between file size and quality?
It is an OK compromise between file size and quality. 300 DPI is the lowest I would go; that is good enough for accurate OCR.

Anything higher would be icing on the cake (although much larger filesize). If your book is full of images, you may want to scan those at a higher DPI, so you have more to work with if you are fixing/editing them.
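
To put rough numbers on the file-size trade-off, here is a small Python sketch that estimates the uncompressed size of one scanned page at several resolutions. The 5 x 8 inch page is an assumption for a typical paperback, and the function name is made up for illustration. Real PDFs are compressed, so actual files will be much smaller; only the ratios matter here.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed paperback page: 5 x 8 inches (hypothetical, adjust to your book).
WIDTH_IN, HEIGHT_IN = 5, 8

for dpi in (300, 400, 600):
    gray = raw_page_bytes(WIDTH_IN, HEIGHT_IN, dpi, 8)    # 8-bit grayscale
    color = raw_page_bytes(WIDTH_IN, HEIGHT_IN, dpi, 24)  # 24-bit color
    print(f"{dpi} dpi: grayscale {gray / 1e6:.1f} MB, color {color / 1e6:.1f} MB")
```

Raw size grows with the square of the DPI, so a 600 dpi page holds four times the data of a 300 dpi page, and a 400 dpi page about 1.8 times.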

Just last month, there was this topic, "DPI to use when scanning images":

https://www.mobileread.com/forums/sho...d.php?t=243418

Quote:
Originally Posted by tsolignani View Post
Why do you suggest ABBYY over Acrobat? For OCR performance alone, or for other reasons?
ABBYY FineReader is the most accurate OCR. More accurate OCR means far fewer man-hours of "post-processing" spent fixing the mistakes.

If you already own Adobe Acrobat Pro, then meh, that OCR is probably fine, but I would make a strong case for going with FineReader over all others.

There was also this topic, also from about a month ago, in which OCR was discussed. I would also recommend visiting the topic I linked to in Post #6 (which leads to even more sets of in-depth topics discussing the subject matter):

https://www.mobileread.com/forums/sho...d.php?t=243327

Quote:
Originally Posted by tsolignani View Post
And by post-processing, do you mean fixing OCR problems, or something else?
Yep, PDF is an abysmal input format. There is lots of work that has to be done to get the text into good shape. (See topics linked above for the details).

Quote:
Originally Posted by tsolignani View Post
Please forgive me for asking so many questions, I just would like to get it right before doing a great batch of books.
So say we all!

I don't know how anyone else feels, but it seems like every few weeks we get the exact same "How do I convert a PDF to ebook" questions. So I started just cross-linking to the previous topics, with my previous tome-length answers plus everyone else's discussions/ideas.

I think those topics + the mountain of other linked material will answer almost all of your PDF -> OCR -> text questions. If you have any more, of course, feel free to ask.

There is also this topic in the MobileRead Wiki, although some of that info might be a tiny bit dated:

https://wiki.mobileread.com/wiki/Digi...ooks_to_Ebooks

Last edited by Tex2002ans; 09-02-2014 at 06:22 AM.
Old 09-02-2014, 07:17 AM   #5
drjenkins
Addict
Posts: 250
Karma: 1702156
Join Date: Nov 2010
Device: Kindle Voyage
I use a ScanSnap iX500 to scan books and magazines to PDF. If you intend PDF to be your destination format, use Acrobat for OCR. If you want a flowing format to be your destination format, use ABBYY FineReader to OCR to a Word document.

Most often I scan to PDF using "Best Mode", Color & Grayscale 300 dpi.

For those of you without a ScanSnap, it comes bundled with Acrobat Pro (for Windows) and ABBYY FineReader. Your scan resolution options are:
  • Normal Mode - Color & Grayscale 150 dpi, Monochrome 300 dpi
  • Better Mode - Color & Grayscale 200 dpi, Monochrome 400 dpi
  • Best Mode - Color & Grayscale 300 dpi, Monochrome 600 dpi
  • Excellent Mode - Color & Grayscale 600 dpi, Monochrome 1200 dpi
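
For picking a mode against a target resolution, the list above can be put into a small lookup. This is only a sketch: the dpi values are the ones listed in this post, and `lowest_mode_for` is a hypothetical helper name, not anything in the ScanSnap software.

```python
# ScanSnap modes as listed above: (color/grayscale dpi, monochrome dpi).
SCANSNAP_MODES = {
    "Normal": (150, 300),
    "Better": (200, 400),
    "Best": (300, 600),
    "Excellent": (600, 1200),
}

def lowest_mode_for(target_dpi, monochrome=False):
    """Return the lowest mode whose dpi is at least target_dpi, or None."""
    index = 1 if monochrome else 0
    candidates = [
        (dpis[index], name)
        for name, dpis in SCANSNAP_MODES.items()
        if dpis[index] >= target_dpi
    ]
    return min(candidates)[1] if candidates else None

print(lowest_mode_for(300))                   # grayscale/color at 300 dpi or more
print(lowest_mode_for(400))                   # there is no 400 dpi grayscale mode
print(lowest_mode_for(400, monochrome=True))  # monochrome does offer 400 dpi
```

Going by the list, a 400 dpi grayscale target can only be met by jumping to the 600 dpi mode, while 400 dpi is available in monochrome.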
Old 09-02-2014, 07:31 AM   #6
Jellby
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
So there's something better than "best"?
Old 09-02-2014, 09:13 AM   #7
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
I get no more than average results with 300 DPI, especially with older paperbacks. With 400 DPI my OCR errors are drastically reduced.

The better/cleaner the source, the better the result. However, keep in mind that the OCR software has to 'guess' a lot of things, like paragraphs, outlining, etc. That is where the post-processing comes into play. That is why I created my Word add-in: to help me with that and greatly reduce the time I need for post-processing.
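
As an illustration of the kind of cleanup post-processing involves, here is a minimal Python sketch that rejoins hard-wrapped OCR lines into paragraphs and mends words hyphenated across line breaks. It is not the Word add-in mentioned above, just a hypothetical helper, and it assumes blank lines mark paragraph breaks, which OCR output does not always guarantee.

```python
import re

def reflow_ocr_text(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Word split across lines: drop the hyphen and glue the halves.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)

sample = "The scan-\nner reads both\nsides at once.\n\nA second para-\ngraph follows."
print(reflow_ocr_text(sample))
```

A rule this simple will also merge genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), which is part of why manual checking remains necessary.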
Old 09-02-2014, 09:31 AM   #8
drjenkins
Addict
Posts: 250
Karma: 1702156
Join Date: Nov 2010
Device: Kindle Voyage
Quote:
Originally Posted by Jellby View Post
So there's something better than "best"?
Yes, as Monty Burns would say, "Excellent".
Old 09-06-2014, 12:48 PM   #9
tsolignani
Zealot
Posts: 117
Karma: 38608
Join Date: May 2012
Location: Vignola, Modena, Italy
Device: iPad
Thank you.

I also use a Fujitsu iX500, and it does not let me scan at 400 dpi; it's only 300 or 600. At 600 it's incredibly slow, and I guess I would end up with a huge file.

I don't want to end up with flowable text, which is too much work; I am happy with a PDF with a text layer for searching, bookmarking, highlighting and such.

I'll have a look at the links posted.

Thanks again.
Old 09-14-2014, 12:44 PM   #10
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by tsolignani View Post
Thank you.

I also use a Fujitsu iX500, and it does not let me scan at 400 dpi; it's only 300 or 600. At 600 it's incredibly slow, and I guess I would end up with a huge file.

I don't want to end up with flowable text, which is too much work; I am happy with a PDF with a text layer for searching, bookmarking, highlighting and such.

I'll have a look at the links posted.

Thanks again.
You can also use Acrobat's ClearScan mode instead of exact PDF image to get much smaller PDF files of similar quality.

Also, when you have double-page scans or PDFs, you can use ABBYY FineReader, because it's very good at automatically splitting the pages, deskewing, etc. You can then process ABBYY's PDF (saved as a PDF image without OCR) in Acrobat's ClearScan mode for OCR and a smaller PDF.

I've noticed that ClearScan PDFs are faster to flip through on my e-ink readers than PDFs produced with the new ABBYY 12 or those downloaded from archive.org.

Scan Tailor is also a great (and free) tool for splitting, deskewing, cropping (eliminating dark and empty margins), despeckling, etc. scanned images in one go, which can afterwards be OCR-ed in Acrobat (ClearScan) or ABBYY. I use it on 300 dpi grayscale scans, without upscaling to 600 in Scan Tailor at the end, and choose b/w instead of grayscale output to remove shades and specks.

Last edited by markom; 09-14-2014 at 04:03 PM.
Old 09-17-2014, 09:34 AM   #11
tsolignani
Zealot
Posts: 117
Karma: 38608
Join Date: May 2012
Location: Vignola, Modena, Italy
Device: iPad
Thank you. I don't like the ClearScan technology: I had a bad experience with the text layer in some PDFs I had scanned with it. The text got garbled, and I ended up with no way to run OCR on them again. I also don't like the way the fonts look afterwards. I guess I'll stick with ordinary OCR despite the much bigger file size.

I didn't know about Scan Tailor; I'll have a look.

Thank you.


–
Cordially,

Tiziano Solignani, from Mac
http://blog.solignani.it
http://www.parolesottovetro.it
Old 09-17-2014, 09:39 AM   #12
Ghitulescu
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
Isn't the Fujitsu a document scanner? That is, it only scans loose sheets (probably both sides at once), not books?
Old 09-17-2014, 09:48 AM   #13
tsolignani
Zealot
Posts: 117
Karma: 38608
Join Date: May 2012
Location: Vignola, Modena, Italy
Device: iPad
I cut the books: I remove the spine, then scan.
Old 09-17-2014, 11:16 AM   #14
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by tsolignani View Post
Thank you. I don't like the ClearScan technology: I had a bad experience with the text layer in some PDFs I had scanned with it. The text got garbled, and I ended up with no way to run OCR on them again. I also don't like the way the fonts look afterwards. I guess I'll stick with ordinary OCR despite the much bigger file size.

I didn't know about Scan Tailor; I'll have a look.

[...]
You could have just saved that ClearScan PDF as images (e.g. with Acrobat; five minutes for an average book, I guess).

That way you'll get the original scans, to run OCR on them again.

For Scan Tailor, it's best to practice on several double-page scans beforehand.

It's fairly easy though, even for novices: you just have to get through these six steps, usually selecting "apply to all" at each step (for an average text book) once you are satisfied with the outcome on one page.

1. Fix Orientation
2. Split Pages
3. Deskew
4. Select Content
5. Margins
6. Output (for DjVu files they recommend 600 dpi output, but for PDF it's usually good enough to stay at the original 300; that will also be a lot faster than upscaling to 600)
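
To give a feel for what three of these steps do to the pixels, here is a toy Python sketch of page splitting, content selection and b/w output on a tiny grayscale image stored as nested lists. It is not Scan Tailor's actual algorithm: the split is assumed to be exactly in the middle, the threshold of 128 is arbitrary, and all function names are made up for illustration.

```python
def split_pages(image):
    """Step 2 (simplified): cut a double-page scan down the middle column."""
    mid = len(image[0]) // 2
    left = [row[:mid] for row in image]
    right = [row[mid:] for row in image]
    return left, right

def select_content(page, threshold=128):
    """Step 4 (simplified): crop to the bounding box of dark pixels."""
    rows = [y for y, row in enumerate(page) if any(p < threshold for p in row)]
    cols = [x for x in range(len(page[0])) if any(row[x] < threshold for row in page)]
    if not rows:
        return page  # blank page: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in page[rows[0]:rows[-1] + 1]]

def to_black_and_white(page, threshold=128):
    """Step 6 (simplified): binarize, 0 = black ink, 255 = white paper."""
    return [[0 if p < threshold else 255 for p in row] for row in page]

# Tiny 4x8 grayscale "double-page scan": 255 = paper, low values = ink.
scan = [
    [255, 255, 255, 255, 255, 255, 255, 255],
    [255,  40, 200, 255, 255,  90,  10, 255],
    [255,  60,  30, 255, 255, 255,  20, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
]

for page in split_pages(scan):
    print(to_black_and_white(select_content(page)))
```

The real tool detects the gutter, skew and content box per page and lets you correct them by hand, which is why practising on a few scans first pays off.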

https://www.youtube.com/watch?v=TNVNOiCpqhs

https://www.youtube.com/watch?v=dHZmTYTVL44

https://www.youtube.com/watch?v=ngj_MB2MlDM

Last edited by markom; 09-17-2014 at 02:00 PM.
Old 09-18-2014, 02:46 AM   #15
tsolignani
Zealot
Posts: 117
Karma: 38608
Join Date: May 2012
Location: Vignola, Modena, Italy
Device: iPad
Quote:
Originally Posted by markom View Post
You could have just saved that ClearScan PDF as images (e.g. with Acrobat; five minutes for an average book, I guess).

That way you'll get the original scans, to run OCR on them again.
I tried that (exporting to single TIFF files, then combining them back into a single PDF) a couple of times, but the quality I ended up with was really awful, so I gave up. The fonts looked really bad...

Thanks again.

