Old 01-20-2025, 12:09 AM   #1
kwik
Enthusiast
Posts: 30
Karma: 10
Join Date: Nov 2023
Device: Sony PRS-T3
book scanning - best practices?

I recently scanned my first book (Fast Food Nation) on a flatbed scanner using naps2, at 600dpi in greyscale (only the cover was scanned in colour). Some pages were not perfectly straight, so the result doesn't look professional, though I trimmed each image so that no shadow from the curved paper is visible. This took me about a day.

Result
The PDF is searchable thanks to the built-in OCR in naps2. The generated PDF is about 500MB, which is unreasonable. I was able to reduce it to around 50MB with a PDF shrinking app called Densify, though some quality is lost. I am not sure if I am doing things the correct way. I am probably not, right?
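
As a rough illustration of why a 600dpi greyscale scan grows so large, here is a minimal Python sketch of the uncompressed size of one page at different settings. The 5 x 8 inch page size is an assumption for illustration, not a figure from the post, and real PDFs are compressed, so actual files will be smaller than these raw numbers.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Hypothetical 5 x 8 inch paperback page (assumed, not measured).
grey_600 = raw_page_bytes(5, 8, 600, 8)  # 8-bit greyscale at 600 dpi
grey_300 = raw_page_bytes(5, 8, 300, 8)  # 8-bit greyscale at 300 dpi
mono_600 = raw_page_bytes(5, 8, 600, 1)  # 1-bit black and white at 600 dpi

print(grey_600, grey_300, mono_600)
```

Halving the resolution cuts the pixel count to a quarter, and going from 8-bit greyscale to 1-bit black and white cuts it to an eighth, which is why text-only pages are often stored as bitonal images.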

Questions
* I would like some tips & tricks to make the job easier / quicker.
* I would like to make an epub instead of PDF. Best way to go about this?
* I would like to be able to extract the OCR text separately instead of only having the OCR'd text searchable in the PDF.

I am looking for any tips & tricks that you might be willing to share to make scanning books easier / quicker / more efficient.
Old 01-20-2025, 12:44 AM   #2
Karellen
Wizard
Posts: 1,611
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
Not sure if you want to continue using pdf or convert to epub, but this thread is quite good, and in my post here I detail steps I used to scan a book and create an epub with links to software.

Last edited by Karellen; 01-20-2025 at 01:12 AM. Reason: fix link
Old 01-21-2025, 12:36 AM   #3
kwik
Enthusiast
Posts: 30
Karma: 10
Join Date: Nov 2023
Device: Sony PRS-T3
Fantastic, thank you Karellen. I am mostly interested in epub because it displays so much better in my ereader, but I don't mind having both options available to me. I have installed ScanTailor Advanced, tesseract-ocr and gImageReader in Linux & will try them soon.
Old 01-21-2025, 07:44 AM   #4
Turtle91
A Hairy Wizard
Posts: 3,347
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
I think it was mentioned in the thread Karellen linked, but you can find lots of information on scanning books at DIY Book Scanner. In addition to the software, they can help with your hardware setup… some of the more advanced setups boast several hundred pages per hour scanned and processed!
Old 01-21-2025, 09:35 AM   #5
nezih
Enthusiast
Posts: 43
Karma: 14828
Join Date: Feb 2023
Device: Boox Page, Kobo Aura SE
  • Postprocess the scanned pages with ScanTailor (https://github.com/4lex4/scantailor-advanced). It makes it pretty easy to fix the skew you mentioned, among other things.
  • Merge the ScanTailor output files with Adobe Acrobat and OCR them via ClearScan (named "Editable text and images" in newer Acrobat DC versions). This will basically vectorize the OCRed text.
  • gImageReader is the only usable Tesseract GUI imo. However, if you can use Finereader, it can output the OCRed text in many formats, ePub being one of them. Since OCR is not 100% accurate, creating pretty-looking, proofread epubs is a very exhausting process, but at least Finereader's epub output eases the chore a bit.
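
For anyone who would rather skip the GUI and get the OCR text as separate files, here is a minimal Python sketch that builds one Tesseract command-line call per page image. It assumes the `tesseract` binary is installed and that the processed pages are TIFF files in a folder called `out`; both the folder name and the helper names are illustrative assumptions, not part of any tool mentioned above.

```python
from pathlib import Path

def tesseract_command(image_path, lang="eng"):
    """Build the tesseract call that writes <image name without extension>.txt."""
    image = Path(image_path)
    output_base = image.with_suffix("")  # tesseract appends ".txt" itself
    return ["tesseract", str(image), str(output_base), "-l", lang]

def commands_for_folder(folder, pattern="*.tif"):
    """One command per page image, in sorted filename order."""
    return [tesseract_command(p) for p in sorted(Path(folder).glob(pattern))]

# Example: show the command for a single hypothetical page image.
print(tesseract_command("out/page_001.tif"))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` to actually run the OCR, leaving one plain-text file per page next to the images.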
Old 01-22-2025, 12:50 PM   #6
Quoth
Still reading
Posts: 14,010
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
No-one needs Adobe Acrobat.
Old 01-22-2025, 01:34 PM   #7
JSWolf
Resident Curmudgeon
Posts: 79,740
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by Quoth View Post
No-one needs Adobe Acrobat.
I do sometimes. But never for novels or other books. The last thing I read using Acrobat was a manual for a TV.
Old 01-22-2025, 04:10 PM   #8
nezih
Enthusiast
Posts: 43
Karma: 14828
Join Date: Feb 2023
Device: Boox Page, Kobo Aura SE
For reading, it should absolutely be avoided; however, there is no open-source alternative to Adobe ClearScan.
Old 01-22-2025, 06:00 PM   #9
Turtle91
A Hairy Wizard
Posts: 3,347
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Quote:
Originally Posted by nezih View Post
For reading absolutely it should be avoided, however there is no open source alternative to Adobe ClearScan.
It doesn’t bother me at all which OCR software you wish to use, but for others who may be reading this: there isn’t really a huge benefit to Adobe ClearScan. Adobe themselves say it gives no increase in accuracy, just a decrease in file size. If your end format is epub, then you do not really care about the PDF file size.

Quote:
Is OCR accuracy any different between ClearScan and Searchable Image styles?
No. The accuracy will be identical for input files of the same dpi. However, since ClearScan files are so much smaller, you might consider using a 600 dpi input file as a starting point since there is little downside other than processing time.

Can I make changes to the text in a ClearScan file?
No. The Touchup Text Tool does not currently work on ClearScan files.

If you want to avoid the hefty Adobe price/subscription then there are other options… some free… that provide just as good OCR. Take a look at GOCR, FreeOCR, or SimpleOCR… among many others. They can OCR directly from images/TIFFs and you won’t need to go the PDF route at all - for that path lies the dark side…

When all is said and done, OCR is still not perfect no matter which software you use. You will still have to read and make manual corrections. The key is getting the clearest scan/image that you can to begin with. Please see the DIY Book Scanner site for in-depth discussions about how to get the best scan.
Cheers!

Last edited by Turtle91; 01-22-2025 at 06:06 PM.
Old 01-23-2025, 08:48 AM   #10
Sarmat89
Fanatic
Posts: 516
Karma: 2268308
Join Date: Nov 2015
Device: none
Quote:
Originally Posted by Quoth View Post
No-one needs Adobe Acrobat.
There is no alternative for it for creating PDF files.
Old 01-23-2025, 05:53 PM   #11
Turtle91
A Hairy Wizard
Posts: 3,347
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Quote:
Originally Posted by Sarmat89 View Post
There is no alternative for it for creating PDF files.
Not sure what you are trying to say. There are obviously many, many other ways to create a PDF document besides Acrobat?!

Can you please clarify what you mean?
Old 01-30-2025, 05:34 PM   #12
retiredbiker
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
I made this book scanner years ago out of scrap wood. I have had a variety of lights and cameras on it, including the somewhat ridiculous looking LED floodlight and old video camera in the picture. But it does the job...I can comfortably scan a page about every 10 seconds.

The V-tray and glass on top of the book keep it nice and flat...no need to correct for curl or keystoning or whatever. Resolution is totally up to how I set the camera. 300dpi is usually fine for tesseract OCR.
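
With a camera there is no dpi setting as such; the effective resolution depends on how many pixels the page covers in the photo. A minimal Python sketch of that arithmetic follows. The page height and pixel counts are made-up numbers for illustration, and the function names are hypothetical.

```python
def effective_dpi(pixels_along_edge, edge_length_in):
    """Pixels captured per inch of paper along one page edge."""
    return pixels_along_edge / edge_length_in

def pixels_needed(target_dpi, edge_length_in):
    """Pixels the page must span along one edge to reach target_dpi."""
    return round(target_dpi * edge_length_in)

# Hypothetical numbers: a page 9 inches tall that spans 2700 pixels of the photo.
print(effective_dpi(2700, 9))
print(pixels_needed(300, 9))
```

If the page fills only part of the frame, only the pixels that actually cover the page count, so moving the camera closer or zooming in raises the effective dpi.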

OCRFeeder is the tesseract front-end I use. I always OCR page by page to handle things like double or triple columns, advertisements, "continued on page 107" and so on. Also, if there is a real scan/OCR problem, I discover it ON THAT PAGE, not later, buried somewhere in 100,000 words.

This gives me jpg images directly, no need to mess with PDF nonsense. I do use ScanTailor sometimes if the original physical book is horrible. OCR the images, text into Writer for proofing and styling, straight to epub with Sigil or Calibre.
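
Before the text goes into a word processor for proofing, per-page OCR output usually needs stitching together. The following is a minimal Python sketch of one way to do that, not a description of the workflow above: it joins the pages, rejoins words that were hyphenated across line breaks, and unwraps single line breaks inside paragraphs. The function name is hypothetical, and paragraph breaks that fall exactly on a page boundary are not detected.

```python
import re

def join_pages(pages):
    """Join per-page OCR text into one string ready for proofreading."""
    text = "\n".join(page.strip() for page in pages)
    # "exam-\nple" becomes "example". This also merges genuine compounds
    # that happened to break at a line end, so proofreading is still needed.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline inside a paragraph becomes a space;
    # blank lines are kept as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

pages = [
    "The quick brown fox jum-\nped over",
    "the lazy dog.\n\nNew paragraph.",
]
print(join_pages(pages))
```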
Attached image: Book Scanner with Floodlight.JPG