Old 11-08-2021, 11:40 PM   #11
Tex2002ans
Wizard
 
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by anonlivros View Post
I'm glad the tutorial continues to help more people. Thanks for sharing, Tex!


Quote:
Originally Posted by anonlivros View Post
3) I also strongly recommend using Scantailor for dewarp and crop, generating B&W images. This step must be prior to abbyy finereader.
ScanTailor Advanced is the fork I'd recommend.

And the steps would go:
  • Images/Scans
    • Follow the anonlivros tutorial (and/or DIY Book Scanner forums)
  • ScanTailor Advanced
    • Crop + Dewarp + Grayscale/B&W
  • ABBYY Finereader (or Tesseract)
    • Optical Character Recognition (OCR)

This gets you digitized PDFs with a text layer underneath, which lets you search/copy/paste, etc.

From there, you can do the usual PDF->EPUB steps. (Which are their own in-depth workflows.)
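The OCR stage of those steps can even be batch-scripted. A minimal sketch, assuming the `tesseract` CLI is installed and the ScanTailor cleanup was already done in its GUI; the folder layout and file names here are illustrative:

```python
# Batch-OCR a folder of cleaned page images into searchable PDFs.
# Assumes `tesseract` is on PATH; paths and extensions are examples.
import subprocess
from pathlib import Path

def build_ocr_commands(pages_dir: str, out_dir: str, lang: str = "eng"):
    """Build one `tesseract <image> <out_base> pdf` command per page image."""
    cmds = []
    for img in sorted(Path(pages_dir).glob("*.tif")):
        out_base = str(Path(out_dir) / img.stem)  # tesseract appends .pdf
        cmds.append(["tesseract", str(img), out_base, "-l", lang, "pdf"])
    return cmds

def run_ocr(pages_dir: str, out_dir: str):
    """Run the OCR commands, creating the output folder first."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_ocr_commands(pages_dir, out_dir):
        subprocess.run(cmd, check=True)
```

The per-page PDFs can then be merged with any PDF tool before moving on to the PDF->EPUB steps.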

Quote:
Originally Posted by anonlivros View Post
I know, I know... The tutorial needs to be updated.

I've learned some interesting things over the last few months, especially from sharing experience with Tex. (Which I reserve a gratitude that I couldn't properly express in this post)


I haven't checked the GitHub since to see if it's been expanded.

And we definitely have to meetup and talk again before the end of the year!

Quote:
Originally Posted by Quoth View Post
BUT my 2002 vintage flatbed scanner with optional ADF is close to 30 M pixels at 600 dpi for a larger book. The advantage of a camera is a V shaped holder to avoid damaging the spine. Pirates and professional scanning of cheap common books cut off the spine and use a duplex ADF.
Any of these methods work, but I'd rank everything as a balance of these 4 categories:

Non-Destructive->Destructive
  • Camera
  • Scanner*
  • Feed Scanner

High->Low Quality
  • Scanner
  • Feed Scanner
  • Camera*

Fast->Slow Speed
  • Camera
  • Feed Scanner
  • Scanner

High->Low Labor
  • Scanner
  • Camera
  • Feed Scanner

Non-Destructive vs. Destructive

If the book must stay intact, then Scanner or Camera.

If you don't mind destroying the book, then cutting the spine off + feeding it into the scanner as a stack of paper saves tons of time+labor.

* Note: If you have very fragile/large books, the Scanner may be too rough on the spine, so your only non-destructive choice is Camera + V-shaped holder.

Quality

Quality is the most important category, because every later stage depends on it.

Remember, digitizing books is an entire process, and getting pictures of pages is just Stage 1.

High quality input:
  • High DPI (300/600+) + great/even lighting + non-warped pages
  • = very accurate OCR + little image cleanup

Low quality images:
  • Low DPI + bad lighting (haloing) + warped pages
  • = OCR has more errors + more typos/formatting issues + fixing images takes longer

You may even have to redo a lot of your work when you stumble across a non-recoverable error later on. (Like a photograph/chart/graph being distorted beyond fixing... or horrible speckling that only appears when you try to B&W your image.)

* Note: 20+ years ago, most cameras were still too low-resolution. The images may have been readable to a human, but feeding that lower-quality image through OCR, you'd get a much higher error rate.

Within the past 10 years though, the quality of your typical cellphone camera has dramatically improved. Now, everyone carries something in their pockets that may work "well enough".
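As a rough sanity check on "well enough": the pixels needed at a given DPI are just page dimensions x DPI in each direction. A quick calculation (the page sizes below are examples, not from the post above):

```python
def required_megapixels(width_in: float, height_in: float, dpi: int) -> float:
    """Megapixels needed to capture a page at the target DPI."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000

# A 6"x9" trade paperback at 300 DPI:
#   1800 x 2700 px = ~4.9 MP -- within reach of any modern phone camera.
# The same page at 600 DPI needs ~19.4 MP, which is why high-DPI flatbed
# scans of larger books run into the tens of megapixels.
```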

This is where anonlivros's gooseneck+cellphone method comes into play. It's an extremely cheap (<$50) way of reusing materials you most likely already have (cellphone + lamp)... getting you 80%+ of the way there.

* * *

Recommendation: If you're doing conversion professionally though, go with a superb-quality scanning company like Golden Images. They've been recommended by Hitch + me quite a few times over the years.

The immaculate-quality scans will save you all that time+labor in the long run.

Higher quality input = better+faster conversion with fewer errors.

* * *

Speed

Cameras, when you get the workflow down, can take a few seconds per page.

Scanners, at higher DPI, are very slow.

Labor

Scanners + cameras require you to turn pages, line everything up, hold the book down, keep fingers out of the way, etc.

With feed scanners, you stick in a stack of paper and can go off and do something else.

Quote:
Originally Posted by Quoth View Post
Google uses a camera and the V shaped book holder. However much of their output is only good enough for "Search" and thus poor quality.
It's quite good. And all the raw images are still there as JPEG2000. They are much higher quality than the auto-generated B&W/Color PDFs.

A lot of this was discussed a few months ago in:

2021: "Archive.org ePub"

I even showed the difference between the auto-generated Archive.org "EPUB"s vs. EPUBs generated right out of FineReader with minimal intervention.

Note: Again, you can rapidly get 80%+ of the way there in your PDF->EPUB conversion, but that final quality push is what takes up the majority of the time:
  • Headings + TOCs
  • Splitting chapters
  • Formatting
    • bold/italics + super-/sub-script
    • blockquotes/poetry/indexes/scenebreaks
    • Alignment (left/center/right)
  • Fixing tables/figures/images
  • Clickable footnotes
  • Typo corrections
  • [...]
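Small slices of that cleanup can be scripted. For example, rejoining words that OCR left hyphenated across line breaks — a minimal sketch with a deliberately conservative rule (this is just one illustrative cleanup pass, not a full workflow):

```python
import re

def dehyphenate(text: str) -> str:
    """Rejoin words split across line breaks: 'exam-\\nple' -> 'example'.

    Conservative: only merges when a lowercase letter follows the break,
    so line-end hyphens before capitalized words are left untouched.
    """
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)
```

Rules like this always need proofreading afterward — legitimate hyphenated compounds split across lines will get merged too.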

99.99% text accuracy sounds fantastic too... until you actually read a book with 0.01% errors in it. (Across a full-length book, that's still dozens of scattered typos.)

Quote:
Originally Posted by Quoth View Post
A lot is on the Internet Archive and all of it as search results in Google Books. They don't bother with proofing. They also ought to have lost the court case because they are scanning and storing complete copyright works without the copyright holder's permission.
Fair Use. It's a library. And you don't "need permission".

There was even an article on that court case a few days ago:

Techdirt.com: "Internet Archive Would Like To Know What The Association Of American Publishers Is Hiding"

and a few days before that, on the ridiculousness of "library ebook licenses":

Techdirt.com: "Publishers Want To Make Ebooks More Expensive And Harder To Lend For Libraries; Ron Wyden And Anna Eshoo Have Questions"

Quote:
A UK campaign to fight that development in the world of academic publishing, called #ebookSOS, spells out the problems. Ebooks are frequently unavailable to institutions to license as ebooks. When they are on offer, they can be ten or more times the cost of the same paper book. [...] One title cost £29.99 as a physical book, and £1,306.32 for a single-user ebook license. As if those prices weren't high enough, it's common for publishers to raise the cost with no warning, and to withdraw ebook licenses already purchased. [...]

Quote:
Publishers are increasingly offering titles via an etextbook model, via third party companies, licensing content for use by specific, very restricted, cohorts of students on an annual basis. Quotes for these are usually hundreds, or sometimes thousands, times more than a print title, and this must be paid each year for new cohorts of students to gain access. This is exclusionary, restricts interdisciplinary research, and is unsustainable.
On Controlled Digital Lending (CDL), I'd also recommend checking out these fantastic webinars given by Internet Archive a few months ago:

Libraries already use these methods + Interlibrary Loans to lend scans of their own holdings (especially fragile/rare/non-public-facing material).

CDL allows that scanning to be consolidated, instead of every individual library having to (poorly) re-scan + re-digitize the same books.

The same goes for the Blind/Low-Vision/Disability associations + universities within each country, all independently digitizing. Why waste all that time and effort when you can do it once, in high-quality PDFs/EPUBs, then lend from there?

On universities digitizing, see the great webinar:

On Copyright/Libraries + "Permission Culture" see:

Side Note: Luckily, Internet Archive has been archiving scholarly articles as well:

And speaking of digitizing + Link Rot... ~18% of websites cited within scholarly articles are already dead:

This knowledge needs to be preserved + easily searchable/accessible.

Last edited by Tex2002ans; 11-09-2021 at 06:09 PM.