MobileRead Forums - View Single Post

Tex2002ans · 05-29-2021, 07:52 PM

Quote:

Originally Posted by salamanderjuice

One issue I've had with their PDFs is they don't do any sort of correction for yellowed pages so on a B&W eReader they can look like serious junk with banding in the background. Other than that it's fine.

Yep, their automatic Color->B&W doesn't work well for all books. (Though most do perfectly fine.)

But the great thing about Archive.org is they release all the source files.

So if you have problems with the B&W PDF, then instead download the:

Color PDF
Original source images [JPEG2000]

If you check out Post #4+#6 in that Tutorial thread, I showed the why/how.
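As a rough sketch of the "grab the other files instead" step: given the list of filenames on an item's download page, something like this could pick out the color PDF and the JPEG2000 bundle. The `_bw.pdf` and `_jp2.zip`/`_jp2.tar` suffixes are my assumptions about how the files are usually named, and the identifier `examplebook00auth` is made up, so check the actual file list for your book.

```python
from typing import Dict, List, Optional
from urllib.parse import quote

DOWNLOAD_BASE = "https://archive.org/download"


def pick_source_files(filenames: List[str]) -> Dict[str, Optional[str]]:
    """Pick the color PDF and the JPEG2000 image bundle from a file list."""
    color_pdf: Optional[str] = None
    jp2_bundle: Optional[str] = None
    for name in filenames:
        lower = name.lower()
        # Assumption: the B&W derivative ends in "_bw.pdf";
        # the first other PDF found is treated as the color one.
        if lower.endswith(".pdf") and not lower.endswith("_bw.pdf"):
            if color_pdf is None:
                color_pdf = name
        # Assumption: the page images are bundled as "*_jp2.zip" or "*_jp2.tar".
        elif lower.endswith(("_jp2.zip", "_jp2.tar")):
            if jp2_bundle is None:
                jp2_bundle = name
    return {"color_pdf": color_pdf, "jp2_bundle": jp2_bundle}


def download_url(identifier: str, filename: str) -> str:
    """Build a download URL for one file of an item."""
    return f"{DOWNLOAD_BASE}/{quote(identifier)}/{quote(filename)}"


# Hypothetical file list for a made-up item.
files = [
    "examplebook00auth_bw.pdf",
    "examplebook00auth.pdf",
    "examplebook00auth_jp2.zip",
    "examplebook00auth_djvu.txt",
]
picked = pick_source_files(files)
for kind, name in picked.items():
    if name is not None:
        print(kind, download_url("examplebook00auth", name))
```

This only builds the URLs. Actually fetching them (browser, wget, whatever) is left out on purpose.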

You can then use Scan Tailor Advanced to correct "yellowed pages" -> B&W. It lets you tweak all the variables to get a much better/cleaner B&W image.
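To illustrate why a tunable threshold matters for yellowed pages, here is a toy sketch in pure Python. It is NOT what Scan Tailor Advanced does internally. I'm swapping in Otsu's method (a standard global thresholding technique) and a made-up "page" of grayscale values, just to show how a fixed cutoff can turn uneven yellowed paper into banding while a threshold chosen from the data keeps the background clean.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return a threshold t (0-255): pixels <= t count as ink, > t as paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below t
    sum_dark = 0  # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels: List[int], t: int) -> List[int]:
    """Map each pixel to 0 (black) or 255 (white) using threshold t."""
    return [255 if p > t else 0 for p in pixels]


# Made-up data: dark ink, plus yellowed paper in two shades (170 and 190).
ink = [40, 50, 60] * 10
paper = [170] * 50 + [190] * 50
page = ink + paper

fixed = binarize(page, 180)                 # cutoff tuned for clean white paper
auto = binarize(page, otsu_threshold(page))

# Count how many paper pixels wrongly end up black in each version.
print("fixed cutoff, paper turned black:", fixed[len(ink):].count(0))
print("Otsu cutoff, paper turned black:", auto[len(ink):].count(0))
```

With the fixed cutoff the darker paper shade gets flipped to black, which is roughly the kind of background junk you see on the eReader. Real pages need more than this (local thresholds, despeckling, etc.), which is why a dedicated tool is worth it.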

* * *

And they're always tweaking their workflows.

Like in December 2020, they rescanned/rereleased the entire "Computerworld" magazine from microfilm:

https://blog.archive.org/2020/12/30/...age-microfilm/

Microfilm scanning technology has gotten much better since the magazine was first digitized, so now a much higher quality release is available.

Quote:

Originally Posted by salamanderjuice

I also can't really blame them for automating this stuff, they just have way too much content and often it's the only place to get it on the web. I needed a chapter from some 70 year old niche book recently and they had it, only other option was a university library 6 hours away that was closed anyways due to COVID.

Like GrannyGrump's conversion of the original Sweeney Todd story: "The String of Pearls":

https://www.mobileread.com/forums/sh...d.php?t=299744
https://archive.org/details/stringof...e/n13/mode/2up

I think that book was locked away in Oxford University, one of the only copies left in the world, and it's not even available to the public.

Now because of Archive.org, the entire world can read it.

Quote:

Originally Posted by Quoth

The problem is that none of it is human curated or proofed. It's automated.

Yeah, but the scale is on a completely different level.

99.9999% accuracy on a few hundred (maybe thousands) of books per year on Gutenberg.

vs.

99% OCR accuracy on millions of books. (And all original source files are available.)
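To put those percentages in per-book terms, here is some back-of-the-envelope arithmetic. The 500,000 characters per book is purely my assumption for illustration, and so is treating accuracy as per-character.

```python
def expected_errors(chars_per_book: int, accuracy: float) -> float:
    """Expected number of wrong characters in one book at a given accuracy."""
    return chars_per_book * (1.0 - accuracy)


CHARS_PER_BOOK = 500_000  # assumed size of a typical book, for illustration only

proofed = expected_errors(CHARS_PER_BOOK, 0.999999)  # human-proofed text
raw_ocr = expected_errors(CHARS_PER_BOOK, 0.99)      # raw automated OCR

print(f"proofed: ~{proofed:.1f} errors per book")
print(f"raw OCR: ~{raw_ocr:.0f} errors per book")
```

So under those assumptions it's roughly half an error per book vs. about 5,000, which is the quality gap. The flip side is the number of books each approach can cover.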

And the scope is different too:

Sure, you get the nice ebooks (I really wish Gutenberg released the original PDFs though)...

But Archive.org is actually about making the works available/searchable. (NOT automating perfect ebooks. Those converted formats are just a side addition.)