Quote:
Originally Posted by ctop
I was somehow hoping that I could just clean the images without disturbing the text layer.
Yeah, that's the one disadvantage of Scan Tailor, it recreates/morphs the original text.
But if you're using it for personal copies, or as a pre-processor for more accurate OCR, it's great.
The nice thing about it is you can also do page-by-page adjustments and see how the final output will look. For example, the speckle cleanup is fantastic; you can see the diffs and adjust the strength if necessary.
Quote:
Originally Posted by ctop
I have been using scantailor (though not the advanced version, thanks for pointing that out) for books I scanned myself, and am quite pleased with the results.
The original is not maintained any more, while the other forks added lots of functionality (like better multi-threading; you can see the entire enhancement list on GitHub).
Scan Tailor Advanced combines the best functionality from all of them, and I believe it's the only one actively maintained.
Quote:
Originally Posted by ctop
So it seems what you are saying, it is best to throw away all the post-processing already done and start from the images.
Yes. Archive.org just runs a whole host of automated conversions... and I wouldn't use them if you can help it.
I usually just stick with their:
1. B&W PDF. Usually this is decent. In the case of this specific "yellowed book", it was crap.
2. Color PDF. This matches what they show in their online reader. Helpful if working with color, drawings, or "yellowed books". (You can do your own contrast/color corrections from this, and create a better grayscale/B&W version.)
3. As a last resort, work directly from the JPEG2000 images. These are the highest resolution/quality.
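As a rough sketch of that "do your own contrast/color corrections" step, here's how it might look with ImageMagick (the filename `page_0001.jp2` and the `-level` values are placeholders; assumes ImageMagick 7's `magick` command, older installs use `convert` instead):

```shell
# Take a color/JPEG2000 page and make a cleaned-up grayscale version:
# - convert to grayscale
# - stretch the levels to kill the yellowed background
# - remove small speckles
magick page_0001.jp2 \
    -colorspace Gray \
    -level 10%,90% \
    -despeckle \
    page_0001_gray.png

# Or go straight to black-and-white with a simple threshold:
magick page_0001.jp2 -colorspace Gray -threshold 60% page_0001_bw.png
```

The `-level` and `-threshold` percentages are very much per-book; a badly yellowed scan usually needs a more aggressive low end than a clean one.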
Do not touch their "EPUBs" or any of their other "ebook" formats (they are just automatically run through OCR, no proofing or anything). You're better off working from the source files and recreating your own OCR/ebooks from that.
Plus, if you have access to newer tools, you may get an even more accurate conversion (according to the metadata, FineReader 8 was used, whereas FineReader 12+ is probably more accurate).
PS. If you need me to run any images/PDFs (pre-processed or not) through FineReader 12, just let me know.
Quote:
Originally Posted by ctop
Sigh, with a GUI based program that is quite a lot of work...
You can always automate any pre-processing steps with ImageMagick. For example, I was working on a book with scanning artifacts that ran vertically through the text:
Detecting/Removing Vertical Scanlines from Scans
So it could be used to clean up the images, then run through further corrections/tools after.
But with ImageMagick... you'll have to spend time figuring out all the commands + recreating fixes that may already exist.
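To give an idea of what that automation looks like, here's a minimal batch loop (the directory names, `-deskew`/`-threshold` settings, and the choice of steps are all illustrative assumptions, not a recipe; every book needs its own tuning):

```shell
# Hypothetical pre-processing pass over a directory of page scans:
# deskew each page, remove speckles, convert to grayscale, and binarize.
mkdir -p cleaned
for f in pages/*.png; do
    magick "$f" \
        -deskew 40% \
        -despeckle \
        -colorspace Gray \
        -threshold 55% \
        "cleaned/$(basename "$f")"
done
```

Even a simple loop like this shows the trade-off: each flag replaces something Scan Tailor would do interactively, and you lose the per-page preview.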
For example, Scan Tailor already does a fantastic job of dewarping, detecting and cropping spines+edges-of-pages, [...].
If you go pure command-line ImageMagick... you'll have to figure out all those algorithms on your own. (Plus each book is going to have its own unique challenges.)