Optimize PDFs from archive.org for E-Ink devices

ctop · 02-25-2020, 11:28 PM

The Internet Archive at archive.org has a lot of interesting books for borrowing and downloading. I have some downloads of older books that are difficult to read on E-Ink devices because they include the background of the page, which has become yellow. So the contrast is low and the text becomes unclear; also, the files are quite big. So I wonder if somebody knows a good way to trim the PDFs for ereaders. I would prefer to use a command-line tool on a Linux-based system, if such a tool is available.
An example of the PDFs I am looking at is this:

https://archive.org/details/smtliche...ge/n8/mode/2up

This is the item page; the download link is here:

https://archive.org/download/smtlich...r16goet_bw.pdf

Any help appreciated, Ctop

Tex2002ans · 02-26-2020, 12:55 AM

Quote:

Originally Posted by ctop

[...] the background of the page, which has become yellow. So the contrast is low and the text becomes unclear, [...]

I would prefer to use a commandline on a Linux based system, if such a tool is available here.

GUI-based:

Scan Tailor Advanced:

https://github.com/4lex4/scantailor-advanced

There isn't another tool like it.

If you want commandline, then there's nothing better than ImageMagick, but you'll have to come up with all the tweaks yourself.

There was also "What’s your “image rehab” routine?" from 2013 which discussed some image cleanup ideas. Although that mostly focused on cleaning up images within scans.

Side Note: Archive.org's B&W versions are usually okay. In this case, it requires lots of manual intervention. Go back to the color PDF (or like GrannyGrump mentions in the thread above, use the original JPEG2000 files), and do all your cleaning from there.

This specific file also has a lot of bleed-through between the pages, which may make your job even harder when trying to darken the text.

Quote:

Originally Posted by ctop

also the files are quite big. So I wonder if somebody knows a good way to trim the PDFs for ereaders.

Scan Tailor Advanced should be able to do all the chopping/cropping/contrast adjustments for you. But if you need even more PDF tweaking beyond that, then there's k2pdfopt, by willus.

ctop · 02-26-2020, 02:52 AM

Quote:

Originally Posted by Tex2002ans

GUI-based:

Scan Tailor Advanced:

https://github.com/4lex4/scantailor-advanced

There isn't another tool like it.

Thanks. I was somehow hoping that I could just clean the images without disturbing the text layer. I have been using scantailor (though not the advanced version, thanks for pointing that out) for books I scanned myself, and am quite pleased with the results. So it seems what you are saying is that it is best to throw away all the post-processing already done and start from the images. Sigh, with a GUI-based program that is quite a lot of work...

All the best,
Ctop

doubleshuffle · 02-26-2020, 05:37 AM

Why not fix the epub and upload it to the MR library? Will be much nicer on your reader, and also a service to the community.

ctop · 02-26-2020, 06:56 AM

Quote:

Originally Posted by doubleshuffle

Why not fix the epub and upload it to the MR library? Will be much nicer on your reader, and also a service to the community.

I had not even thought about that. I will have a look and see if it can be done in a reasonable timeframe.

Ctop

doubleshuffle · 02-26-2020, 01:13 PM

It is a lot of work, no denying that. But your pdf-fixing efforts sound pretty complicated too, so that's what gave me the idea.

I only now had a look at the book you have in mind. That's huuuge, of course, and seriously a lot of work.

BTW, there's a very nice epub edition of Goethe's works in our library, provided by pynch. But I'm not sure if the scientific writings are complete in that one.

doubleshuffle · 02-26-2020, 01:18 PM

Just had a look at the txt file of the book - a very clean OCR result with surprisingly few errors. Fixing the epub may really be the way to go here.

Tex2002ans · 02-26-2020, 07:41 PM

Quote:

Originally Posted by ctop

I was somehow hoping that I could just clean the images without disturbing the text layer.

Yeah, that's the one disadvantage of Scan Tailor, it recreates/morphs the original text.

But if you're using it for personal copies, or a pre-processor for more accurate OCR, it's great.

The nice thing about it is you can also do page-by-page adjustments, and see how the final output will look. For example, speckle cleanup is fantastic, and you can see the diffs and adjust the strength if necessary.

Quote:

Originally Posted by ctop

I have been using scantailor (though not the advanced version, thanks for pointing that out) for books I scanned myself, and am quite pleased with the results.

The original is no longer maintained, while the forks added lots of functionality (like better multi-threading; you can see the entire enhancement list on GitHub).

Scan Tailor Advanced combines all the best functionality from all of them, and I believe it's the only one actively maintained.

Quote:

Originally Posted by ctop

So it seems what you are saying is that it is best to throw away all the post-processing already done and start from the images.

Yes. Archive.org just does a whole host of automated conversions... and I wouldn't use them if you can help it.

I usually just stick with their:

1. B&W PDF. Usually this is decent. In the case of this specific "yellowed book", it was crap.

2. Color PDF. This matches what they show in their online reader. Helpful if working with color, drawings, or "yellowed books". (You can do your own contrast/color corrections from this, and create a better grayscale/B&W version.)

3. As a last resort, work directly from the JPEG2000 images. These are the highest resolution/quality.
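To make "contrast/color corrections" a bit more concrete, here is a minimal pure-Python sketch of the two basic operations on a row of grayscale values: a linear level stretch, then a hard threshold to B&W. The black/white points (60 and 200), the threshold (128), and the sample row are made-up numbers for illustration, not values tuned for this book. Real tools apply the same idea per pixel across the whole page.

```python
def stretch_levels(pixels, black=60, white=200):
    """Linearly map gray values so `black` -> 0 and `white` -> 255, clipping the rest."""
    span = white - black
    out = []
    for p in pixels:
        scaled = (p - black) * 255 // span
        out.append(max(0, min(255, scaled)))
    return out


def to_bilevel(pixels, threshold=128):
    """Turn gray values into pure black (0) or pure white (255)."""
    return [255 if p >= threshold else 0 for p in pixels]


# Hypothetical row: dark text pixels (~70) on a yellowed background (~190 in gray).
row = [190, 185, 70, 75, 188, 192]
stretched = stretch_levels(row)
print(stretched)
print(to_bilevel(stretched))
```

The stretch alone keeps some gray for anti-aliased text edges; the threshold gives the smallest, highest-contrast result but is less forgiving on pages with bleed-through.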

Do not touch their "EPUBs" or any of their other "ebook" formats (they are just automatically run through OCR, no proofing or anything). You're better off working from the source files and recreating your own OCR/ebooks from that.

Plus, if you have access to newer tools, you may get an even more accurate conversion (according to the metadata, Finereader 8 was used, whereas Finereader 12+ is probably more accurate).

PS. If you need me to run any images/PDFs (pre-processed or not) through Finereader 12, just let me know.

Quote:

Originally Posted by ctop

Sigh, with a GUI based program that is quite a lot of work...

You can always automate any pre-processing steps with ImageMagick. For example, I was working on a book with scanning artifacts that ran vertically through the text:

Detecting/Removing Vertical Scanlines from Scans

So it could be used to clean up the images, then run through further corrections/tools after.

But with ImageMagick... you'll have to spend time figuring out all the commands + recreating fixes that may already exist.

For example, Scan Tailor already does a fantastic job of dewarping, detecting and cropping spines+edges-of-pages, [...].

If you go pure commandline ImageMagick... you'll have to figure out all those algorithms on your own. (Plus each book is going to have its own unique challenges.)
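As a rough starting point for that kind of automation, here is a minimal Python sketch that builds one ImageMagick command per page image: convert to grayscale, then stretch the levels. It assumes ImageMagick 7 is installed as `magick` (on ImageMagick 6, pass `binary="convert"`), that the pages are already extracted as PNGs, and that 20%/80% are placeholder black/white points you would tune per book. It does none of the dewarping or page-edge detection mentioned above.

```python
import shutil
import subprocess
from pathlib import Path


def build_cleanup_command(src, dst, black="20%", white="80%", binary="magick"):
    """Return an ImageMagick argument list: grayscale plus a level stretch.

    The black/white points are placeholder guesses, not tuned values.
    """
    return [
        binary,
        str(src),
        "-colorspace", "Gray",         # drop the yellow cast
        "-level", f"{black},{white}",  # push background toward white, text toward black
        str(dst),
    ]


def clean_folder(src_dir, dst_dir, binary="magick"):
    """Run the cleanup command on every PNG in src_dir; return the commands that ran."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for page in sorted(Path(src_dir).glob("*.png")):
        cmd = build_cleanup_command(page, dst_dir / page.name, binary=binary)
        subprocess.run(cmd, check=True)
        commands.append(cmd)
    return commands
```

Trying `build_cleanup_command` on two or three representative pages first, and only then calling `clean_folder` on the whole book, keeps the trial-and-error cheap.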

hobnail · 02-26-2020, 08:34 PM

Quote:

Originally Posted by doubleshuffle

Just had a look at the txt file of the book - a very clean OCR result with surprisingly few errors. Fixing the epub may really be the way to go here.

I've also done it using the txt file and depending on the quality of the scan and the original book it can be a painful amount of work.

Pajamaman · 02-26-2020, 10:50 PM

I suggest you try KOReader. It can do OCR and reflow on the fly, and it also has a contrast adjustment.

On another note, does anyone know a pdf tool that can ocr text that curves up at the end of a line as a result of the edge of a book page not being flat when scanned?

Tex2002ans · 02-27-2020, 01:34 AM

Quote:

Originally Posted by Pajamaman

On another note, does anyone know a pdf tool that can ocr text that curves up at the end of a line as a result of the edge of a book page not being flat when scanned?

You have to dewarp the images. Scan Tailor Advanced can do that.

Convert the PDF into PNG or TIFF images, run Scan Tailor on them, then go back to PDF.
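The round trip around Scan Tailor could be scripted roughly like this. The tools are assumptions, since none are named here: the sketch uses `pdftoppm` (from poppler-utils) to render pages to PNG and `img2pdf` to pack the processed images back into a PDF, and it assumes Scan Tailor wrote TIFF files. It only builds the argument lists; run them with `subprocess.run(cmd, check=True)` once they look right.

```python
from pathlib import Path


def pdf_to_png_command(pdf, out_prefix, dpi=300):
    """pdftoppm call that renders each page to <out_prefix>-<page number>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out_prefix)]


def images_to_pdf_command(image_dir, out_pdf, pattern="*.tif"):
    """img2pdf call that packs the processed page images back into one PDF."""
    pages = sorted(str(p) for p in Path(image_dir).glob(pattern))
    if not pages:
        raise ValueError(f"no images matching {pattern} in {image_dir}")
    return ["img2pdf", *pages, "-o", str(out_pdf)]
```

A PDF rebuilt this way is images only; it has no text layer until it is run through OCR again.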

doubleshuffle · 02-27-2020, 01:52 AM

Quote:

Originally Posted by Tex2002ans

Yes. Archive.org just does a whole host of automated conversions... and I wouldn't use them if you can help it.

I usually just stick with their:

1. B&W PDF. Usually this is decent. In the case of this specific "yellowed book", it was crap.

2. Color PDF. This matches what they show in their online reader. Helpful if working with color, drawings, or "yellowed books". (You can do your own contrast/color corrections from this, and create a better grayscale/B&W version.)

3. As a last resort, work directly from the JPEG2000 images. These are the highest resolution/quality.

Do not touch their "EPUBs" or any of their other "ebook" formats (they are just automatically run through OCR, no proofing or anything). You're better off working from the source files and recreating your own OCR/ebooks from that.

I always use the original image files and run them through ABBYY, but not everybody has that, and then working from the text or epub files at archive.org is an option. Especially when their OCR is as clean as in this case.

Quote:

Originally Posted by hobnail

I've also done it using the txt file and depending on the quality of the scan and the original book it can be a painful amount of work.

No denying this.

willus · 02-27-2020, 11:16 PM

Quote:

Originally Posted by ctop

The Internet Archive at archive.org has a lot of interesting books for borrowing and downloading. I have some downloads of older books that are difficult to read on E-Ink devices because they include the background of the page, which has become yellow. So the contrast is low and the text becomes unclear; also, the files are quite big. So I wonder if somebody knows a good way to trim the PDFs for ereaders. I would prefer to use a command-line tool on a Linux-based system, if such a tool is available.
An example of the PDFs I am looking at is this:

https://archive.org/details/smtliche...ge/n8/mode/2up

This is the item page; the download link is here:

https://archive.org/download/smtlich...r16goet_bw.pdf

Any help appreciated, Ctop

The k2pdfopt app fits most of what you want (e.g. command-line, linux). It has a thread here in the PDF forum on MR. The command-line options below worked pretty well with your link above:

k2pdfopt -mode fitwidth -bpc 2 -n- -ls- -ac example1.pdf

If you want to try it on just a few pages first, add something like:

-p 1-40

Example conversion of pages 30-39 is attached.

The only thing is that the file size of the converted PDF will be even bigger because the original is actually very well compressed (fitting 900 bitmapped pages into 30 MB is no small trick--it uses JPEG 2000 JPX compression, whereas k2pdfopt converts it to .png lossless compression, which is not as compact). I used -bpc 2 to get the converted file size down a little.
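For running the same conversion over several files, the command above can be wrapped in a few lines of Python. This is only a sketch: the flags are exactly the ones quoted in this post, while the function names and the optional `pages` argument are illustrative. Input is not redirected, so if k2pdfopt shows its interactive menu it will appear in the terminal as usual.

```python
import subprocess


def build_k2pdfopt_command(pdf, pages=None):
    """Argument list for the k2pdfopt call quoted above, optionally limited to a page range."""
    cmd = ["k2pdfopt", "-mode", "fitwidth", "-bpc", "2", "-n-", "-ls-", "-ac"]
    if pages:
        cmd += ["-p", pages]  # e.g. "1-40" for a quick trial run
    cmd.append(str(pdf))
    return cmd


def convert(pdf, pages=None):
    """Run k2pdfopt on one file; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_k2pdfopt_command(pdf, pages), check=True)
```

Calling `convert("example1.pdf", pages="1-40")` first gives a quick look at the result before committing to all 900 pages.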

Tex2002ans · 02-28-2020, 02:22 AM

Quote:

Originally Posted by willus

The k2pdfopt app fits most of what you want (e.g. command-line, linux). It has a thread here in the PDF forum on MR.

Fantastic work as always. Yes, if you wanted to keep it in PDF form... your tool is always best. :P

Quote:

Originally Posted by ctop

An example of the PDFs I am looking at is this:

https://archive.org/details/smtliche...ge/n8/mode/2up

This is the item page; the download link is here:

https://archive.org/download/smtlich...r16goet_bw.pdf

Any help appreciated

But if you want to take steps in making the PDF a proper ebook:

I grabbed this book and ran it through Scan Tailor Advanced + Finereader 12.

1. Finereader 12 did a MUCH better job with the color PDF's "yellowed pages", and had no issues creating a B&W version.

I attached it as [Finereader][BW].

(You can see how much better 12 converts compared to 8.)

Side Note: I manually erased markings from the first few pages, so they look pure white... just ignore that in your comparisons.

Note: Alternatively, you could've fed color images into Scan Tailor directly (it has 3 different methods to convert to B&W/Grayscale + you can mess with the thresholds).

2. I exported the B&W PDF into PNGs, then ran that through Scan Tailor Advanced.

I spent about an hour going through the various stages, and Scan Tailor did a FANTASTIC job at automatically picking all correct boxes. The page edges are nearly all gone.

I would say I didn't have to touch 95%+ of them at all.

Side Note: Despeckling + Outputting has gotten so much faster/better compared to how it used to be. And I only had to use Despeckling on a handful of pages to remove the occasional stray dots. (Being able to see the before/afters marked with red is an enormous help. This is one step where GUI beats the pants off of pure commandline.)

3. I took the Scan Tailor images, and reimported them into Finereader 12, ran OCR, and output as:

PDF = [ScanTailor][Finereader][BW].pdf. (30 MB is too large to attach, so here's a download.)
EPUB = [Finereader].epub.

You can compare the text, and see how much more accurate 12 is compared to Archive.org's "EPUB". (Most importantly, the headers+page numbers are nearly all automatically removed and not clogging the text.)

4. I took Finereader's EPUB and ran it through my usual "Finereader cleanup Regex":

Attached it as [Finereader][CodeCleanup].epub.
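The actual cleanup regexes are not shown in this post. Purely as an illustration of what such a pass can look like, and not the set used here, this Python sketch applies three common OCR text fixes: rejoining words hyphenated across a line break, collapsing runs of spaces, and stripping trailing whitespace. The function name and the patterns are assumptions.

```python
import re


def cleanup_ocr_text(text):
    """Illustrative OCR cleanup; NOT the regex set used in the post above."""
    # Rejoin words split by a hyphen at a line break: "Wissen-\nschaft" -> "Wissenschaft".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs left over from the page layout.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Strip trailing whitespace at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


print(cleanup_ocr_text("Wissen-\nschaft  ist   gut \n"))
```

The first pattern also merges genuine hyphenated compounds that happen to break at a line end, so the result still needs a proofreading pass.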

Comparison Images

Archive.org Color PDF + Finereader B&W + Scan Tailor Cleanup:

[Image: smtlichewer16goet.-.p16-17[Original.Color].jpg]

[Image: smtlichewer16goet.-.p16-17[Finereader].png]

[Image: smtlichewer16goet.-.p16-17[ScanTailor].png]

ctop · 02-28-2020, 06:52 AM

Quote:

Originally Posted by willus

The k2pdfopt app fits most of what you want (e.g. command-line, linux). It has a thread here in the PDF forum on MR. The command-line options below worked pretty well with your link above:

k2pdfopt -mode fitwidth -bpc 2 -n- -ls- -ac example1.pdf

If you want to try it on just a few pages first, add something like:

-p 1-40

Example conversion of pages 30-39 is attached.

Wow, this looks really great, exactly what I had in mind! Awesome! One question though, the file you created has the page breaks at different places than the original, which is astonishing. What is the reason for this?

And one more question, since I like to highlight things in my PDFs, is the text layer the same as before, or does k2pdfopt do its own OCR?

All the best,

Ctop
