Optimize PDFs from archive.org for E-Ink devices - Page 2

ctop · 02-28-2020, 06:52 AM

Quote:

Originally Posted by Tex2002ans

But if you want to take steps in making the PDF a proper ebook:

I grabbed this book and ran it through Scan Tailor Advanced + Finereader 12.

[...]

You can compare the text, and see how much more accurate 12 is compared to Archive.org's "EPUB". (Most importantly, the headers+page numbers are nearly all automatically removed and not clogging the text.)

4. I took Finereader's EPUB and ran it through my usual "Finereader cleanup Regex":

Attached it as [Finereader][CodeCleanup].epub.

Thank you, amazing work. This is now really a pleasure to read on my Ares. My takeaway is that it really pays to invest the time to use Scan Tailor. The removal of the page headers is especially great. Did you describe the regexes you are using somewhere?

Looking forward to your blog:-)

All the best,

Ctop

Tex2002ans · 02-28-2020, 05:33 PM

Quote:

Originally Posted by ctop

Thank you, amazing work. This is now really a pleasure to read on my Ares.

No problem.

Quote:

Originally Posted by ctop

My takeaway is that it really pays to invest the time to use Scan Tailor. The removal of the page headers is especially great.

It's pretty good.

Quote:

Originally Posted by ctop

Did you describe the regexes you are using somewhere?

No, not that I remember. (Probably a good post to stick on the blog; I'll add it to my notes.)

I haven't changed my "Finereader Regex" in years... it's really just taking ~12 of Finereader's inline styles:

Code:

<span style="font-style:italic;">
<span style="font-weight:bold;">
<span style="font-variant:small-caps;">
<span style="font-weight:bold;font-variant:small-caps;">
[...]

and changing them to CSS classes:

Code:

<span class="italics">
<span class="bold">
<span class="smallcaps">
<span class="smallcaps"> (smallcaps + bold is always an OCR/formatting error)
[...]

Then I just do an extra step to convert to:

Code:

<i>
<b>
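For anyone who wants to try something similar, here is a minimal sketch of that kind of cleanup in Python. It is not the actual "Finereader Regex": only the four styles listed above are covered (the rest of the ~12 are not shown in this post), it collapses the two steps into one pass, and the function name `clean_spans` is made up for illustration. It also assumes the spans are not nested inside each other.

```python
import re

# Styles that go straight to a plain tag (the final <i>/<b> step).
TAG_FOR_STYLE = {
    "font-style:italic;": "i",
    "font-weight:bold;": "b",
}

# Styles that become a CSS class. Bold + small caps is treated as plain
# small caps, since the post calls that combination an OCR/formatting error.
CLASS_FOR_STYLE = {
    "font-variant:small-caps;": "smallcaps",
    "font-weight:bold;font-variant:small-caps;": "smallcaps",
}

# Non-greedy match of one span with an inline style (assumes no nesting).
SPAN_RE = re.compile(r'<span style="([^"]*)">(.*?)</span>', re.DOTALL)


def clean_spans(html):
    """Replace known inline-style spans; leave unknown styles untouched."""
    def repl(match):
        style, inner = match.group(1), match.group(2)
        if style in TAG_FOR_STYLE:
            tag = TAG_FOR_STYLE[style]
            return f"<{tag}>{inner}</{tag}>"
        if style in CLASS_FOR_STYLE:
            return f'<span class="{CLASS_FOR_STYLE[style]}">{inner}</span>'
        return match.group(0)

    return SPAN_RE.sub(repl, html)
```

Unknown styles are left alone on purpose, so anything the table doesn't cover stays visible for a manual pass.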

I discussed some more of the process in:

2017 "Converting PDF/HTML to ereader formats"
2016 "Delete paragraphs in scanned books (S & R with regexes)"

It gives me a very clean base to work from, and then I can focus on the actual formatting/markup issues.

Quote:

Originally Posted by ctop

Looking forward to your blog:-)

Me too, me too.

I need to kick my butt into gear and get that blog up and running.

(That's one of my New Year's resolutions!)

Then I could just say "Here's everything I ever wrote on ImageMagick".

In January, I used it to clean up 3000+ pages of journal articles. There were scanning artifacts along the top/right edges, plus dirt/smudges in the bottom right corners:

[Image: 2_4_3-11[Orig].png]

So I used ImageMagick to:

1. Crop ### pixels from the edges.
2. Fill ### pixels with white.
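A rough sketch of how those two steps might be expressed as a single ImageMagick call, built from Python. The exact commands and pixel counts used on the journal articles are not given here, so every number below is a placeholder and the helper name `build_cleanup_command` is hypothetical. It assumes ImageMagick 7's `magick` binary (pass `program="convert"` for version 6) and that the page size is known up front. The function only builds the argument list; it does not run anything.

```python
def build_cleanup_command(src, dst, width, height, top, right,
                          corner_w, corner_h, program="magick"):
    """Build an ImageMagick argument list that trims the top/right edges
    and paints the bottom-right corner white.

    width/height are the ORIGINAL page size in pixels; all margins are
    placeholder values the caller has to choose per document.
    """
    # Size of the page after the top and right strips are removed.
    new_w = width - right
    new_h = height - top

    # Bottom-right box to paint over, in post-crop coordinates.
    x0 = new_w - corner_w
    y0 = new_h - corner_h
    rect = f"rectangle {x0},{y0} {new_w - 1},{new_h - 1}"

    return [
        program, src,
        "-gravity", "North", "-chop", f"0x{top}",    # drop rows from the top
        "-gravity", "East", "-chop", f"{right}x0",   # drop columns from the right
        "+gravity", "+repage",                       # reset gravity and canvas offset
        "-fill", "white", "-draw", rect,             # whiten the corner
        dst,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` once the margins look right on a few test pages. Fixed margins like these are exactly what can clip text on pages where the header or footer sits close to the edge, so compare before/after images.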

[Image: 2_4_3-11-crop.png]

But as I said, every PDF is going to bring unique challenges... a handful of random pages turned out like this:

[Image: 2_4_8-7[Orig].png]

and required further intervention. (Open the actual image and take a look. The MobileRead thumbnail looks okay, but you'll see the innards are actually multiple overlapping transparent boxes.)

And on other pages, the headers/footers were too close to the edge, so my solution clipped text. Without closely comparing the before/after, I would never have known certain text was missing. Again, this is why I stress that a GUI is helpful.

Side Note: Here was another ImageMagick thread from a few months ago:

Converting pdf to png images

where I showed how to remove speckles + crop (the original's white margins were absolutely immense).

willus · 02-28-2020, 09:48 PM

Quote:

Originally Posted by ctop

Wow, this looks really great, exactly what I had in mind! Awesome! One question though, the file you created has the page breaks at different places than the original, which is astonishing. What is the reason for this?

And one more question, since I like to highlight things in my PDFs, is the text layer the same as before, or does k2pdfopt do its own OCR?

All the best,

Ctop

The default behavior of k2pdfopt in "fitwidth" mode is to concatenate pages as it fits them into the converted PDF, and it disregards page breaks in the source document. You can add the -bp option to force a page break in the converted document wherever there is a page break in the source. There are other options that are better if you prefer to have a 1-to-1 source page to converted page correlation. The k2pdfopt options are documented here.

By default, k2pdfopt keeps the OCR layer from the source PDF, but it can also do its own OCR.

willus · 02-29-2020, 07:00 PM

Quote:

Originally Posted by willus

... There are other options that are better if you prefer to have a 1-to-1 source page to converted page correlation. The k2pdfopt options are documented here.

I am realizing that in the original post for this thread, there were two different versions of the PDF--the black and white one (which I processed with my original post in this thread), and the yellowed version. To process the yellowed version, I needed to jack up the contrast quite a bit. I've tweaked this set of options to give one output page per input page--simply cropping each input page:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf

The "-mode tm" does the one page per page bit. The "-cmax -8" applies a fixed contrast adjustment to each page (the default auto adjust isn't aggressive enough for this document; you could probably stand to go even higher than 8). The "-dr 2" bumps up the output resolution by a factor of two over what it would normally be for a Kindle Paperwhite. Example input and output attached.

willus · 02-29-2020, 07:07 PM

Quote:

Originally Posted by Tex2002ans

Fantastic work as always. Yes, if you wanted to keep it in PDF form... your tool is always best. :P

Hey, the OP asked for a command-line linux tool to trim a PDF. Rarely have I had a request so well matched to k2pdfopt!

BTW, k2pdfopt can drop out PNG files for each page rather than a PDF:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -n- -dr 2 -ac yellowed.pdf -o out.png

PS. Gave you (Tex2002ans) a shout out on my PDF Conversion web page.

j.p.s · 02-29-2020, 07:19 PM

Quote:

Originally Posted by willus

I am realizing that in the original post for this thread, there were two different versions of the PDF--the black and white one (which I processed with my original post in this thread), and the yellowed version. To process the yellowed version, I needed to jack up the contrast quite a bit. I've tweaked this set of options to give one output page per input page--simply cropping each input page:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf

The "-mode tm" does the one page per page bit. The "-cmax -8" applies a fixed contrast adjustment to each page (the default auto adjust isn't aggressive enough for this document; you could probably stand to go even higher than 8). The "-dr 2" bumps up the output resolution by a factor of two over what it would normally be for a Kindle Paperwhite. Example input and output attached.

One form of archive.org book files has a mask image for each page that can be used to make white regions of the page completely white. Have you ever used those to help clean up the page?

willus · 03-01-2020, 12:57 PM

Quote:

Originally Posted by j.p.s

One form of archive.org book files has a mask image for each page that can be used to make white regions of the page completely white. Have you ever used those to help clean up the page?

Interesting. No--I had not heard of this before.

j.p.s · 03-01-2020, 01:35 PM

Quote:

Originally Posted by j.p.s

One form of archive.org book files has a mask image for each page that can be used to make white regions of the page completely white. Have you ever used those to help clean up the page?

Quote:

Originally Posted by willus

Interesting. No--I had not heard of this before.

It's been a couple of years since I've worked with them, so I'm fuzzy on the details. I mentioned it in passing in post #3 in the thread
https://www.mobileread.com/forums/sh...d.php?t=178155
Some scripts working with the mask images are in the first attachment.

The images (of pages of text) in at least some archive.org PDF files are combinations of 2 PBM images and a PGM image. One of the PBM images is the mask. I discovered this when I ran a utility to extract images from a PDF and have no idea how the PDF standard addresses this or how PDF libraries and utilities make use of the mask images.
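To illustrate the idea only (this is not taken from the scripts in that attachment), here is a toy sketch of applying a 1-bit mask to a grayscale page once both have been decoded into rows of numbers. The mask polarity is an assumption: here a mask bit of 0 is taken to mean "background", which may well be the other way round in real archive.org files. The function name `whiten_background` is made up.

```python
def whiten_background(gray_rows, mask_rows, background_bit=0, white=255):
    """Return a copy of the grayscale page with masked-out pixels set to white.

    gray_rows: list of rows of 0-255 values (e.g. decoded from the PGM).
    mask_rows: list of rows of 0/1 values (e.g. decoded from the mask PBM).
    background_bit: which mask value marks background (ASSUMED to be 0).
    """
    return [
        [white if m == background_bit else g for g, m in zip(gray_row, mask_row)]
        for gray_row, mask_row in zip(gray_rows, mask_rows)
    ]
```

If the output comes out as a blank page with the text erased, the polarity guess was wrong and `background_bit=1` is the thing to try.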

Pajamaman · 03-05-2020, 12:22 PM

Quote:

Originally Posted by Tex2002ans

You have to dewarp the images. Scan Tailor Advanced can do that.

Convert the PDF into PNG or TIFF images, run Scan Tailor on them, then go back to PDF.

Thanks.

This Scan Tailor looks handy. Surprised I never heard of it before.

Tex2002ans · 03-05-2020, 04:05 PM

Quote:

Originally Posted by Pajamaman

This Scan Tailor looks handy. Surprised I never heard of it before.

The DIY Book Scanner forum is where a lot of discussion/support happens.

They've been around for a long time, similar to MobileRead, and focus on a lot of book scanning/cleanup. Tons of good information there.

And here's the "Scan Tailor Advanced" topic where 4lex4 posts.
