MobileRead Forums - View Single Post - Optimize PDFs from archive.org for E-Ink devices

Tex2002ans · 02-28-2020, 05:33 PM

Quote:

Originally Posted by ctop

Thank you, amazing work. This is now really a pleasure to read on my Ares.

No problem.

Quote:

Originally Posted by ctop

My takeaway is that it really pays to invest the time to use Scantailor. Especially the removal of the page headers is great.

It's pretty good.

Quote:

Originally Posted by ctop

Did you describe the regexes you are using somewhere?

No, not that I remember. (Probably a good post to stick on the blog. I'll add it to my notes.)

I haven't changed my "Finereader Regex" in years... it's really just taking ~12 of Finereader's inline styles:

Code:

<span style="font-style:italic;">
<span style="font-weight:bold;">
<span style="font-variant:small-caps;">
<span style="font-weight:bold;font-variant:small-caps;">
[...]

and changing them to CSS classes:

Code:

<span class="italics">
<span class="bold">
<span class="smallcaps">
<span class="smallcaps"> (smallcaps + bold is always an OCR/formatting error)
[...]

Then I just do an extra step to convert to:

Code:

<i>
<b>
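A minimal sketch of those two steps in Python. This is an illustration only: the post doesn't say which tool runs the regexes, the mapping covers just the four styles shown above (not the full ~12), and the function names `styles_to_classes` and `classes_to_tags` are hypothetical.

```python
import re

# Assumed mapping: only the four inline styles quoted in the post.
STYLE_TO_CLASS = {
    "font-style:italic;": "italics",
    "font-weight:bold;": "bold",
    "font-variant:small-caps;": "smallcaps",
    # smallcaps + bold is treated as an OCR error, so it collapses to smallcaps
    "font-weight:bold;font-variant:small-caps;": "smallcaps",
}
CLASS_TO_TAG = {"italics": "i", "bold": "b"}

SPAN_STYLE = re.compile(r'<span style="([^"]*)">')
# Matches an italics/bold span whose content contains no other span tags,
# so the innermost spans are converted first.
SIMPLE_SPAN = re.compile(
    r'<span class="(italics|bold)">((?:(?!</?span\b).)*?)</span>', re.DOTALL
)


def styles_to_classes(html):
    """Step 1: replace known inline styles with class names."""
    def swap(match):
        css_class = STYLE_TO_CLASS.get(match.group(1))
        if css_class is None:
            return match.group(0)  # leave unknown styles untouched
        return f'<span class="{css_class}">'
    return SPAN_STYLE.sub(swap, html)


def classes_to_tags(html):
    """Step 2: turn italics/bold class spans into <i>/<b>."""
    def swap(match):
        tag = CLASS_TO_TAG[match.group(1)]
        return f"<{tag}>{match.group(2)}</{tag}>"
    previous = None
    while previous != html:  # repeat so outer spans convert after inner ones
        previous = html
        html = SIMPLE_SPAN.sub(swap, html)
    return html


sample = (
    '<p><span style="font-style:italic;">Title</span> and '
    '<span style="font-weight:bold;font-variant:small-caps;">Name</span></p>'
)
cleaned = classes_to_tags(styles_to_classes(sample))
print(cleaned)
```

Spans that wrap a span of another class (say, smallcaps inside italics) are deliberately left alone here, since a regex can't pair up the closing tags reliably. Those are better fixed by hand.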

I discussed some more of the process in:

2017 "Converting PDF/HTML to ereader formats"
2016 "Delete paragraphs in scanned books (S & R with regexes)"

It gives me a very clean base to work from, and then I can focus on the actual formatting/markup issues.

Quote:

Originally Posted by ctop

Looking forward to your blog:-)

Me too, me too.

I need to kick my butt into gear and get that blog up and running.

(That's one of my New Year's resolutions!)

Then I could just say "Here's everything I ever wrote on ImageMagick".

In January, I used it to clean up 3000+ pages of journal articles. There were scanning artifacts along the top/right edges, plus dirt/smudges in the bottom right corners:

[Image: 2_4_3-11[Orig].png]

So I used ImageMagick to:

1. Crop ### pixels from the edges.
2. Fill ### pixels with white.
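The two steps above could be sketched roughly like this. The Python helper only builds an ImageMagick argument list; it is not the actual command used for these scans. The pixel counts are elided in the post, so every number in the example call is a made-up placeholder, and `build_cleanup_command` is a hypothetical name. It assumes ImageMagick 7's `magick` executable (on ImageMagick 6, pass `executable="convert"`) and that the page's pixel dimensions are already known.

```python
def build_cleanup_command(src, dst, width, height,
                          trim_top, trim_right,
                          patch_w, patch_h,
                          executable="magick"):
    """Build an ImageMagick command that trims the top/right edges of a
    page and paints a white rectangle over the bottom-right corner."""
    # Size of the page after trimming.
    new_w = width - trim_right
    new_h = height - trim_top
    # -crop WxH+X+Y keeps a WxH window starting at (X, Y): start at the
    # left edge, trim_top pixels down, dropping the top and right strips.
    crop = f"{new_w}x{new_h}+0+{trim_top}"
    # Corner patch, in coordinates of the cropped page.
    x0 = new_w - patch_w
    y0 = new_h - patch_h
    patch = f"rectangle {x0},{y0} {new_w - 1},{new_h - 1}"
    return [
        executable, src,
        "-crop", crop, "+repage",   # +repage resets the canvas offset
        "-fill", "white", "-draw", patch,
        dst,
    ]


# Placeholder numbers only; real values depend on the scan.
cmd = build_cleanup_command("page.png", "page-clean.png",
                            width=1000, height=1500,
                            trim_top=40, trim_right=30,
                            patch_w=120, patch_h=100)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run it on one page. It's worth spot-checking the output against the original, since a fixed crop can eat into text on pages where it sits close to the edge.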

[Image: 2_4_3-11-crop.png]

But as I said, every PDF is going to bring unique challenges... a handful of random pages turned out like this:

[Image: 2_4_8-7[Orig].png]

and required further intervention. (Open the actual image and take a look. The MobileRead thumbnail looks okay, but you'll see the innards are actually multiple overlapping transparent boxes.)

And on other pages, the headers/footers were too close to the edge, so my solution clipped text. Without comparing the before/after closely, I would never have known certain text was clipped/missing. Again, this is why I stress that a GUI is helpful.

Side Note: Here was another ImageMagick thread from a few months ago:

Converting pdf to png images

where I showed how to remove speckles + crop (the original's white margins were absolutely immense).