Has anyone found a way to view PDF documents from archive.org?
I have no problems with other scanned PDFs that are 10x the size, but even 10MB archive.org PDFs are impossibly slow to render (> 30s/page).
I tried the dejazap.js tool from the GitHub issue mentioned by @MaxStirner, but it does nothing and doesn't even change the document size. I tried both Mask and SMask.
The original link is dead, but the file is probably the one found here:
https://ghostscript.com/~tor/stuff/
I had to replace DeviceGray with mupdf.ColorSpace.DeviceGray to make it run. The script is attached below:
Spoiler:
This script is supposed to remove the foreground and background images. It does not do anything for me.
Run with 'mutool run dejazap.js scourgeormonthly01crui.pdf out.pdf'
mutool comes with MuPDF.
Code:
// Extract the image masks from DjVu-like PDF files and create a new monochrome
// PDF from them.
//
// This assumes that each page consists of three full page images:
// * A full color background image.
// * A full color foreground image.
// * A black and white selection mask.
//
// The background image typically holds the white page color, the foreground
// image holds the ink color, and the mask selects whether the foreground ink
// or background paper shows for a given pixel.
//
// This allows the background and foreground images to be encoded with an
// algorithm where the compressor can ignore the foreground ink pixels when
// compressing the background image, and vice versa, accomplishing much higher
// compression ratios since all the high-frequency data is moved to the
// selection mask which is compressed using a black&white algorithm.
//
// Typically these files are created with JPEG2000 compression for the full
// color images, which is very slow to decompress. The selection mask is then
// compressed with JBIG2 which is also quite slow.
//
// If we create a new PDF file containing only the selection masks drawn as
// monochrome images, we can usually render these files much faster, and they
// look nicer since the muddy colors are removed and the text is nice and
// crisp.
//
// There is of course the danger of losing actual color images in the file!
if (scriptArgs.length < 2) {
    print("usage: mutool run dejavu.js input.pdf output.pdf");
    quit();
}

var bgPix = new Pixmap(mupdf.ColorSpace.DeviceGray, [0,0,1,1], false);
var fgPix = new Pixmap(mupdf.ColorSpace.DeviceGray, [0,0,1,1], false);
bgPix.clear(0);
fgPix.clear(255);

var doc = new PDFDocument(scriptArgs[0]);
var bgImg = doc.addImage(new Image(bgPix));

for (var i = 0; i < doc.countPages(); ++i) {
    var page = doc.findPage(i);
    page.Resources.XObject.forEach(function (name, xobj) {
        // var mask = xobj.Mask;
        var mask = xobj.SMask;
        if (mask) {
            var fgImg = doc.addImage(new Image(fgPix, doc.loadImage(mask)));
            page.Resources.XObject[name] = fgImg;
        } else {
            page.Resources.XObject[name] = bgImg;
        }
    });
}

doc.save(scriptArgs[1], "garbage=compact,compress");
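
For anyone trying to follow what the header comment describes, here is a rough sketch of the three-layer idea in plain JavaScript. It makes no MuPDF calls and is not part of the script above: small arrays stand in for page images, and the names compositeLayers and maskToMonochrome are made up for illustration. The polarity (mask bit 1 = ink, gray 0 = black, 255 = white) is my assumption and may be inverted in real files.

```javascript
// Illustrative only: plain arrays stand in for images; no MuPDF calls.
// Assumed convention: mask[i] === 1 selects foreground ink,
// mask[i] === 0 selects background paper.
function compositeLayers(background, foreground, mask) {
    if (background.length !== mask.length || foreground.length !== mask.length) {
        throw new Error("all three layers must have the same number of pixels");
    }
    // For each pixel, the mask decides which layer shows through.
    return mask.map(function (bit, i) {
        return bit ? foreground[i] : background[i];
    });
}

// What the script is aiming for: drop both color layers and keep only
// the mask, drawn as black ink (0) on white paper (255) in 8-bit gray.
function maskToMonochrome(mask) {
    return mask.map(function (bit) {
        return bit ? 0 : 255;
    });
}

var background = [250, 248, 251, 249]; // muddy off-white paper
var foreground = [30, 35, 28, 33];     // muddy dark ink
var mask = [0, 1, 1, 0];               // crisp 1-bit selection

var composed = compositeLayers(background, foreground, mask);
var mono = maskToMonochrome(mask);

console.log(composed); // [ 250, 35, 28, 249 ]
console.log(mono);     // [ 255, 0, 0, 255 ]
```

The point is that all the sharp detail lives in the mask, so a page rebuilt from the mask alone keeps the text and skips decoding the two color layers.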