11-17-2024, 10:03 PM   #8
DanCa
Member
 
Has anyone found a solution to view pdf documents from archive.org?

I have no problems with other scanned PDFs that are 10x the size, but even 10 MB archive.org PDFs are impossibly slow to render (> 30 s/page).

I tried the dejazap.js tool from the GitHub issue mentioned by @MaxStirner, but it does nothing and doesn't even change the document size. I tried both Mask and SMask.
The original link is dead, but the file is probably the one found here: https://ghostscript.com/~tor/stuff/. I had to replace DeviceGray with mupdf.ColorSpace.DeviceGray to make it run. The script is attached below:

Spoiler:

This script is supposed to remove the foreground and background images, but it does not do anything for me.
Run it with 'mutool run dejazap.js scourgeormonthly01crui.pdf out.pdf'.
mutool comes with MuPDF.
Code:
// Extract the image masks from DjVu-like PDF files and create a new monochrome
// PDF from them.
//
// This assumes that each page consists of three full page images:
//   * A full color background image.
//   * A full color foreground image.
//   * A black and white selection mask.
//
// The background image typically holds the white page color, the foreground
// image holds the ink color, and the mask selects whether the foreground ink
// or background paper shows for a given pixel.
//
// This allows the background and foreground images to be encoded with an
// algorithm where the compressor can ignore the foreground ink pixels when
// compressing the background image, and vice versa, accomplishing much higher
// compression ratios since all the high-frequency data is moved to the
// selection mask which is compressed using a black&white algorithm.
//
// Typically these files are created with JPEG2000 compression for the full
// color images, which is very slow to decompress. The selection mask is then
// compressed with JBIG2 which is also quite slow.
//
// If we create a new PDF file containing only the selection masks drawn as
// monochrome images, we can usually render these files much faster, and they
// look nicer since the muddy colors are removed and the text is nice and
// crisp.
//
// There is of course the danger of losing actual color images in the file!

if (scriptArgs.length < 2) {
	print("usage: mutool run dejazap.js input.pdf output.pdf");
	quit();
}

var bgPix = new Pixmap(mupdf.ColorSpace.DeviceGray, [0,0,1,1], false);
var fgPix = new Pixmap(mupdf.ColorSpace.DeviceGray, [0,0,1,1], false);
bgPix.clear(0);
fgPix.clear(255);

var doc = new PDFDocument(scriptArgs[0]);
var bgImg = doc.addImage(new Image(bgPix));
for (var i = 0; i < doc.countPages(); ++i) {
	var page = doc.findPage(i);
	page.Resources.XObject.forEach(function (name, xobj) {
		// var mask = xobj.Mask;
		var mask = xobj.SMask;
		if (mask) {
			var fgImg = doc.addImage(new Image(fgPix, doc.loadImage(mask)));
			page.Resources.XObject[name] = fgImg;
		} else {
			page.Resources.XObject[name] = bgImg;
		}
	});
}
doc.save(scriptArgs[1], "garbage=compact,compress");
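
For anyone trying to follow what the script is aiming for, here is a rough sketch of the three-layer model described in its header comment. It uses plain arrays in place of real images, so it is only an illustration and not MuPDF code. The names composite and maskToMonochrome and the gray values are made up for the example.

```javascript
// Illustrative only: plain arrays stand in for decoded images.
// composite() and maskToMonochrome() are hypothetical names, not MuPDF calls.

// Pick the foreground pixel where the mask is set, the background pixel elsewhere.
function composite(background, foreground, mask) {
	if (background.length !== mask.length || foreground.length !== mask.length) {
		throw new Error("all three layers must have the same number of pixels");
	}
	var out = [];
	for (var i = 0; i < mask.length; ++i) {
		out.push(mask[i] ? foreground[i] : background[i]);
	}
	return out;
}

// Keep only the mask: ink (0) where the mask is set, paper (255) elsewhere.
function maskToMonochrome(mask) {
	return composite(
		mask.map(function () { return 255; }),
		mask.map(function () { return 0; }),
		mask
	);
}

// One row of four gray pixels, 0 = black and 255 = white (assumed values).
var background = [250, 248, 251, 249]; // slightly uneven paper
var foreground = [30, 35, 28, 40];     // muddy ink
var mask = [0, 1, 1, 0];               // 1 = ink shows here

var original = composite(background, foreground, mask);
var cleaned = maskToMonochrome(mask);
```

Here original mixes the two color layers through the mask, while cleaned keeps only the mask as pure black and white, which is the effect the script is meant to have on each page.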