KOReader handles Internet Archive books poorly

MaxStirner · 10-19-2022, 11:04 AM

And it does not seem to be device-specific. It doesn't matter whether it is a flagship phone, a Kobo Aura, or a Kindle PW 4: it hangs or crashes pretty much every time I open one of these books. Can anything be done about it?

pazos · 10-20-2022, 07:05 AM

Quote:

Originally Posted by MaxStirner

And it does not seem to be device-specific. It doesn't matter whether it is a flagship phone, a Kobo Aura, or a Kindle PW 4: it hangs or crashes pretty much every time I open one of these books. Can anything be done about it?

"Internet Archive books" is not a MIME type the app understands. I'm assuming you're talking about EPUBs.

In that case, please post a link here to one of the files that make the app hang or crash. Since you're talking about IA, I'm assuming you're downloading books in the public domain.

Most probably they are broken documents, but it is always interesting to learn from somebody else's errors.

MaxStirner · 10-20-2022, 04:19 PM

No, I am thinking about PDFs. And even if KOReader does not ultimately crash or hang, it takes ages to render a page.

JSWolf · 10-20-2022, 04:20 PM

Quote:

Originally Posted by MaxStirner

No, I am thinking about PDFs. And even if KOReader does not ultimately crash or hang, it takes ages to render a page.

Aren't Internet Archive PDFs just images? If so, then that's why they are so slow.

MaxStirner · 10-20-2022, 04:32 PM

Quote:

Originally Posted by JSWolf

Aren't Internet Archive PDFs just images? If so, then that's why they are so slow.

OK, but then my reasoning is this: even if e-ink devices do not have the resources to manage such files (not enough RAM, too weak a processor, etc.), could the process somehow be made faster on other devices like tablets or phones? They have tons of memory and should be able to deal with that.

pazos · 10-20-2022, 05:10 PM

Ok, so broken documents

Use any PDF reader based on PDFium (like anything based on Chrome, or almost anything on Android). They can help because they spawn multiple threads to render a single document, and can handle multiple documents by spawning multiple processes (each one with multiple threads).

That doesn't fix the nature of the documents. They will still be broken, and still slow to navigate or to jump between pages.

If you want to read them using KOReader, your best bet is to convert them to DjVu. Or just reprint them with Ghostscript, tweaking some parameters. Or maybe there's a tool that's able to fix the utterly big images in them automagically or, at least, fix/convert the color space.

MaxStirner · 10-21-2022, 01:52 PM

Yes, I did a quick search of the KOReader issues. It turns out someone has already noticed the problem and has a possible solution, so I'm happy to see that I am not alone. Too bad that it looks like the issue is frozen:
https://github.com/koreader/koreader/issues/7992

DanCa · 11-17-2024, 10:03 PM

Has anyone found a solution to view pdf documents from archive.org?

I have no problems with other scanned PDFs that are 10x the size, but even 10 MB archive.org PDFs are impossibly slow to render (> 30s/page).

I tried the dejazap.js tool from the GitHub issue mentioned by @MaxStirner, but it does nothing and doesn't even change the document size. I tried both Mask and SMask.
The original link is dead, but the file is probably this one: https://ghostscript.com/~tor/stuff/. I had to replace DeviceGray with mupdf.ColorSpace.DeviceGray to make it run. The script is attached below:

Spoiler:

This script is supposed to remove foreground and background images. Does not do anything for me.
Run with 'mutool run dejazap.js scourgeormonthly01crui.pdf out.pdf'
mutool comes with MuPDF.

Code:

// Extract the image masks from DjVu-like PDF files and create a new monochrome
// PDF from them.
//
// This assumes that each page consists of three full page images:
//   * A full color background image.
//   * A full color foreground image.
//   * A black and white selection mask.
//
// The background image typically holds the white page color, the foreground
// image holds the ink color, and the mask selects whether the foreground ink
// or background paper shows for a given pixel.
//
// This allows the background and foreground images to be encoded with an
// algorithm where the compressor can ignore the foreground ink pixels when
// compressing the background image, and vice versa, accomplishing much higher
// compression ratios since all the high-frequency data is moved to the
// selection mask which is compressed using a black&white algorithm.
//
// Typically these files are created with JPEG2000 compression for the full
// color images, which is very slow to decompress. The selection mask is then
// compressed with JBIG2 which is also quite slow.
//
// If we create a new PDF file containing only the selection masks drawn as
// monochrome images, we can usually render these files much faster, and they
// look nicer since the muddy colors are removed and the text is nice and
// crisp.
//
// There is of course the danger of losing actual color images in the file!

if (scriptArgs.length < 2) {
	print("usage: mutool run dejavu.js input.pdf output.pdf");
	quit();
}

var bgPix = new Pixmap(mupdf.ColorSpace.DeviceGray, [0,0,1,1], false);
var fgPix = new Pixmap(mupdf.ColorSpace.DeviceGray, [0,0,1,1], false);
bgPix.clear(0);
fgPix.clear(255);

var doc = new PDFDocument(scriptArgs[0]);
var bgImg = doc.addImage(new Image(bgPix));
for (var i = 0; i < doc.countPages(); ++i) {
	var page = doc.findPage(i);
	page.Resources.XObject.forEach(function (name, xobj) {
		// var mask = xobj.Mask;
		var mask = xobj.SMask;
		if (mask) {
			var fgImg = doc.addImage(new Image(fgPix, doc.loadImage(mask)));
			page.Resources.XObject[name] = fgImg;
		} else {
			page.Resources.XObject[name] = bgImg;
		}
	});
}
doc.save(scriptArgs[1], "garbage=compact,compress");

jonnyl · 11-18-2024, 07:39 AM

I read lots of PDFs from archive.org. I always pre-process them using K2pdfopt (https://www.willus.com/k2pdfopt/). File size will usually be larger, but they will load quickly and in a format more friendly for ereader screens. There's a bit of a learning curve though to get optimal parameters (depends on source document, target device screen size, and your preferences).

nezih · 11-18-2024, 09:22 AM

I think the problem is related to the image encoding of Archive.org PDF files. KOReader, or rather MuPDF, chokes on JPEG2000-encoded PDF files. Similar report from Sumatra: https://github.com/sumatrapdfreader/...df/issues/1922

One can export every page to PNG and then recombine all the files into a PDF, or use FineReader to accomplish this multi-step task.

Frenzie · 11-18-2024, 01:25 PM

Quote:

Has anyone found a solution to view pdf documents from archive.org?

Where it's available, download the DjVu instead if you intend to use it on an ereader.

Quote:

I think the problem is related to the image encoding of Archive.org PDF files. KOReader, or rather MuPDF, chokes on JPEG2000-encoded PDF files.

It's also simply a gigantic image.

DanCa · 11-18-2024, 08:38 PM

So here are my results so far:

- I have not managed to get anything useful out of the mutool script. This seems to be the way to go: just replace the layer. pdfimages -list clearly shows the 3 layers, 2 rgb and one gray, but the script in the two versions does absolutely nothing.

- Acrobat Pro does not detect any background images or layers.

These are the file sizes, render times for the different approaches:

* Original file, 11 MB, 1min/page or crash
* Original file, printed through ClawPDF driver, 172 MB, 1s/page
* Original file, printed through Windows save as pdf, 100 MB, 1s/page
* Original file run through k2pdfopt, default options or with -colorbg ffffff (which doesn't do anything), 72 MB, 1s/page + nasty dot pattern
* djvu pdf2djvu --monochrome, 52 MB, 1s/page, nasty dithering artifacts
* djvu pdf2djvu, 15 MB, 3s/page
* djvu pdf2djvu.com, 11 MB, 40s/page
* mutool version, 11 MB, 1min/page or crash
* no djvu available on archive.org

It's scary that a 172 MB pdf from ClawPDF renders much faster than the 11MB archive.org original.

Obviously none of the methods above removed the background from the scan.
I've found some brute force methods using ImageMagick [1], but that seems like the wrong approach. The pdf is already split into layers with OCR'ed text, it doesn't make sense to me to flatten the pdf, save each page as an individual image, use a tool to try to separate the background from the text, do OCR, and then put everything back together. All of this to deal with lousy archive.org pdfs.

[1] https://old.reddit.com/r/kindlescrib...slg/?context=3

Frenzie · 11-20-2024, 06:56 AM

Quote:

but the script in the two versions does absolutely nothing.

Define nothing. It comes out black for me, which isn't nothing.

Swapping the colors in the .clear() commands and changing the mask to SMask takes care of that, and it is indeed a lot faster. Getting rid of the paper also improves render quality on eink.

Code:

if (scriptArgs.length < 2) {
	print("usage: mutool run dejavu.js input.pdf output.pdf");
	quit();
}

var bgPix = new Pixmap(DeviceGray, [0,0,1,1], false);
var fgPix = new Pixmap(DeviceGray, [0,0,1,1], false);
bgPix.clear(255);
fgPix.clear(0);

var doc = new PDFDocument(scriptArgs[0]);
var bgImg = doc.addImage(new Image(bgPix));
for (var i = 0; i < doc.countPages(); ++i) {
	var page = doc.findPage(i);
	page.Resources.XObject.forEach(function (name, xobj) {
		var mask = xobj.SMask;
		if (mask) {
			var fgImg = doc.addImage(new Image(fgPix, doc.loadImage(mask)));
			page.Resources.XObject[name] = fgImg;
		} else {
			page.Resources.XObject[name] = bgImg;
		}
	});
}
doc.save(scriptArgs[1], "garbage=compact,compress");
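
As a rough illustration of why swapping the two .clear() values matters, here is a toy model of the mask compositing in plain JavaScript (runnable with Node.js, not with mutool). The `composite` helper is hypothetical and not part of the MuPDF API; it assumes gray 0 is black, 255 is white, and that a mask bit of 1 means the foreground (ink) shows through.

```javascript
// Hypothetical helper (not a MuPDF call): composite one row of pixels.
// mask[i] === 1 selects the foreground gray, 0 selects the background gray.
function composite(mask, fgGray, bgGray) {
	return mask.map(function (bit) {
		return bit ? fgGray : bgGray;
	});
}

// A tiny one-row "page": paper, ink, ink, paper.
var maskRow = [0, 1, 1, 0];

// Values from the original script: fg cleared to 255, bg cleared to 0.
var original = composite(maskRow, 255, 0);

// Values from the modified script: fg cleared to 0, bg cleared to 255.
var swapped = composite(maskRow, 0, 255);

console.log(original.join(" "));
console.log(swapped.join(" "));
```

Under this model the original values give white marks on a black page, while the swapped values give black text on white paper.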

DanCa · 11-21-2024, 12:55 AM

Quote:

Originally Posted by Frenzie

Define nothing. It comes out black for me, which isn't nothing.

For me the converted file looks exactly the same and is 16 kB larger (out of 11 MB).

Which version of mutool did you use?

I tried 1.23.10+ds1-1build3 on Linux and 1.23.0 on Windows. They create slightly different files with no visible difference.

I had to replace DeviceGray with mupdf.ColorSpace.DeviceGray for both.

Maybe my PDF is different. I downloaded it a long time ago and can't find the original anymore. I'll see if I can find a publicly available PDF for comparison purposes.

pdfimages shows the two images and the mask layer:

page num type  width height color comp bpc enc   interp object ID x-ppi y-ppi size  ratio
1    0   image 1816  2925   rgb   3    8   jpx   no     498    0  360   360   72.7K 0.5%
1    1   image 1816  2925   rgb   3    8   jpx   no     500    0  360   360   8067B 0.1%
1    2   smask 1816  2925   gray  1    1   jbig2 no     500    0  360   360   47.3K 7.3%
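
A back-of-the-envelope estimate of how much pixel data those three layers expand to, using the dimensions from the pdfimages listing above. This is only a sketch in plain JavaScript (Node.js): `decodedBytes` is a hypothetical helper, and it assumes each layer is held as a raw, uncompressed buffer with rows padded to whole bytes, which may not match what MuPDF actually allocates.

```javascript
// Hypothetical helper: bytes needed to hold one fully decoded image.
function decodedBytes(width, height, components, bitsPerComponent) {
	var bitsPerRow = width * components * bitsPerComponent;
	var bytesPerRow = Math.ceil(bitsPerRow / 8); // pad each row to whole bytes
	return bytesPerRow * height;
}

// Width, height, comp and bpc taken from the pdfimages rows.
var rgbLayer = decodedBytes(1816, 2925, 3, 8);  // each of the two jpx layers
var maskLayer = decodedBytes(1816, 2925, 1, 1); // the 1-bit jbig2 smask
var pageTotal = 2 * rgbLayer + maskLayer;

console.log("RGB layer bytes:  " + rgbLayer);
console.log("Mask layer bytes: " + maskLayer);
console.log("Page total bytes: " + pageTotal);
```

Under these assumptions, one page expands from roughly 128K of compressed data to about 32.5 MB of decoded samples, before any scaling to the screen.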

Frenzie · 11-21-2024, 05:36 AM

The script I posted works in 1.21; the original version is likely intended for 1.19.
