#1 |
Connoisseur
Posts: 71
Karma: 18500
Join Date: Apr 2013
Device: Kindle Touch, Paperwhite
|
KOReader is poor at handling Internet Archive books
And it doesn't seem to be device-specific. It doesn't matter whether it is a flagship phone, a Kobo Aura, or a Kindle PW4: it hangs or crashes pretty much every time I open one of these books. Can anything be done about it?
Last edited by MaxStirner; 10-19-2022 at 11:58 AM. |
#2 | |
cosiñeiro
Posts: 1,406
Karma: 2451781
Join Date: Apr 2014
Device: BQ Cervantes 4
|
Quote:
In that case please put a link here pointing to one of the files that make the app hang/crash. Since you're talking about IA I'm assuming you're downloading books in the public domain. Most probably they are broken documents, but it is always interesting to learn from somebody else's errors.
|
#3 |
Connoisseur
Posts: 71
Karma: 18500
Join Date: Apr 2013
Device: Kindle Touch, Paperwhite
|
No, I am thinking about PDFs. And even if KOReader does not ultimately crash or hang, it takes ages to render a page.
|
#4 |
Resident Curmudgeon
Posts: 79,708
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
|
#5 |
Connoisseur
Posts: 71
Karma: 18500
Join Date: Apr 2013
Device: Kindle Touch, Paperwhite
|
Ok, but then my reasoning is this: even if e-ink devices do not have the resources to manage such files (not enough RAM, too weak a processor, etc.), could the process somehow be made faster on other devices like tablets or phones? They have tons of memory and should be able to deal with that.
|
#6 |
cosiñeiro
Posts: 1,406
Karma: 2451781
Join Date: Apr 2014
Device: BQ Cervantes 4
|
Ok, so broken documents.
Use any PDF reader based on PDFium (anything based on Chrome, or almost anything on Android). These can help because they spawn multiple threads to render a single document, and they handle multiple documents by spawning multiple processes (each one with multiple threads). That doesn't fix the nature of the documents: they will still be broken, and still be slow to navigate or to jump between pages. If you want to read them using KOReader, your best bet is to convert them to DjVu, or to reprint them with Ghostscript, tweaking some parameters. Or maybe there's a tool that's able to automagically fix the oversized images in them or, at least, fix/convert the color space.
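To make the Ghostscript "reprint" idea a bit more concrete, here is a minimal sketch that only assembles a `gs` command line. The function name `buildGhostscriptArgs`, the file names, the `/ebook` preset and the grayscale conversion are all assumptions picked for illustration; nobody in this thread has confirmed which parameters actually help with archive.org files.

```javascript
// Sketch only: builds a Ghostscript command line that re-renders ("reprints")
// a PDF through the pdfwrite device. The option values are assumptions chosen
// for illustration, not settings recommended anywhere in this thread.
function buildGhostscriptArgs(inputPath, outputPath, options) {
  var opts = options || {};
  var args = [
    "-sDEVICE=pdfwrite", // write a new PDF instead of rendering to a display
    "-dNOPAUSE",         // do not pause between pages
    "-dBATCH",           // exit once the input has been processed
    "-dPDFSETTINGS=" + (opts.preset || "/ebook") // image downsampling preset
  ];
  if (opts.gray) {
    // Convert colour content to grayscale, which is enough for e-ink screens.
    args.push("-sColorConversionStrategy=Gray");
    args.push("-dProcessColorModel=/DeviceGray");
  }
  args.push("-sOutputFile=" + outputPath);
  args.push(inputPath);
  return args;
}

// Hypothetical file names, for illustration only.
var gsArgs = buildGhostscriptArgs("scan.pdf", "scan-gs.pdf", { gray: true });
console.log("gs " + gsArgs.join(" "));
```

The printed line can be pasted into a shell, or the array can be handed to a process launcher. Expect to experiment with the preset and colour options per document.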
#7 |
Connoisseur
Posts: 71
Karma: 18500
Join Date: Apr 2013
Device: Kindle Touch, Paperwhite
|
Yes, I did a quick search of the KOReader issues. It turns out someone has already noticed the problem and has a possible solution, so I'm happy to see that I am not alone. Too bad that the issue looks frozen:
https://github.com/koreader/koreader/issues/7992
Last edited by MaxStirner; 10-21-2022 at 01:56 PM.
#8 |
Member
Posts: 21
Karma: 10
Join Date: Sep 2013
Device: none
|
Has anyone found a solution for viewing PDF documents from archive.org?
I have no problems with other scanned PDFs that are 10x the size, but even 10 MB archive PDFs are impossibly slow to render (> 30s/page). I tried the dejazap.js tool from the GitHub issue mentioned by @MaxStirner, but it does nothing and doesn't even change the document size. I tried both Mask and SMask. The original link is dead, but the file is probably this one: https://ghostscript.com/~tor/stuff/. I had to replace DeviceGray by mupdf.ColorSpace.DeviceGray to make it run. Script is attached below: Spoiler:
|
#9 |
Zealot
Posts: 135
Karma: 33084
Join Date: Jan 2021
Device: Likebook Mars
|
I read lots of PDFs from archive.org. I always pre-process them using K2pdfopt (https://www.willus.com/k2pdfopt/). File size will usually be larger, but they will load quickly and in a format more friendly for ereader screens. There's a bit of a learning curve though to get optimal parameters (depends on source document, target device screen size, and your preferences).
|
#10 |
Enthusiast
Posts: 43
Karma: 14828
Join Date: Feb 2023
Device: Boox Page, Kobo Aura SE
|
I think the problem is related to the image encoding of Archive.org PDF files. KOReader, or MuPDF, chokes on JPEG 2000-encoded PDF files. Similar report from Sumatra: https://github.com/sumatrapdfreader/...df/issues/1922
One can export every page to PNG and then recombine all the files into a PDF, or use FineReader to accomplish this multi-step task.
#11 | ||
Wizard
Posts: 1,750
Karma: 730681
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
|
Quote:
Quote:
|
||
#12 |
Member
Posts: 21
Karma: 10
Join Date: Sep 2013
Device: none
|
So here are my results so far:
- I have not managed to get anything useful out of the mutool script. This seems to be the way to go: just replace the layer. pdfimages -list clearly shows the 3 layers, 2 rgb and one gray, but the script, in both versions, does absolutely nothing.
- Acrobat Pro does not detect any background images or layers.

These are the file sizes and render times for the different approaches:
* Original file, 11 MB, 1 min/page or crash
* Original file, printed through ClawPDF driver, 172 MB, 1s/page
* Original file, printed through Windows save as pdf, 100 MB, 1s/page
* Original file run through k2pdfopt, default options or with -colorbg ffffff (which doesn't do anything), 72 MB, 1s/page + nasty dot pattern
* djvu pdf2djvu --monochrome, 52 MB, 1s/page, nasty dithering artifacts
* djvu pdf2djvu, 15 MB, 3s/page
* djvu pdf2djvu.com, 11 MB, 40s/page
* mutool version, 11 MB, 1 min/page or crash
* no djvu available on archive.org

It's scary that a 172 MB PDF from ClawPDF renders much faster than the 11 MB archive.org original. Obviously none of the methods above removed the background from the scan.

I've found some brute-force methods using ImageMagick [1], but that seems like the wrong approach. The PDF is already split into layers with OCR'ed text, so it doesn't make sense to me to flatten the PDF, save each page as an individual image, use a tool to try to separate the background from the text, do OCR, and then put everything back together. All of this to deal with lousy archive.org PDFs.

[1] https://old.reddit.com/r/kindlescrib...slg/?context=3
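For anyone who wants to compare the approaches above at a glance, here is a small sketch that puts the measured numbers into a table and picks the smallest file that still renders quickly. The 3 s/page cut-off and the names `results` and `smallestFastEnough` are assumptions made up for this example, "1 min/page or crash" is entered as 60 seconds, and visual quality (dot patterns, dithering) is not taken into account.

```javascript
// Measurements copied from the list above; sizes in MB, render time in s/page.
var results = [
  { method: "original",                sizeMB: 11,  secPerPage: 60 },
  { method: "ClawPDF print",           sizeMB: 172, secPerPage: 1 },
  { method: "Windows save as pdf",     sizeMB: 100, secPerPage: 1 },
  { method: "k2pdfopt",                sizeMB: 72,  secPerPage: 1 },
  { method: "pdf2djvu --monochrome",   sizeMB: 52,  secPerPage: 1 },
  { method: "pdf2djvu",                sizeMB: 15,  secPerPage: 3 },
  { method: "pdf2djvu.com",            sizeMB: 11,  secPerPage: 40 },
  { method: "mutool script",           sizeMB: 11,  secPerPage: 60 }
];

// Return the smallest file among those that render within maxSec per page,
// or null if nothing is fast enough.
function smallestFastEnough(rows, maxSec) {
  var fast = rows.filter(function (r) { return r.secPerPage <= maxSec; });
  if (fast.length === 0) return null;
  return fast.reduce(function (best, r) {
    return r.sizeMB < best.sizeMB ? r : best;
  });
}

var pick = smallestFastEnough(results, 3); // 3 s/page is an arbitrary threshold
console.log(pick.method + ": " + pick.sizeMB + " MB, " + pick.secPerPage + " s/page");
```

With this threshold the plain pdf2djvu conversion comes out as the size/speed compromise, which matches the impression from the list: file size says very little about render time.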
#13 | |
Wizard
Posts: 1,750
Karma: 730681
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
|
Quote:
Swapping the colors in the .clear() commands and changing the mask to SMask takes care of that, and it is indeed a lot faster. Getting rid of the paper also improves render quality on e-ink.
Code:

    if (scriptArgs.length < 2) {
        print("usage: mutool run dejavu.js input.pdf output.pdf");
        quit();
    }

    // 1x1 gray pixmaps: white for the page background, black for the text.
    var bgPix = new Pixmap(DeviceGray, [0,0,1,1], false);
    var fgPix = new Pixmap(DeviceGray, [0,0,1,1], false);
    bgPix.clear(255);
    fgPix.clear(0);

    var doc = new PDFDocument(scriptArgs[0]);
    var bgImg = doc.addImage(new Image(bgPix));

    for (var i = 0; i < doc.countPages(); ++i) {
        var page = doc.findPage(i);
        page.Resources.XObject.forEach(function (name, xobj) {
            var mask = xobj.SMask;
            if (mask) {
                // Image with a soft mask: keep the mask, replace the
                // picture behind it with a plain black pixel.
                var fgImg = doc.addImage(new Image(fgPix, doc.loadImage(mask)));
                page.Resources.XObject[name] = fgImg;
            } else {
                // Image without a mask (the scanned paper): replace it
                // with a plain white pixel.
                page.Resources.XObject[name] = bgImg;
            }
        });
    }

    // Drop the now-unreferenced original images and compress the output.
    doc.save(scriptArgs[1], "garbage=compact,compress");
|
#14 |
Member
Posts: 21
Karma: 10
Join Date: Sep 2013
Device: none
|
For me the converted file looks exactly the same and is 16 kB larger (out of 11MB).
Which version of mutool did you use? I tried 1.23.10+ds1-1build3 on Linux and 1.23.0 on Windows. They create slightly different versions with no visible difference. I had to replace DeviceGray with mupdf.ColorSpace.DeviceGray for both. Maybe my PDF is different: I downloaded it a long time ago and can't find the original anymore. I'll see if I can find a publicly available PDF for comparison purposes.

pdfimages shows the two images and the mask layer:

    1 0 image 1816 2925 rgb  3 8 jpx   no 498 0 360 360 72.7K 0.5%
    1 1 image 1816 2925 rgb  3 8 jpx   no 500 0 360 360 8067B 0.1%
    1 2 smask 1816 2925 gray 1 1 jbig2 no 500 0 360 360 47.3K 7.3%
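To check other files for the same structure, the listing above can be parsed with a few lines of JavaScript. This is only a sketch: the column order (page, num, type, width, height, color, comp, bpc, enc, interp, object, ID, x-ppi, y-ppi, size, ratio) is assumed from the usual `pdfimages -list` layout, and the helper names `parseImageRow` and `looksLikeLayeredScan` are made up for this example.

```javascript
// Parse one data row of `pdfimages -list` output into an object.
// Returns null for header, separator, or otherwise malformed lines.
function parseImageRow(line) {
  var f = line.trim().split(/\s+/);
  if (f.length < 16) return null;
  var page = Number(f[0]);
  if (isNaN(page)) return null; // e.g. the "page num type ..." header line
  return {
    page: page,
    num: Number(f[1]),
    type: f[2],          // "image", "smask", ...
    width: Number(f[3]),
    height: Number(f[4]),
    color: f[5],
    enc: f[8],           // "jpx", "jbig2", ...
    object: Number(f[10]),
    size: f[14]
  };
}

// Heuristic: JPEG 2000 images plus a soft mask suggests the layered
// scan structure discussed in this thread.
function looksLikeLayeredScan(rows) {
  var jpx = rows.filter(function (r) { return r.type === "image" && r.enc === "jpx"; });
  var masks = rows.filter(function (r) { return r.type === "smask"; });
  return jpx.length > 0 && masks.length > 0;
}

// The three rows quoted above.
var sample = [
  "1 0 image 1816 2925 rgb 3 8 jpx no 498 0 360 360 72.7K 0.5%",
  "1 1 image 1816 2925 rgb 3 8 jpx no 500 0 360 360 8067B 0.1%",
  "1 2 smask 1816 2925 gray 1 1 jbig2 no 500 0 360 360 47.3K 7.3%"
];
var rows = sample.map(parseImageRow).filter(Boolean);
console.log(rows.length + " rows, layered scan: " + looksLikeLayeredScan(rows));
```

Note that in this listing the smask row carries the same object number (500) as the second image, so the mask belongs to that image, which is the case the SMask branch of the script is meant to catch.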
#15 |
Wizard
Posts: 1,750
Karma: 730681
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
|
The script I posted works in 1.21; the original version is likely intended for 1.19.
|