MobileRead Forums - View Single Post - Koreader is poor in handling Internet Archive books

DanCa · 11-18-2024, 08:38 PM

So here are my results so far:

- I have not managed to get anything useful out of the mutools script. Replacing the layer still seems like the way to go: pdfimages -list clearly shows the 3 layers, 2 RGB and one gray, but neither of the two versions of the script does anything at all.

- Acrobat Pro does not detect any background images or layers
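
For anyone who wants to check their own file for the same layer layout, here is a rough Python sketch that summarizes what pdfimages -list reports per page. It assumes poppler's pdfimages is on the PATH and that the first columns of the listing are page, num, type, width, height, color, comp, bpc, enc. The function names and the SAMPLE rows are made up for illustration (only the "2 rgb + 1 gray" pattern comes from my file), so treat it as a starting point rather than a tested tool.

```python
import subprocess
from collections import defaultdict


def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into {page: [(type, color, enc, width, height), ...]}.

    Assumes the leading columns are: page num type width height color comp bpc enc.
    """
    pages = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with two integers (page, image number);
        # this skips the header line and the dashed ruler below it.
        if len(fields) < 9 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        page = int(fields[0])
        kind = fields[2]
        width, height = int(fields[3]), int(fields[4])
        color, enc = fields[5], fields[8]
        pages[page].append((kind, color, enc, width, height))
    return dict(pages)


def list_layers(pdf_path):
    """Run pdfimages -list on a file (requires poppler-utils) and parse the result."""
    out = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_pdfimages_list(out)


# Invented sample rows, NOT real output from the archive.org file.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     850  1100  rgb     3   8  jpx    no        10  0   100   100 20.0K 0.7%
   1     1 image     850  1100  rgb     3   8  jpx    no        11  0   100   100 15.0K 0.5%
   1     2 smask    2550  3300  gray    1   1  jbig2  no        11  0   300   300 30.0K 2.9%
"""

if __name__ == "__main__":
    for page, layers in parse_pdfimages_list(SAMPLE).items():
        for kind, color, enc, width, height in layers:
            print(page, kind, color, enc, f"{width}x{height}")
```

To run it against a real file, call list_layers("book.pdf") with your own path instead of parsing SAMPLE.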

These are the file sizes and render times for the different approaches:

* Original file, 11MB, 1min/page or crash
* Original file, printed through ClawPDF driver, 172 MB, 1s/page
* Original file, printed through Windows save as pdf, 100MB, 1s/page
* Original file run through k2pdfopt, default options or with -colorbg ffffff (which doesn't do anything), 72 MB, 1s/page + nasty dot pattern
* djvu pdf2djvu --monochrome, 52 MB, 1s/page, nasty dithering artifacts
* djvu pdf2djvu, 15 MB, 3s/page
* djvu pdf2djvu.com, 11 MB, 40s/page
* mutools version, 11MB, 1min/page or crash
* no djvu available on archive.org

It's scary that a 172 MB pdf from ClawPDF renders much faster than the 11MB archive.org original.

Obviously none of the methods above removed the background from the scan.
I've found some brute force methods using ImageMagick [1], but that seems like the wrong approach. The pdf is already split into layers with OCR'ed text, so it doesn't make sense to me to flatten it, save each page as an individual image, use a tool to try to separate the background from the text, redo the OCR, and then put everything back together. All of this just to deal with lousy archive.org pdfs.

[1] https://old.reddit.com/r/kindlescrib...slg/?context=3
