PDF compatibility


zuflacht
03-04-2012, 09:02 PM
Hi,

As I am new to the forum: thanks for all the useful information here!

I am using my new Onyx mostly for reading PDFs, so I was wondering whether anyone knows why some PDFs don't display on it, for instance this one: http://www.archive.org/details/suicidestudyinso00durk

Of course, I could send it through a virtual PDF printer such as my dear Quartz (Mac user), but if I do, the document gains some 600GB against its original 10MB ... And no, I'd rather keep the page layout than use an ebook format; that was the reason I got an M92 ...

Thanks for any help!
Best,
Johan

pidgeon92
03-04-2012, 11:43 PM
Try opening the PDF in Preview, and then saving it as a PDF. It shouldn't add any file size, and when you add the new file to the Boox, it should open correctly.

zuflacht
03-05-2012, 04:32 AM
Thanks, but unfortunately, even if it shouldn't change the file size, it does: instead of 14 MB, I get 899 MB ...

PF4Mobile
03-05-2012, 09:24 AM
Wow, that is a really bad PDF.
I tried a couple of tricks on it, but nothing worked.
I had to give up (no time for now) ... will try again later.

tuxor
03-05-2012, 10:08 AM
This PDF consists of pictures with multiple layers: one for the text (~800 KB per page if saved directly from Evince as PNG or JPG in decent quality), one for the background, and two additional layers I don't understand. Really good work has been done here in extracting the text layer after scanning. I wonder whether you could even remove the yellowish background layer without losing any readability or information? Then of course you have the plain-text information from OCR, which doesn't amount to a significant part of the file size.

By the way, displaying this PDF in Evince is really slow on my notebook (2.4 GHz Core 2 Duo, 4 GB RAM), so I'm not surprised it's kind of a challenge for the M92. Printing it with cups-pdf is slow and returns a file of approx. 1 GB (I only printed the first 20 pages for testing) that doesn't really contain what you'd expect.

Extracting all images with the command "pdfimages" yields 3 ppm files and 1 pbm file (an image format with only two colors) per page. All images together amount to more than 8 GB (my estimate). If you keep only the pbm files, which contain the text at approx. 2000x3000 pixels, it's about 320 MB. Convert those pbm files to PNG and you have approx. 30 KB per page, so all in all 30 KB * 400 pages = 12,000 KB = 12 MB for the text layer of the whole PDF extracted as PNG.
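
As a rough cross-check of those numbers: a raw 1-bit image of 2000x3000 pixels takes 2000*3000/8 = 750,000 bytes, which over about 400 pages lands in the same ballpark as the 320 MB quoted for the pbm files. A minimal sketch of that arithmetic, assuming exactly 400 pages and the approximate sizes from the post (the function names are made up for illustration):

```shell
#!/bin/bash
# Bytes needed for a raw 1-bit-per-pixel image (header and row padding ignored).
raw_1bit_bytes() {
    local width=$1 height=$2
    echo $(( width * height / 8 ))
}

# Total size in whole MB for a number of pages at a given size per page in KB,
# using 1 MB = 1000 KB as in the post.
total_mb() {
    local pages=$1 kb_per_page=$2
    echo $(( pages * kb_per_page / 1000 ))
}

raw_1bit_bytes 2000 3000   # bytes per page for the 1-bit text image
total_mb 400 750           # all pbm pages, in MB
total_mb 400 30            # all PNG pages, in MB
```

With these inputs the three calls print 750000, 300 and 12. The gap between 300 and the quoted 320 MB is within what "approx." page counts and pixel sizes would explain.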

PF4Mobile
03-05-2012, 10:43 AM
Adobe Acrobat doesn't see any layers there ... are you sure those are layers?
I do see objects overlaid in the Content pane.

Edit: I tried to delete the image object underlying the text, and the text disappeared.
The text object was there, but there is something wrong with the font (not embedded?) or with the text cassette visibility ... it beats me what it is.

If you do not plan to copy text from this document, just find the version without OCR.
Actually, the M92 seems to have a problem with the image layers, since all the pages appeared blank. Now I realize that the text must have been there, but I could not see it.

The other way to solve the problem (if you insist on reading the file as a PDF) is to get the EPUB file from the same page and convert it into a PDF with calibre or something else.

tuxor
03-05-2012, 10:59 AM
Well, I don't even have Adobe Acrobat - I don't need it, it's too expensive and it doesn't run on Linux... ;-) I was just looking at what the command "pdfimages" returned and what I got when exporting images from inside the document with evince.

Unfortunately, many pages in that document carry annotations. They amount to more than 150 KB each when exported as PNG, so that's more than 60 MB in the end :-/

PF4Mobile
03-05-2012, 11:08 AM
Those commands seem to be misleading, since the layers you mentioned do not seem to be there, unless Adobe Acrobat is wrong.
Other PDF viewers that I tried do not seem to see them either.

Booxtor
03-05-2012, 11:42 AM
I have tried to open that PDF document on all my ereaders that support PDF (Pocketbook 903, Sony PRS650); they don't display this file properly either. It must be something special with the PDFs from those archive pages :(

tuxor
03-05-2012, 11:46 AM
What I wanted to say is that I don't really know the PDF format at all. I don't know whether it has "layers" or anything like that. I was just playing around with some PDF tools and looking at the results...

However: maybe zuflacht can try this pdf on his M92, it's the book from the first post in a slightly different format (only first 30 pages and in png) and there's a small chance it might be displayed correctly on the M92:83486

Beryll Snyder
03-05-2012, 01:08 PM
It displays all right on my Nook classic, without the annotations and maps.
Funny formatting, though, and a hodgepodge of fonts.

zuflacht
03-05-2012, 04:40 PM
Thanks everyone, particularly tuxor and eLiNK (by private message), those files work fine! It seems the PNG version from tuxor has better contrast ...
I had the chance to check this PDF in Adobe Acrobat. It reported two images per page: one is the scan, the other has the "interpolate flag" set, so this is probably where the problem is. Would anyone know how to get rid of all those extra images (they also have a lower resolution) other than by exporting and reimporting, i.e., some kind of batch process or preflight fix?
Thanks again!

tuxor
03-05-2012, 05:21 PM
Okay, since the way I did it seems to work, I will also contribute the small bash script that I wrote to get the png-pdf-version:
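#!/bin/bash
# Usage: pass the input PDF as the first argument; writes output.pdf.
# Needs pdfimages (poppler-utils), convert (ImageMagick) and pdftk.
for i in {1..416}   # 416 = number of pages in this particular book
do
    j=$(printf %03d "$i")   # zero-pad so the final glob sorts in page order
    # Extract all images of page i
    pdfimages -j -f "$i" -l "$i" "$1" __tmpfile
    # Drop the colour images (with -j, DCT images are written as .jpg)
    rm -f __tmpfile*.ppm __tmpfile*.jpg
    # Invert the extracted 1-bit text image while converting it to PNG
    convert -negate __tmpfile*.pbm "__tmpimg$j.png"
    rm -f __tmpfile*.pbm
    # Wrap the PNG in a single-page PDF
    convert "__tmpimg$j.png" "__tmpimg$j.pdf"
    rm -f __tmpimg*.png
done
# Join the single-page PDFs into one file
pdftk __tmpimg*.pdf cat output output.pdf
rm -f __tmpimg*.pdf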
This script takes the path to the input PDF as its argument and writes "output.pdf" to the working directory. The final PDF will be approx. 54 MB, and the procedure takes really long and uses a lot of CPU power. The same script probably won't work with most other PDFs, but there's a good chance it will work with some of the PDFs on archive.org that stem from the same OCR software.
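
The 416 in the loop is the page count of this particular book. For other files, the count could probably be read from "pdfinfo", which ships with poppler-utils alongside "pdfimages". A minimal sketch, where the helper name page_count and the canned sample text are made up for illustration:

```shell
#!/bin/bash
# Extract the page count from pdfinfo-style output on stdin.
# pdfinfo prints a line of the form "Pages:          416".
page_count() {
    awk '/^Pages:/ { print $2; exit }'
}

# Intended use (assumes pdfinfo from poppler-utils is installed):
#   pages=$(pdfinfo "$1" | page_count)
#   for i in $(seq 1 "$pages"); do ...; done

# Demonstration with canned sample text, so no PDF is needed:
printf 'Producer:       example\nPages:          416\n' | page_count
```

Note that with more than 999 pages the %03d padding in the script would no longer keep the glob in page order, so the padding width would have to grow as well.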

Unfortunately, if you are on Windows, there is no way of using this script. But I uploaded the whole converted file and will send the link via PM on request.

FDD
03-06-2012, 03:47 AM
Did anybody try the DjVu version of the file? It usually works better than PDF for scanned documents.

Beryll Snyder
03-06-2012, 04:20 AM
Quote (FDD): Did anybody try the DjVu version of the file? It usually works better than PDF for scanned documents.

In a scientific context you need page numbers for quoting etc. ...

tuxor
03-06-2012, 10:21 AM
How about a sticky about PDFs that don't work? At the moment I don't have a special one in mind, but I can imagine there will be more in the future...

zuflacht
03-07-2012, 05:59 PM
Well, after all it seems that these additional images (smaller resolution and with the interpolate flag set to true) weren't at fault: I found a way to delete them all through preflight fixups, but the resulting file still couldn't be read by the M92.

I further discovered that this file displays fine if you choose a zoom level of 100% or 125% (quite small, covering less than half the screen), but as soon as it is 150% or more (or fit screen), the page is white ...

I'm still not sure what exactly the issue is. As soon as I have time, I will try to fiddle with the OCR and other parts of the file layout. It seems related to how the file is rendered at different zoom levels, but it would be nice to know whether the M92 or the file is to blame ...

zuflacht
03-07-2012, 07:13 PM
After some more tests, it seems likely that the M92's PDF reader has a problem with transparency in this PDF (this seems to be how the scanned image and the OCR are integrated in this case). Is anyone able to confirm this?
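
One cheap way to look for hints, without Acrobat, might be to search the raw file for the image-dictionary keys that PDF uses for masking: /SMask (soft mask) and /ImageMask (stencil mask). Whether this particular file uses either is not established in this thread, so the following is only a sketch; the helper name count_pdf_key and the file name book.pdf are placeholders:

```shell
#!/bin/bash
# Count the lines of a PDF that mention a given dictionary key.
# grep -a treats the binary file as text; -F matches the key literally.
# Rough indicator only: it counts matching lines, not objects, and it
# cannot see keys stored inside compressed object streams.
count_pdf_key() {
    local file=$1 key=$2
    grep -a -c -F -- "$key" "$file"
}

# Intended use:
#   count_pdf_key book.pdf /SMask
#   count_pdf_key book.pdf /ImageMask
```

A non-zero count would only show that masked images are present. It would not prove that they are what trips up the M92's renderer.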