PDF compatibility

zuflacht · 03-04-2012, 09:02 PM

Hi,

As I am new to the forum: thanks for all the useful information here!

I am using my new Onyx mostly for reading pdf, so I was wondering if someone has an idea why some pdfs don't display on the Onyx like for instance this one: http://www.archive.org/details/suicidestudyinso00durk

Of course, I could send it through some virtual pdf printer like my dear Quartz (Mac user), but if I do, the document gains some 600GB against its original 10MB ... And no, I'd rather keep the page layout than use an ebook format; that was the reason I got an M92 ...

Thanks for any help!
Best,
Johan

pidgeon92 · 03-04-2012, 11:43 PM

Try opening the PDF in Preview, and then saving it as a PDF. It shouldn't add any file size, and when you add the new file to the Boox, it should open correctly.

zuflacht · 03-05-2012, 04:32 AM

Thanks, but unfortunately, even if it shouldn't change the file size, it does: instead of 14 MB, I get 899 MB ...

PF4Mobile · 03-05-2012, 09:24 AM

wow

That is a really bad pdf.
I tried a couple of tricks on it, but nothing worked.
I had to give up (no time for now) ... will try again later.

tuxor · 03-05-2012, 10:08 AM

This pdf consists of pictures with multiple layers. One layer for the text (~800KB per page if saved directly from Evince as png or jpg in decent quality), one for the background (and two additional layers I don't understand) - really good work has been done here in extracting the text layer after scanning - I wonder whether you could even remove the yellowish background layer without losing any quality of reading/information? Then of course you have the plain text information from OCR which doesn't amount to a significant part of the file size.

By the way, showing this pdf in Evince is really slow on my notebook with 2.4 GHz Core2Duo with 4GB RAM - so I'm not surprised it's kind of a challenge for the M92. Printing it with cups-pdf is slow and returns a file of appx 1GB (only printed the first 20 pages for testing) that doesn't really contain what you'd expect.

Extracting all images with command "pdfimages" yields 3 ppm files and 1 pbm file (image format with only two different colors) per page. All images together amount to more than 8GB (I estimate). If you only keep the pbm files, which contain the text information in appx 2000x3000 pixels, it's about 320 MB. Convert those pbm files to png and you have appx 30 KB per page, so all in all 30*400=12000KB=12MB for the text layer in the whole PDF extracted as PNG.
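The estimates above can be roughly checked with a bit of shell arithmetic. This is only a sketch: the helper names are made up for illustration, the 2000x3000 pixel size and the ~400 pages are the approximate figures from the post, and the pbm header is ignored.

```shell
#!/bin/bash
# Rough size estimates for the bitonal text layer (illustrative helper names).

# Raw PBM stores 1 bit per pixel, so bytes = width * height / 8 (header ignored).
pbm_bytes_per_page() {
   local width=$1 height=$2
   echo $(( width * height / 8 ))
}

# Total size in MB (1 MB = 1000 KB, as in the post) for N pages of K KB each.
total_mb() {
   local pages=$1 kb_per_page=$2
   echo $(( pages * kb_per_page / 1000 ))
}

pbm_bytes_per_page 2000 3000    # 750000 bytes, i.e. about 750 KB per page
total_mb 400 750                # raw pbm files for ~400 pages
total_mb 400 30                 # the same pages as png at ~30 KB each
```

With roughly 400 pages the raw pbm files land in the same range as the "about 320 MB" mentioned above, and the png figure matches the 12 MB estimate.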

PF4Mobile · 03-05-2012, 10:43 AM

Adobe Acrobat doesn't see any layers there ... are you sure those are layers?
I do see objects overlaid in the Content pane.

Edit: I tried to delete the image object underlying the text and the text disappeared.
The text object was there, but there is something wrong with the font (not embedded?) or with the text cassette visibility ... it beats me what it is.

If you do not plan to copy text from this document, just find the version without OCR.
Actually the M92 seems to have a problem with the image layers, since all the pages seemed to be blank. Now I realize that the text must have been there but I could not see it.

The other way to solve the problem (if you insist on reading the file as PDF) is to get the epub file from the same page and convert it into a PDF with calibre or something else.

tuxor · 03-05-2012, 10:59 AM

Well, I don't even have Adobe Acrobat - I don't need it, it's too expensive and it doesn't run on Linux... ;-) I was just looking at what the command "pdfimages" returned and what I got when exporting images from inside the document with evince.

Unfortunately there are many pages with annotations in that document. They amount to more than 150KB each when exported as png, so that's more than 60MB in the end :-/

PF4Mobile · 03-05-2012, 11:08 AM

Those commands seem to be misleading, since the layers you mentioned do not seem to be there, unless Adobe Acrobat is wrong.
Other PDF viewers that I tried do not seem to see them either.

Booxtor · 03-05-2012, 11:42 AM

I have tried to open that PDF document on all my PDF-supporting ereaders (Pocketbook 903, Sony PRS650); they don't display this file properly either. It must be something special with the PDFs from those archive pages.

tuxor · 03-05-2012, 11:46 AM

What I wanted to say is, that I have no idea of the whole pdf format at all. I don't know whether there are "layers" or anything like that at all in the pdf format. I was just playing around with some pdf tools and looking at the result...

However: maybe zuflacht can try this pdf on his M92. It's the book from the first post in a slightly different format (only the first 30 pages, and in png), and there's a small chance it might be displayed correctly on the M92: output.pdf

Beryll Snyder · 03-05-2012, 01:08 PM

It displays all right on my Nook classic, without the annotations and maps.
Funny formatting though, and a hodgepodge of fonts.

zuflacht · 03-05-2012, 04:40 PM

Thanks everyone, particularly tuxor and eLiNK (by private msg.), those files work fine! It seems the png version from tuxor has better contrast ...
I had the chance to check this pdf in Adobe Acrobat: it reported two images per page. One is the scan, the other has the "interpolate flag" set, so this is probably where the problem is. Would anyone know how to get rid of all those extra images (they also have a smaller resolution) besides exporting and reimporting, i.e., some kind of batch process or preflight fix?
Thanks again!
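
One way to at least locate the images with the interpolate flag from the command line is sketched below. It does not remove anything. It assumes a poppler build of pdfimages that supports the -list option and prints an "interp" column; the function name is made up for illustration.

```shell
#!/bin/bash
# Print "page num" for every image whose "interp" column says "yes".
# Reads the table produced by `pdfimages -list` on stdin.
interpolated_images() {
   awk '
      NR == 1 { for (i = 1; i <= NF; i++) if ($i == "interp") col = i; next }
      NR == 2 { next }                       # dashed separator line
      col && $col == "yes" { print $1, $2 }  # page number, image number
   '
}

# Usage, only when the script is called with a pdf as argument:
if [ -n "${1:-}" ]; then
   pdfimages -list "$1" | interpolated_images
fi
```

The column is looked up by its header name rather than by a fixed position, so the filter simply prints nothing if the installed version has no such column.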

tuxor · 03-05-2012, 05:21 PM

Okay, since the way I did it seems to work, I will also contribute the small bash script that I wrote to get the png-pdf-version:

Code:

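#!/bin/bash
# Needs pdfimages, convert (ImageMagick) and pdftk.
# $1 = path to the input pdf; 416 = number of pages of this particular book.
for i in {1..416}
do
   j=$(printf %03d "$i")
   # extract all images of page $i
   pdfimages -j -f "$i" -l "$i" "$1" __tmpfile
   # throw away the ppm images, keep only the two-color pbm (the text)
   rm -f __tmpfile*.ppm
   convert -negate __tmpfile*.pbm "__tmpimg$j.png"
   rm -f __tmpfile*.pbm
   # wrap the png in a one-page pdf
   convert "__tmpimg$j.png" "__tmpimg$j.pdf"
   rm -f __tmpimg*.png
done
# join the single pages; the zero-padded numbers keep them in order
pdftk __tmpimg*.pdf cat output output.pdf
rm -f __tmpimg*.pdf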

This script needs the path to the input pdf as argument and will write to "output.pdf" in the working directory. The final pdf will be appx 54MB and the procedure will take really long and use a lot of cpu power. The same script probably won't work with most other pdfs, but there's a good chance it will work with some of the pdfs on archive.org that stem from the same ocr software.
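
The page count in the loop is hardcoded for this one book. A possible way to read it from the file instead is sketched here, assuming pdfinfo (which ships with the same tools as pdfimages) is installed; the function name is made up for illustration.

```shell
#!/bin/bash
# Extract the page count from `pdfinfo` output read on stdin.
page_count() {
   awk '/^Pages:/ { print $2; exit }'
}

# Usage, only when a pdf is given as argument:
if [ -n "${1:-}" ]; then
   pages=$(pdfinfo "$1" | page_count)
   # {1..$pages} would not expand with a variable, so seq is used instead
   for i in $(seq 1 "$pages"); do
      printf 'would process page %03d\n' "$i"
   done
fi
```

The loop body only prints the page numbers; the extraction and conversion steps from the script would go in its place.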

Unfortunately, if you are on Windows, you can't use this script as it is. But I uploaded the whole converted file and will send the link via pm on request.

FDD · 03-06-2012, 03:47 AM

Did anybody try the DjVu version of the file? It usually works better than PDF for scanned documents.

Beryll Snyder · 03-06-2012, 04:20 AM

Quote:

Originally Posted by FDD

Did anybody try the DjVu version of the file? It usually works better than PDF for scanned documents.

In a scientific context you need page numbers for quoting etc. ...

