Archive.org: can't read any downloaded PDF

rakista · 04-26-2010, 06:17 PM

Was going through some books on early American humanist movements and found I could not read them on the reader. With PDF To Go I just get endless red X's, and on the ebook side I get endless blank pages.

What is going on?

aidren · 04-26-2010, 08:47 PM

Rakista

I have a few of these. They are not readable on the eDGe unless they have a hidden text layer. Any that I have with the hidden text layer are fine... with this warning... if you export/transport them out of the library/internal hard drive, the reader software strips the hidden text. So, you have to make sure you keep your original copy elsewhere, and do not overwrite it.

As Boris said, many of the older books are image only. Sometimes it is because they have more than one language represented in them, or inline symbols. I have several like this.

You have to run ocr on them to create the hidden text layer. You can do this in Acrobat Standard and up, although with the book that I did myself, I found that Acrobat didn't do an adequate job. It seemed to just run everything automatically. I used scanning software that allowed me to train and check the ocr, as well as export it as a layered pdf. It worked much better than Acrobat. It was a lot of work, but I was interested in keeping the book as resource material.

I'm a Mac user, so the scanning software I used was Readiris Pro. You'll have to ask the Windows users what scanning software is best. I think it might be ABBYY FineReader? Also, the ocr quality is somewhat dependent on the image quality. The higher the resolution, the sharper the edges, and the better the chance the ocr has of interpreting the text.

Also, you will not be able to "reflow" the text.
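For anyone who wants to check whether a pdf still carries its hidden text layer (for example, before and after moving it out of the eDGe library), here is a rough Python sketch using only the standard library. It is a heuristic, not a real PDF parser: it assumes the page content streams are either uncompressed or Flate-compressed, and it treats text render mode 3 (the `3 Tr` operator, which PDF uses for invisible text) as the sign of a hidden ocr layer. The function names are made up for illustration.

```python
import re
import zlib

# Heuristic patterns (assumptions for this sketch, not a full PDF parser).
OBJ_HEAD = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\b(stream|endobj)\b", re.DOTALL)
IS_IMAGE = re.compile(rb"/Subtype\s*/Image\b")
SHOWS_TEXT = re.compile(rb"[)>\]]\s*T[jJ]\b")  # "(string) Tj" or "[array] TJ"
INVISIBLE = re.compile(rb"\b3\s+Tr\b")         # text render mode 3 = invisible


def iter_streams(pdf_bytes):
    """Yield (dictionary_bytes, raw_stream_body) for each stream object."""
    pos = 0
    while True:
        head = OBJ_HEAD.search(pdf_bytes, pos)
        if head is None:
            return
        pos = head.end()
        if head.group(2) != b"stream":
            continue  # ordinary object with no stream attached
        end = pdf_bytes.find(b"endstream", pos)
        if end == -1:
            return
        body = pdf_bytes[pos:end]
        # The "stream" keyword is followed by CRLF or LF before the data.
        if body.startswith(b"\r\n"):
            body = body[2:]
        elif body.startswith(b"\n"):
            body = body[1:]
        yield head.group(1), body
        pos = end


def text_layer_report(pdf_bytes):
    """Report whether any text is drawn, and whether some of it is invisible."""
    shows_text = invisible = False
    for dictionary, body in iter_streams(pdf_bytes):
        if IS_IMAGE.search(dictionary):
            continue  # image data, not page content
        try:
            data = zlib.decompressobj().decompress(body)
        except zlib.error:
            data = body  # assume the stream was stored uncompressed
        if SHOWS_TEXT.search(data):
            shows_text = True
        if INVISIBLE.search(data):
            invisible = True
    return {"has_text": shows_text, "has_invisible_text": shows_text and invisible}
```

To use it, read the file in binary mode and pass the bytes in, e.g. `text_layer_report(open("book.pdf", "rb").read())` (the file name is a placeholder). If the original reports `has_invisible_text` as true and the exported copy does not, the text layer was probably stripped. Streams that use other filters, or text placed inside unusual structures, can fool this check, so treat a negative result as a hint rather than proof.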

dcubed2 · 04-26-2010, 08:58 PM

I can read most of the image pdfs from archive.org, but some are blank. From what I can tell, the ones with additional info (like author) embedded don't work. Also, like Boris said, the ones from Google don't show any images that might be there.

When I open one of the blank pdfs, I usually get the full title (not file name) and author in the top margin of the eInk side. I haven't found any way to tell in advance which files will be blank.

You can try Project Gutenberg if you don't mind the ocr look. They do epub instead of pdf. Many Google books are also in epub format, and all the epub files I've downloaded from both sites have worked. For my books, there are a lot of ocr errors and reading through them is a pain.

aidren · 04-26-2010, 09:44 PM

Quote:

...there are a lot of ocr errors and reading through them is a pain.

That is because whoever ocr'd them did it with automatic settings, or within Acrobat or some other such thing.

I have an image pdf from Google, but I ran the ocr myself, set it as a hidden layer, and it is fine... but, like I said, it required some work. Just to be clear, though, it is the image layer you are reading in these.

aidren · 04-26-2010, 09:47 PM

Quote:

Originally Posted by borisb

No word from enTourage what if anything they can do via a software update to the eReader.

This was one of the questions I asked tech support. This was the answer:

Quote:

Currently, exporting a pdf with the text layer intact is in the planning phase, but it is not targeted for an upcoming feature release. I do expect this capability to appear at some point in the future.

aidren · 05-04-2010, 07:01 PM

Just to add a little more to the blank pdf problem: I was investigating one of these today. I think what may be happening relates to the image formats being used. I believe the eDGe only reads jpeg and png? I tried to find out what type of image files were in the pdf I was looking at, but wasn't able to. I did find out that there was some type of compression on it that kept it from copying into InDesign (the message said I needed QuickTime to view it because of the compression?). So what I did was export a couple of images out of the pdf to jpeg and rebuild a page. That page viewed correctly on the eDGe.

I believe most of the automatic ocr setups use tiff because it's not lossy, or they give you a selection but the default is set to tiff. So I'm kind of thinking that's the problem, in addition to whatever compression type has been used. The pdf I was dealing with was output from Ghostscript.

Hope it sheds a bit more light.
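On finding out what kind of images are inside a pdf: a PDF does not store tiff or png files as such. Each image is a stream whose dictionary names a compression filter, and that filter tells you the encoding. Below is a rough, standard-library-only Python sketch that tallies those filter names. It assumes the image dictionaries sit directly in the file with an inline `/Filter` entry, which is usual but not guaranteed, and the function names are hypothetical.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions for this sketch, not a full PDF parser).
OBJ_HEAD = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\b(stream|endobj)\b", re.DOTALL)
IS_IMAGE = re.compile(rb"/Subtype\s*/Image\b")
FILTER = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
NAME = re.compile(rb"/([A-Za-z0-9]+)")

# Standard PDF stream filters and the image encoding each one indicates.
FILTER_MEANING = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "JBIG2Decode": "JBIG2 (black and white)",
    "CCITTFaxDecode": "CCITT fax (black and white)",
    "FlateDecode": "Flate/zlib (lossless)",
    "LZWDecode": "LZW (lossless)",
    "RunLengthDecode": "run-length",
}


def image_filters(pdf_bytes):
    """Count image streams by filter, e.g. Counter({'JPXDecode': 312})."""
    counts = Counter()
    pos = 0
    while True:
        head = OBJ_HEAD.search(pdf_bytes, pos)
        if head is None:
            return counts
        pos = head.end()
        if head.group(2) != b"stream":
            continue  # ordinary object with no stream attached
        dictionary = head.group(1)
        if IS_IMAGE.search(dictionary):
            found = FILTER.search(dictionary)
            names = NAME.findall(found.group(1)) if found else []
            key = "+".join(n.decode("ascii") for n in names) or "(none)"
            counts[key] += 1
        # Skip the binary body so stray bytes are not mistaken for objects.
        end = pdf_bytes.find(b"endstream", pos)
        if end == -1:
            return counts
        pos = end


def describe(key):
    """Translate a tally key such as 'FlateDecode+DCTDecode' into plain names."""
    return " + ".join(FILTER_MEANING.get(name, name) for name in key.split("+"))
```

Read the file in binary mode, call `image_filters(data)`, and run each key through `describe` to see the result in plain terms. If the blank files turn out to use something other than `DCTDecode` (jpeg) while the readable ones use jpeg, that would fit the theory above, though it would not prove it.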



