11-14-2014, 07:45 PM   #8
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by MaudlinHaus
You make some good points, but you make a bunch of incorrect statements about PDF. If you generate a PDF from InDesign, you get fully searchable, highlightable, copyable text--OCR is not in the picture at all.
You would think... you would think. PDF is quite a complex file type, and the way that many programs piece things together on the backend causes a heck of a lot of headaches. A lot of this is also VERY dependent on the settings that were actually used to create the PDF.

But get down into the nitty gritty, and things get UGLY. For example, ligatures might disappear in the text backend, and characters with diacritics like 'ñ' might just show up as 'n'. (In the rendered PDF you can still see the little tilde and the ligatures, but in the actual text backend, nope.)

Then a lot of metadata can be tossed out the window: things such as footnotes/sidebars/headers/footers/captions might not be marked as such. The PDF knows the LOCATION of this text, and it knows exactly where to plop it when you are printing/displaying it in a PDF Reader, but it doesn't know WHAT it is (this is extremely important when making an ebook).

You can pull the PLAIN TEXT out very easily (although with no formatting). But formatting is EXTREMELY important to the look of the book.

Then if you look at the actual code, oh boy. Using something like xpdf or poppler might get you this:

Spoiler:
Quote:
[{"top":599,"left":60,"width":32,"height":15,"font":2,"data":"Boer"},{"top":599,"left":92,"width":4,"height":15,"font":2,"data":" "},{"top":599,"left":95,"width":28,"height":15,"font":2,"data":"War"},{"top":599,"left":124,"width":4,"height":15,"font":2,"data":" "},{"top":599,"left":127,"width":54,"height":15,"font":2,"data":"Veteran"},{"top":599,"left":181,"width":4,"height":15,"font":2,"data":" "},{"top":599,"left":185,"width":41,"height":15,"font":2,"data":"Status"},{"top":620,"left":60,"width":53,"height":15,"font":2,"data":"Thomas"},{"top":620,"left":113,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":117,"width":59,"height":15,"font":2,"data":"returned"},{"top":620,"left":176,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":180,"width":14,"height":15,"font":2,"data":"to"},{"top":620,"left":194,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":197,"width":59,"height":15,"font":2,"data":"Adelaide"},{"top":620,"left":257,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":260,"width":14,"height":15,"font":2,"data":"as"},{"top":620,"left":275,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":278,"width":8,"height":15,"font":2,"data":"a"},{"top":620,"left":286,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":290,"width":63,"height":15,"font":2,"data":"wounded"},{"top":620,"left":353,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":356,"width":68,"height":15,"font":2,"data":"decorated"},{"top":620,"left":425,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":428,"width":32,"height":15,"font":2,"data":"Boer"},{"top":620,"left":460,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":464,"width":28,"height":15,"font":2,"data":"War"},{"top":620,"left":492,"width":4,"height":15,"font":2,"data":" "},{"top":620,"left":496,"width":54,"height":15,"font":2,"data":"Veteran"},{"top":620,"left":549,"width":4,"height":15,"font":2,"data":"


That is just to display the words "Boer War Veteran Status" and "Thomas returned to Adelaide as a wounded decorated Boer War Veteran". The way that PDF works is that it places words in EXACT positions. This is why you have things like "heuristics" in Calibre: to try to GUESS what goes where, what goes in what logical order, what font that was supposed to be, whether it was supposed to be bold/italics, and what is a paragraph. (Again, heuristics are going to get a lot of things wrong, so lots of errors get introduced.)
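
To give a feel for what even the simplest such heuristic involves, here is a rough Python sketch (my own illustration, NOT Calibre's actual code). It assumes the boxes have the same "top"/"left"/"data" fields as the dump above, and the function name and the 2-unit tolerance are made up for the example. All it does is guess which boxes share a line and in what order they read:

```python
def words_to_lines(boxes, tolerance=2):
    """Guess lines of text from positioned boxes.

    Boxes whose "top" is within `tolerance` units of the first box on the
    current line are treated as the same line (an arbitrary, assumed threshold).
    """
    lines = []  # each entry is [top_of_first_box, [boxes on that line]]
    for box in sorted(boxes, key=lambda b: (b["top"], b["left"])):
        if lines and abs(box["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(box)
        else:
            lines.append([box["top"], [box]])

    result = []
    for _, row in lines:
        row.sort(key=lambda b: b["left"])  # left-to-right reading order
        result.append("".join(b["data"] for b in row))
    return result


# A few boxes taken from the dump above, deliberately out of order.
SAMPLE = [
    {"top": 620, "left": 60, "width": 53, "height": 15, "font": 2, "data": "Thomas"},
    {"top": 599, "left": 92, "width": 4, "height": 15, "font": 2, "data": " "},
    {"top": 599, "left": 60, "width": 32, "height": 15, "font": 2, "data": "Boer"},
    {"top": 599, "left": 95, "width": 28, "height": 15, "font": 2, "data": "War"},
    {"top": 620, "left": 113, "width": 4, "height": 15, "font": 2, "data": " "},
    {"top": 620, "left": 117, "width": 59, "height": 15, "font": 2, "data": "returned"},
]

for line in words_to_lines(SAMPLE):
    print(line)
```

And that only recovers line order. It says nothing about paragraphs, columns, footnotes, headers, or bold/italics, which is where the guessing really starts going wrong.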

Any way you slice it, pulling formatted text out of a PDF is a big waste of time (which is why in most cases, it is easier/faster to just re-OCR the entire thing).

Perhaps you have more knowledge of tools though. If so, teach me, I would LOVE to be able to pull out data from PDFs much more efficiently! It would be AMAZING. And then get people to start doing the PDF workflow that actually allows this to be possible!

And the original InDesign/Quark files ALREADY have all that nice formatting information just sitting in there, so if you export directly from there, that will be MUCH cleaner than trying to work backwards from the PDF.

Similar thing with images from PDF. What I am saying is that not every PDF -> XYZ format conversion WILL pull out the image losslessly. Again, it all depends on how the conversion place you send it to does it. Perhaps they do it right, but in my experience, I have not seen that. Perhaps you have had better luck and sent it to a place that does it properly.

Which is why I settled on this method: you just send me the original images separately, and I work from those. No need to go through some hideous PDF middleman.

Quote:
Originally Posted by MaudlinHaus
And if you export from inDesign with no downsampling, a 3-inch, 300dpi image in indesign is a 3-inch, 300dpi image in PDF--there's no loss there. And as I said, from the PDF source file, using CSS/html to squish 2x pixels into a given screen pixel space (a 400 px wide image gets screen pixel width="200" for example), we get high quality results on the ipad screen (I'm a little afraid of what is going on on the various HD Kindles, but I like the iPad better as a high-density screen standard.)
Hmmm, again, any examples?

Now, if you don't like the specific downsampling on the devices, then the only possible solution is to downsample the images using an outside program, and insert the lower resolution image in the file.

For example, this downsampling talk reminded me a lot of what GrannyGrump does with high-quality line-drawings scanned from older books. For example:

https://www.mobileread.com/forums/sho...15#post2682815

These specific types of drawings downscale HORRIBLY due to the downscaling algorithm on most devices. So the best bet would be to manually downscale using other tools, which might have better algorithms for dealing with lines (Photoshop, GIMP, etc.). So maybe you just pick a decent target resolution, like 1024x1024.
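
If you go that route, the arithmetic for the target size is simple. A small Python sketch (the function name is made up, and the 1024 cap is just the example figure from above) that works out the dimensions to resize to while keeping the aspect ratio:

```python
def fit_within(width, height, max_side=1024):
    """Return (width, height) scaled down to fit inside a max_side square.

    Keeps the aspect ratio and never upscales. The actual resampling is left
    to whatever image tool you trust for line art.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


# A hypothetical 3000x2000 scan, and an image that is already small enough.
print(fit_within(3000, 2000))
print(fit_within(800, 600))
```

You would then feed those numbers into the resize dialog of Photoshop/GIMP (or whatever tool you use) and pick the resampling method that treats thin lines best.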

Quote:
Originally Posted by MaudlinHaus
The issue I'm seeing, as far as I can tell, is that by allowing the reader software (which is essentially a browser) to resize with HTML/CSS, you can end up tasking the software with rerendering a bunch of pixels into a really small space, so whereas I'm squishing a 400 px image into 200 screen pixels in iBooks and that looks good because the screen resolution is pretty high, it ends up looking pretty bad in ADE on an older monitor due to extreme downsampling and lower pixel per inch.
What other devices have you tested on? Any eink devices? If you have a problem with those screens, you are probably going to have a heart attack when you see it on an older EPUB reader.

Again, any examples would be helpful. I personally haven't seen any images that look TOO bad (besides images with text).

Last edited by Tex2002ans; 11-14-2014 at 07:59 PM.