10-08-2018, 03:29 PM   #94
Difflugia
Testate Amoeba
Quote:
Originally Posted by sealbeater
Anyway, here's my point. If I have the images, why would I bother to OCR or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just strip the images and reference them in HTML and compress them?
You could. In fact, I did something similar in this book that I included in the Mobileread library. Ereader software doesn't handle mixed Hebrew and English well, so I rendered the Hebrew as images. In the CSS, I linked the image size to the relative font size ("em") rather than a fixed size ("in" or "px") like so:

Code:
img.Hebrew
{
    display:inline-block;
    vertical-align:middle;
    height:1.3em;
}
The images are then scaled with the font size.

Unfortunately, it doesn't work with all ereader software, including some popular applications (neither Coolreader nor Moon+ displays it the way I intended). The only reason I did it in the first place is that the various ereader applications are even less consistent about rendering Hebrew text than they are about displaying images. Doing the same thing for English text sounds like an interesting exercise, but it's no easier or more practical than any other means of dealing with a PDF.
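
If you do want to try the "reference the images in HTML and zip them" idea, here is a rough Python sketch of the packaging step. It isn't how I built my book. The function name build_image_epub, the file layout under OEBPS/, the placeholder book ID, and the assumption that every page is a PNG with a plain ASCII filename are all mine, just for illustration. It also leaves out the NCX/nav table of contents, so a validator would likely complain. The one detail that really matters is that the "mimetype" entry has to be the first file in the archive and stored uncompressed.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{book_id}</dc:identifier>
  </metadata>
  <manifest>
{manifest}
  </manifest>
  <spine>
{spine}
  </spine>
</package>
"""

PAGE_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body><div><img src="{src}" alt="{title}" style="max-width:100%"/></div></body>
</html>
"""


def build_image_epub(pages, title="Scanned book", book_id="urn:uuid:example-id"):
    """Pack (filename, png_bytes) pairs into EPUB bytes, one image per page.

    Hypothetical helper: assumes PNG images and simple ASCII filenames.
    """
    manifest = []
    spine = []
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        for n, (image_name, image_bytes) in enumerate(pages, start=1):
            page_name = f"page{n:04d}.xhtml"
            zf.writestr(f"OEBPS/images/{image_name}", image_bytes)
            zf.writestr(
                f"OEBPS/{page_name}",
                PAGE_TEMPLATE.format(title=escape(f"Page {n}"),
                                     src=f"images/{image_name}"),
            )
            manifest.append(
                f'    <item id="img{n}" href="images/{image_name}" media-type="image/png"/>')
            manifest.append(
                f'    <item id="page{n}" href="{page_name}" media-type="application/xhtml+xml"/>')
            spine.append(f'    <itemref idref="page{n}"/>')
        zf.writestr(
            "OEBPS/content.opf",
            OPF_TEMPLATE.format(title=escape(title), book_id=escape(book_id),
                                manifest="\n".join(manifest),
                                spine="\n".join(spine)),
        )
    return buf.getvalue()
```

You'd call it with something like build_image_epub([("p1.png", data1), ("p2.png", data2)]) and write the returned bytes to a .epub file. Full-page images like this won't reflow or scale with the font size, which is the same limitation you already have with the PDF.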

If you're interested in PDF conversion/extraction as more than a thought experiment, you'll want the Adobe reference documents for both PostScript and PDF. The PDF Toolkit (pdftk) can be used to "uncompress" a PDF and make it more readable, but the result is still cryptic. A PDF can also be converted to PostScript, which is more readable, especially if you're trying to learn what's going on in a particular PDF. Just be aware that the conversion isn't always lossless (Ghostscript's "pdf2ps" and xpdf's "pdftops" don't preserve things like tables of contents, for example). Ghostscript and GSView will render both PostScript and PDF, and both have command consoles with decent error output, so you can play around.
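
For playing around, the three tools mentioned above can be driven from a small script. This is only a sketch: the helper names pdf_inspection_commands and run_available and the output filenames are made up for the example, and it assumes pdftk, pdf2ps, and pdftops are installed and on your PATH (anything missing is skipped).

```python
import shutil
import subprocess


def pdf_inspection_commands(pdf_path, stem):
    """Build the command lines for the three tools (hypothetical output names)."""
    return {
        # pdftk: rewrite the PDF with its streams uncompressed.
        "pdftk": ["pdftk", pdf_path, "output",
                  f"{stem}.uncompressed.pdf", "uncompress"],
        # Ghostscript's converter.
        "pdf2ps": ["pdf2ps", pdf_path, f"{stem}.gs.ps"],
        # xpdf's converter.
        "pdftops": ["pdftops", pdf_path, f"{stem}.xpdf.ps"],
    }


def run_available(commands):
    """Run each command whose executable is found on PATH; return the names run."""
    ran = []
    for name, argv in commands.items():
        if shutil.which(argv[0]) is None:
            continue  # tool not installed, skip it
        subprocess.run(argv, check=True)
        ran.append(name)
    return ran
```

Running run_available(pdf_inspection_commands("book.pdf", "book")) leaves you with an uncompressed PDF and up to two PostScript files to compare in a text editor.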