Quote:
Originally Posted by sealbeater
Anyway, here's my point. If I have the images, why would I bother to OCR or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just strip the images and reference them in HTML and compress them?
You could. In fact, I did something similar in this book that I included in the Mobileread library. Ereader software doesn't handle mixed Hebrew and English well, so I rendered the Hebrew as images. In the CSS, I tied the image height to the relative font size ("em") rather than a fixed size ("in" or "px"), like so:
Code:
img.Hebrew
{
    display: inline-block;   /* allow the image to take a height within a line of text */
    vertical-align: middle;  /* center it on the text line */
    height: 1.3em;           /* scale with the surrounding font size */
}
The images are then scaled with the font size.
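For context, the markup side is just an inline image carrying that class. This is a hypothetical snippet (the filename and alt text are made up, not from the actual book):

```html
<!-- Inline Hebrew rendered as an image; "heb-001.png" is a placeholder name -->
<p>The word <img class="Hebrew" src="images/heb-001.png" alt="shalom"/>
appears in the first verse.</p>
```

Because the CSS height is in em, the image grows and shrinks with whatever font size the reader selects.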
Unfortunately, it doesn't work with all ereader software, including some that's popular (neither Coolreader nor Moon+ displays it as I intended). The only reason I did it in the first place is that the various ereader applications are even less consistent about rendering Hebrew text than about displaying images. Doing the same thing for English text sounds like an interesting exercise, but it's no easier or more practical than any other means of dealing with a PDF.
If you're interested in PDF conversion/extraction as more than a thought experiment, you'll want the Adobe reference documents for both PostScript and PDF. The PDF Toolkit (pdftk) can be used to "uncompress" a PDF and make it more readable, though the result is still cryptic. A PDF can also be converted to PostScript, which is more readable, especially if you're trying to learn what's going on in a particular PDF. Just be aware that the conversion isn't always lossless (Ghostscript's "pdf2ps" and xpdf's "pdftops" don't preserve things like tables of contents, for example).
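As a concrete sketch of those steps (filenames are placeholders; this assumes pdftk, Ghostscript, and xpdf or poppler are installed):

```shell
# Rewrite a PDF with its streams uncompressed so the internals are readable
pdftk book.pdf output book-uncompressed.pdf uncompress

# Convert to PostScript for study; note that neither route preserves
# things like bookmarks / tables of contents
pdf2ps book.pdf book.ps        # Ghostscript's wrapper
pdftops book.pdf book-x.ps     # xpdf/poppler's converter
```

Open the uncompressed PDF or the .ps output in a text editor and you can trace the drawing operators directly.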
Ghostscript and GSView will render both PostScript and PDF, and they have command consoles with decent error output, so you can play around.
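For example, you can open a file interactively or rasterize a single page to see what a given PDF actually draws (output filename is arbitrary):

```shell
# Open an interactive Ghostscript session; errors appear on the console
gs book.pdf

# Render just page 1 to a 150-dpi PNG for inspection
gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r150 \
   -dFirstPage=1 -dLastPage=1 \
   -sOutputFile=page1.png book.pdf
```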