DNAML releases PDF to ePub - Page 7

sealbeater · 10-06-2018, 08:51 PM

Quote:

Originally Posted by Difflugia

I've attached two-page excerpts from three commercial PDF books that I've bought. You can decide whether or not they invalidate what you've said. In case anyone cares, I used The PDF Toolkit to extract pages from the larger documents.

I'll note that PDF fonts are not fixed. For example, the first page of the "Text only.pdf" file that I linked contains the Greek phrase, ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font includes different glyphs for certain kerned pairs ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy-paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste ability to such a document by adding the embedded text layer, even though there's already a "text" layer used to render the page.

Sorry for taking so long to respond.

I found your PDF samples very interesting. I've never before seen a PDF with both images and text in the wild. Interestingly, my normal go-to, "pdfimages", didn't work on any of them. It was only when I extracted to XML using pdftohtml that I realized any of them had images at all.

Anyway, here's my point. If I have the images, why would I bother to OCR or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just strip the images and reference them in HTML and compress them?

shalym · 10-07-2018, 07:58 PM

Quote:

Originally Posted by sealbeater

Sorry for taking so long to respond.

I found your PDF samples very interesting. I've never before seen a PDF with both images and text in the wild. Interestingly, my normal go-to, "pdfimages", didn't work on any of them. It was only when I extracted to XML using pdftohtml that I realized any of them had images at all.

Anyway, here's my point. If I have the images, why would I bother to OCR or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just strip the images and reference them in HTML and compress them?

You could...but then you couldn't change the font, or the font size, or use any of the other functions of epub. In other words, you may as well just leave it in pdf format.

Shari

Vroni · 10-08-2018, 09:07 AM

Quote:

Originally Posted by sealbeater

Anything that can be done manually can be scripted.

Well, not at all. Or better said, not yet.

If you want to decide whether a number in a text is a leftover page number or something that belongs to the text, you need contextual information. Just because it is a number, you can't simply delete it. Maybe it's a page number that needs to go away. Maybe a paragraph ends at that page number and the next paragraph has to start on its own. Maybe the page number split a paragraph, and after removing it the two pieces have to be joined into one paragraph. Or it's not a page number at all: it might be a year, a month, an age or whatever.

I really would like to see a script that can make such decisions on its own with an accuracy of, let's say, 95%.
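To make the page-number problem concrete, here is a naive Python sketch of one possible heuristic. It is only an illustration under stated assumptions: a line counts as a page number only if it consists of 1 to 4 digits and belongs to a run of values that each increase by exactly one. The function name `find_page_number_lines` and the `min_run` threshold are made up for this example. It does not attempt the harder decisions, such as whether the text on either side of a removed number should be joined into one paragraph.

```python
import re


def find_page_number_lines(lines, min_run=3):
    """Return indices of lines that look like running page numbers.

    Assumption (not a general solution): real page numbers appear alone on a
    line and count upward by one, while years or ages that happen to sit
    alone on a line do not form such a run.
    """
    # Candidate = a line containing nothing but 1-4 digits.
    candidates = [
        (index, int(text.strip()))
        for index, text in enumerate(lines)
        if re.fullmatch(r"\s*\d{1,4}\s*", text)
    ]

    best_chain = []
    chain_ending_at = {}  # value -> line indices of the longest run ending there
    for index, value in candidates:
        chain = chain_ending_at.get(value - 1, []) + [index]
        if len(chain) > len(chain_ending_at.get(value, [])):
            chain_ending_at[value] = chain
        if len(chain) > len(best_chain):
            best_chain = chain

    # A short run is more likely coincidence than pagination.
    return best_chain if len(best_chain) >= min_run else []


sample = [
    "He was born in",
    "1984",
    "and moved away.",
    "12",
    "The next page starts",
    "here and ends.",
    "13",
    "She was",
    "42",
    "years old.",
    "14",
]
print(find_page_number_lines(sample))
```

In this sample the lines holding 12, 13 and 14 are flagged, while 1984 and 42 are left alone. A book whose text happens to contain consecutive standalone numbers, or whose extraction drops a page number, would fool it, which is the kind of context problem described above.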

And this is only one of many issues you have when you try to make a good EPUB out of a PDF conversion.

As Darryl already mentioned, I have the same impression that you don't have any clue what PDF is. It's not a markup language. It does not distinguish between text in bold and text in bold that is a headline.

Quote:

Originally Posted by sealbeater

EPUB is just compressed HTML

It isn't. There are some other files around it. It is XHTML. And it allows only a subset of CSS 2.1, which makes it more complicated.
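For anyone wondering what the extra files look like, here is a minimal Python sketch that packs one page image into an EPUB-style zip. It assumes an EPUB 3 layout. The `OEBPS/` folder, the file names `page1.xhtml` and `page1.png`, the placeholder UUID and title, and the function name `build_epub` are all made up for illustration, and the result has not been run through a validator.

```python
import io
import zipfile

# Tells the reader where the package file lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Package file: metadata, a manifest of every file, and the reading order.
CONTENT_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="book-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Page images (placeholder title)</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2018-10-08T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="page1" href="page1.xhtml" media-type="application/xhtml+xml"/>
    <item id="img1" href="page1.png" media-type="image/png"/>
  </manifest>
  <spine>
    <itemref idref="page1"/>
  </spine>
</package>
"""

# Navigation document (table of contents).
NAV_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Contents</title></head>
  <body>
    <nav epub:type="toc"><ol><li><a href="page1.xhtml">Page 1</a></li></ol></nav>
  </body>
</html>
"""

# The content page is XHTML that simply references the page image.
PAGE_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Page 1</title></head>
  <body><p><img src="page1.png" alt="Scanned page 1"/></p></body>
</html>
"""


def build_epub(png_bytes):
    """Return the bytes of a one-page, image-only EPUB-style archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        for name, data in [
            ("META-INF/container.xml", CONTAINER_XML),
            ("OEBPS/content.opf", CONTENT_OPF),
            ("OEBPS/nav.xhtml", NAV_XHTML),
            ("OEBPS/page1.xhtml", PAGE_XHTML),
            ("OEBPS/page1.png", png_bytes),
        ]:
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```

Calling `build_epub(open("page1.png", "rb").read())` and writing the returned bytes to a `.epub` file would give an archive with this structure. Even for a single image, four files besides the image itself are involved, and none of the reflow features of EPUB apply to the picture.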

Difflugia · 10-08-2018, 03:29 PM

Quote:

Originally Posted by sealbeater

Anyway, here's my point. If I have the images, why would I bother to OCR or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just strip the images and reference them in HTML and compress them?

You could. In fact, I did something similar in this book that I included in the Mobileread library. Ereader software doesn't handle mixed Hebrew and English well, so I rendered the Hebrew as images. In the CSS, I linked the image size to the relative font size ("em") rather than a fixed size ("in" or "px") like so:

Code:

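img.Hebrew
{
    display:inline-block;
    vertical-align:middle;  /* roughly center the image on the surrounding text line */
    height:1.3em;           /* relative to the font size, so the image scales with the text */
}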

The images are then scaled with the font size.

Unfortunately, it doesn't work with all ereader software, including some that's popular (neither Coolreader nor Moon+ displays it how I intended). The only reason that I did it in the first place is that the various ereader applications are even less consistent about rendering Hebrew text than displaying images. Doing the same thing for English text sounds like an interesting exercise, but no easier or more practical than any other means of dealing with a PDF.

If you're interested in PDF conversion/extraction as more than a thought experiment, you'll want the Adobe reference documents for both PostScript and PDF. The PDF Toolkit can be used to "uncompress" a PDF and make it more readable, but it's cryptic even so. PDF can be converted to PostScript, which is more readable, especially if you're trying to learn what's going on in a particular PDF. Just be aware that the conversion isn't always lossless (Ghostscript's "pdf2ps" and xpdf's "pdftops" don't preserve things like tables of contents, for example). Ghostscript and GSView will render both PostScript and PDF and have command consoles with decent error output so you can play around.
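If you'd rather script those tools than type the commands by hand, something like the following Python sketch could work. It assumes `pdftk` and Ghostscript's `pdf2ps` are installed and on the PATH. The helper names are made up for this example; they only build the command lines, and nothing runs unless you call `run_if_available`.

```python
import shutil
import subprocess


def uncompress_cmd(src, dst):
    """pdftk command that rewrites src with its page streams uncompressed."""
    return ["pdftk", src, "output", dst, "uncompress"]


def extract_pages_cmd(src, dst, first, last):
    """pdftk command that copies pages first..last of src into dst."""
    return ["pdftk", src, "cat", f"{first}-{last}", "output", dst]


def to_postscript_cmd(src, dst):
    """Ghostscript pdf2ps command; things like the table of contents may be lost."""
    return ["pdf2ps", src, dst]


def run_if_available(cmd):
    """Run cmd if its program is installed; return True if it was run."""
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]} not found on PATH, skipping")
        return False
    subprocess.run(cmd, check=True)
    return True
```

For example, `run_if_available(uncompress_cmd("book.pdf", "book.clear.pdf"))` would produce a copy that is easier to read in a text editor, and `extract_pages_cmd("book.pdf", "excerpt.pdf", 1, 2)` builds the command for a two-page excerpt. The file names here are placeholders.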
