Old 10-06-2018, 08:51 PM   #91
sealbeater
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
Quote:
Originally Posted by Difflugia View Post
I've attached two-page excerpts from three commercial PDF books that I've bought. You can decide whether or not they invalidate what you've said. In case anyone cares, I used The PDF Toolkit to extract pages from the larger documents.

I'll note that PDF fonts are not fixed. For example, the first page of the "Text only.pdf" file that I linked contains the Greek phrase, ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font includes different glyphs for certain kerned pairs ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy-paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste ability to such a document by adding the embedded text layer, even though there's already a "text" layer used to render the page.

Sorry for taking so long to respond.

I found your PDF samples very interesting. I've never before seen a PDF with both images and text in the wild. Interestingly, my usual go-to, "pdfimages", didn't work on any of them. It was only when I extracted to XML using pdftohtml that I realized any of them had images at all.


Anyway, here's my point. If I have the images, why would I bother to OCR them or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just extract the images, reference them from HTML, and compress the result?
Old 10-07-2018, 07:58 PM   #92
shalym
Wizard
Posts: 3,032
Karma: 52740263
Join Date: Feb 2012
Location: New England
Device: PW 1, 2, 3, Voyage, Oasis 2 & 3, Fires, Aura HD, iPad
Quote:
Originally Posted by sealbeater View Post
Sorry for taking so long to respond.

I found your PDF samples very interesting. I've never before seen a PDF with both images and text in the wild. Interestingly, my usual go-to, "pdfimages", didn't work on any of them. It was only when I extracted to XML using pdftohtml that I realized any of them had images at all.


Anyway, here's my point. If I have the images, why would I bother to OCR them or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just extract the images, reference them from HTML, and compress the result?
You could... but then you couldn't change the font or the font size, or use any of the other features of EPUB. In other words, you may as well just leave it in PDF format.

Shari
Old 10-08-2018, 09:07 AM   #93
Vroni
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
Quote:
Originally Posted by sealbeater View Post
Anything that can be done manually can be scripted.
Well, not quite. Or, better said, not yet.

If you want to decide whether a number in a text is a leftover page number or something that belongs to the text, you need contextual information. You can't delete it just because it is a number. Maybe it's a page number that needs to go away. Maybe a paragraph ends at that page number and the next paragraph has to start on its own. Maybe the page number split a paragraph, and after removing it the two pieces have to be joined into one paragraph. Or it's not a page number at all: it might be a year, a month, an age, or whatever.

I would really like to see a script that can make such decisions on its own with an accuracy of, let's say, 95%.
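
To make the difficulty concrete, here is a minimal Python sketch of one contextual heuristic, not a solution. It assumes each page arrives as a separate string, and that page numbers sit alone on the first or last non-empty line and increase by one per page. The function names `find_page_number` and `strip_page_numbers` are hypothetical. It makes no attempt to rejoin paragraphs that were split across pages, and nothing here suggests it would reach the accuracy asked for above.

```python
import re

# A line that contains nothing but a short run of digits.
BARE_NUMBER = re.compile(r"^\s*(\d{1,4})\s*$")


def find_page_number(lines):
    """Return (line_index, value) for a bare number on the last or first
    non-empty line of a page, or None if there is no such line."""
    non_empty = [i for i, line in enumerate(lines) if line.strip()]
    if not non_empty:
        return None
    for i in (non_empty[-1], non_empty[0]):
        match = BARE_NUMBER.match(lines[i])
        if match:
            return i, int(match.group(1))
    return None


def strip_page_numbers(pages):
    """Remove candidate page numbers, but only when they continue the
    running sequence. Assumption: numbering increases by one per page.
    Known weakness: the very first candidate is accepted unchecked."""
    cleaned = []
    expected = None
    for text in pages:
        lines = text.split("\n")
        hit = find_page_number(lines)
        if hit is not None:
            index, value = hit
            if expected is None or value == expected:
                del lines[index]
                expected = value
        cleaned.append("\n".join(lines))
        if expected is not None:
            expected += 1  # pages without a printed number still count
    return cleaned


pages = [
    "It happened in\n1984\nand nobody noticed.\n12",
    "The next chapter\nbegins here.\n13",
    "He was born in\n1950",
]
for page in strip_page_numbers(pages):
    print(repr(page))
```

The year in the middle of the first page and the trailing "1950" on the third survive only because the sequence check supplies context. A book with roman-numeral front matter or restarted numbering would defeat it, and so would a year that happens to match the expected page number.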

And this is only one of many issues you face when you try to make a good EPUB out of a PDF conversion.

As Darryl already mentioned, I have the same impression: you don't have a clue what PDF is. It's not a markup language. It does not distinguish between text in bold and text in bold that is a headline.

Quote:
Originally Posted by sealbeater View Post
EPUB is just compressed HTML
It isn't. There are some other files around it. The content is XHTML, and it allows only a subset of CSS 2.1, which makes it more complicated.
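
As a rough illustration of those extra files, here is a Python sketch that writes the skeleton of an EPUB container with the standard `zipfile` module. It relies on two details of the container format: the `mimetype` entry comes first and is stored uncompressed, and `META-INF/container.xml` points to the package document. The `OEBPS/` paths and the function name `build_epub_skeleton` are assumptions made for the example. The package document and the XHTML are passed in by the caller, so the output is not claimed to pass validation.

```python
import io
import zipfile

# Points the reading system at the package (OPF) document.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub_skeleton(package_opf, chapter_xhtml):
    """Return the bytes of a zip archive laid out like an EPUB container.
    The OEBPS/ file names are illustrative choices, not requirements."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Must be the first entry and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("OEBPS/content.opf", package_opf,
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("OEBPS/chapter1.xhtml", chapter_xhtml,
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Placeholder documents only; a real book needs full metadata,
# a manifest, a spine, and a navigation file.
data = build_epub_skeleton("<package/>", "<html/>")
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    print(archive.namelist())
```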

Last edited by Vroni; 10-09-2018 at 04:31 AM. Reason: typos
Old 10-08-2018, 03:29 PM   #94
Difflugia
Testate Amoeba
Posts: 3,049
Karma: 27300000
Join Date: Sep 2012
Device: Many Android devices, Kindle 2, Toshiba e755 PocketPC
Quote:
Originally Posted by sealbeater View Post
Anyway, here's my point. If I have the images, why would I bother to OCR them or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just extract the images, reference them from HTML, and compress the result?
You could. In fact, I did something similar in this book that I included in the MobileRead library. Ereader software doesn't handle mixed Hebrew and English well, so I rendered the Hebrew as images. In the CSS, I tied the image size to the relative font size ("em") rather than a fixed size ("in" or "px"), like so:

Code:
img.Hebrew
{
    display:inline-block;
    vertical-align:middle;
    height:1.3em;
}
The images are then scaled with the font size.

Unfortunately, it doesn't work with all ereader software, including some popular applications (neither Coolreader nor Moon+ displays it the way I intended). The only reason I did it in the first place is that the various ereader applications are even less consistent about rendering Hebrew text than about displaying images. Doing the same thing for English text sounds like an interesting exercise, but it is no easier or more practical than any other means of dealing with a PDF.

If you're interested in PDF conversion/extraction as more than a thought experiment, you'll want the Adobe reference documents for both PostScript and PDF. The PDF Toolkit can be used to "uncompress" a PDF and make it more readable, but it's cryptic even so. PDF can be converted to PostScript which is more readable, especially if you're trying to learn what's going on in a particular PDF. Just be aware that the conversion isn't always lossless (Ghostscript's "pdf2ps" and xpdf's "pdftops" don't preserve things like tables of contents, for example). Ghostscript and GSView will render both PostScript and PDF and have command consoles with decent error output so you can play around.
All times are GMT -4.