10-06-2018, 08:51 PM | #91 | |
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
|
Quote:
Sorry for taking so long to respond. I found your PDF samples very interesting. I've never before seen a PDF with both images and text in the wild. Interestingly, my normal go-to, "pdfimages", didn't work on any of them. It was only when I extracted to XML using pdftohtml that I realized any of them had images at all. Anyway, here's my point. If I have the images, why would I bother to OCR or convert them to text? I have the images. From what I understand, EPUB is just compressed HTML. Why couldn't I just strip the images, reference them in HTML, and compress them? |
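For what it's worth, the image-only approach asked about here can be sketched in a few lines of Python. This is a rough sketch, not a validated converter: the function name `write_image_epub` and the `OEBPS/` layout are illustrative choices, it assumes one already-extracted image per page, and it leaves out the table-of-contents file (toc.ncx in EPUB 2) that a checker such as epubcheck would ask for. It mainly shows that an EPUB is a zip with a few required files around the XHTML, with the `mimetype` entry first and stored uncompressed.

```python
import uuid
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def page_xhtml(image_name, number):
    """One XHTML page that does nothing but show one page image."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Page {number}</title></head>
<body><div><img src="images/{image_name}" alt="Page {number}"/></div></body>
</html>
"""


def build_opf(title, image_names):
    """Package file: lists every file (manifest) and the reading order (spine)."""
    manifest, spine = [], []
    for i, name in enumerate(image_names, start=1):
        media = "image/png" if name.lower().endswith(".png") else "image/jpeg"
        manifest.append(f'<item id="img{i}" href="images/{name}" media-type="{media}"/>')
        manifest.append(f'<item id="page{i}" href="page{i}.xhtml" media-type="application/xhtml+xml"/>')
        spine.append(f'<itemref idref="page{i}"/>')
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">urn:uuid:{uuid.uuid4()}</dc:identifier>
  </metadata>
  <manifest>{''.join(manifest)}</manifest>
  <spine>{''.join(spine)}</spine>
</package>
"""


def write_image_epub(out, title, images):
    """Write an image-only EPUB. `images` maps file name -> image bytes, in page order."""
    names = list(images)
    with zipfile.ZipFile(out, "w") as zf:
        # The mimetype entry must be the first one and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/content.opf", build_opf(title, names), compress_type=zipfile.ZIP_DEFLATED)
        for i, name in enumerate(names, start=1):
            zf.writestr(f"OEBPS/page{i}.xhtml", page_xhtml(name, i), compress_type=zipfile.ZIP_DEFLATED)
            # Image data is already compressed, so store it as-is.
            zf.writestr(f"OEBPS/images/{name}", images[name], compress_type=zipfile.ZIP_STORED)
```

The result is a picture book of the PDF pages: it opens, but the text is not searchable, does not reflow, and cannot be resized, so it reads no better than the PDF did.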
|
10-07-2018, 07:58 PM | #92 | |
Wizard
Posts: 3,032
Karma: 52740263
Join Date: Feb 2012
Location: New England
Device: PW 1, 2, 3, Voyage, Oasis 2 & 3, Fires, Aura HD, iPad
|
Quote:
Shari |
|
10-08-2018, 09:07 AM | #93 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
Well, not at all. Or better said, not yet.
If you want to decide whether a number in a text is a leftover page number or something that belongs to the text, you need contextual information. Just because it is a number, you can't simply delete it. Maybe it's a page number that needs to go away. Maybe a paragraph ends with that page number and the next paragraph has to start on its own. Maybe the page number split a paragraph, and after removing it the two pieces have to be joined into one paragraph. Or it's not a page number at all; it might be a year, a month, an age, or whatever. I really would like to see a script that can make such decisions on its own with an accuracy of, let's say, 95%. And this is only one of many issues you have when you try to make a good EPUB out of a PDF conversion.
As Darryl already mentioned: I have the same impression, that you don't have any clue what PDF is. It's not a markup language. It does not distinguish between text in bold and text in bold that is a headline.
And EPUB isn't just compressed HTML. There are some other files around it. It is XHTML, and it allows only a subset of CSS 2.1, which makes it more complicated.
Last edited by Vroni; 10-09-2018 at 04:31 AM. Reason: typos |
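To make the page-number problem concrete, here is a naive Python sketch of the kind of heuristic a script would need. Everything in it is an assumption for illustration: the function name `strip_page_numbers`, the idea that the extracted text arrives as a list of text blocks, that page numbers stand alone in a block, and that they count up by one. It treats a bare number as a page number only if it matches the next expected page, and rejoins the surrounding blocks only if the text before it does not end a sentence. Real documents break every one of these assumptions sooner or later, which is rather the point.

```python
import re

# A block consisting of nothing but a short number.
STANDALONE_NUMBER = re.compile(r"^\s*(\d{1,4})\s*$")
# Characters that suggest the preceding paragraph was complete.
SENTENCE_END = (".", "!", "?", '"', ":")


def strip_page_numbers(blocks, first_page=1):
    """Drop blocks that look like running page numbers and rejoin split paragraphs."""
    out = []
    expected = first_page
    join_next = False
    for block in blocks:
        match = STANDALONE_NUMBER.match(block)
        if match and int(match.group(1)) == expected:
            expected += 1
            # If the text before the page number stops mid-sentence,
            # assume the page break split a paragraph in two.
            join_next = bool(out) and not out[-1].rstrip().endswith(SENTENCE_END)
            continue
        if join_next:
            out[-1] = out[-1].rstrip() + " " + block.lstrip()
        else:
            out.append(block)
        join_next = False
    return out


sample = [
    "He was born in",
    "1",
    "1984 and grew up nearby.",
    "1984",
    "A new paragraph.",
    "2",
    "Another one.",
]
print(strip_page_numbers(sample))
```

In the sample, the "1" and "2" are removed and the first paragraph is rejoined, while the standalone "1984" survives only because it happens not to equal the expected page number. A book that reaches page 1984, uses roman numerals, or skips numbers on chapter openings would defeat it.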
10-08-2018, 03:29 PM | #94 | |
Testate Amoeba
Posts: 3,049
Karma: 27300000
Join Date: Sep 2012
Device: Many Android devices, Kindle 2, Toshiba e755 PocketPC
|
Quote:
Code:
img.Hebrew { display:inline-block; vertical-align:middle; height:1.3em; }
Unfortunately, it doesn't work with all ereader software, including some that's popular (neither Coolreader nor Moon+ displays it how I intended). The only reason that I did it in the first place is that the various ereader applications are even less consistent about rendering Hebrew text than displaying images. Doing the same thing for English text sounds like an interesting exercise, but no easier or more practical than any other means of dealing with a PDF.
If you're interested in PDF conversion/extraction as more than a thought experiment, you'll want the Adobe reference documents for both PostScript and PDF. The PDF Toolkit can be used to "uncompress" a PDF and make it more readable, but it's cryptic even so. PDF can be converted to PostScript, which is more readable, especially if you're trying to learn what's going on in a particular PDF. Just be aware that the conversion isn't always lossless (Ghostscript's "pdf2ps" and xpdf's "pdftops" don't preserve things like tables of contents, for example). Ghostscript and GSView will render both PostScript and PDF and have command consoles with decent error output so you can play around. |
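If it helps, the tool invocations mentioned above can be wrapped in a small Python helper. This is a sketch under assumptions: the helper names and the file names `book.pdf`, `book-plain.pdf`, and `book.ps` are made up, and it only runs a command if the tool is actually found on the PATH. The command lines themselves are the usual forms: `pdftk IN output OUT uncompress`, `pdf2ps IN OUT`, and `pdftops IN OUT`.

```python
import shutil
import subprocess


def uncompress_cmd(src, dst):
    """PDF Toolkit: rewrite the PDF with its streams uncompressed."""
    return ["pdftk", src, "output", dst, "uncompress"]


def to_postscript_cmd(src, dst, tool="pdf2ps"):
    """Ghostscript's pdf2ps or xpdf's pdftops: convert PDF to PostScript."""
    if tool not in ("pdf2ps", "pdftops"):
        raise ValueError(f"unsupported tool: {tool}")
    return [tool, src, dst]


def run_if_available(cmd):
    """Run the command only if its executable is installed; report whether it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in (
        uncompress_cmd("book.pdf", "book-plain.pdf"),
        to_postscript_cmd("book.pdf", "book.ps"),
    ):
        print(" ".join(cmd), "->", "ran" if run_if_available(cmd) else "tool not found")
```

The uncompressed PDF can then be opened in a text editor alongside the Adobe reference, which is about the quickest way to see how little structure the file actually carries.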
|
|