DNAML releases PDF to ePub - Page 6

John F · 09-08-2018, 05:36 AM

Quote:

Originally Posted by sealbeater

Read the thread. As another person did, you are making the mistake of mixing my original words with my responses to others' words.

If I say it would take me 20 minutes, I've already stated in the thread that that was hyperbole representing the effort needed.

When someone says:

I naturally responded to this that I don't have a couple of days' worth of time.

Reading comprehension. It would serve you well.

Let's go back to your reply,

Quote:

Originally Posted by sealbeater

The suggestion to drop all of my current projects, including my day job, was the respondent's, not mine.

As for scripting it out in 20 minutes, I don't know *exactly* how long it would take for me to do it but I don't think it would be overly difficult. Take the 20 minute timeframe as an indicator of the relative difficulty of the problem, not an estimation of exact time spent.

Most of it would be generating the epub file and how best to output the pdf for cleaning.

Nowhere in there do you say it was hyperbole. So you say it was level of effort, and you don't know *exactly*. When you make an estimation "of effort" using "time", you should try to give some indication of how much you would be off. If it is a pretty close estimation, then it would take 20 man-minutes; if you are off by a factor of 2, then 40 man-minutes; a man-day would be off by a factor of 72. Don't blame other people for challenging you for your Humpty Dumpty definitions of "effort" and "time".

DuckieTigger · 09-08-2018, 06:16 AM

Quote:

Originally Posted by sealbeater

No assumptions being made, pdfs are either one or the other and I don't disagree, you would have to do a 2 stage run on the pdf to get the best automatic result. However, I've never seen a pdf that had both.

No, they are not either/or. Even the PDF that contains text has full page images. You simply create them by printing the PDF into individual images for each page. OCR has a better chance to succeed than possibly horribly garbled text inside that won't tell you where the header is, for example.

Agama · 09-08-2018, 06:22 AM

When I said a couple of days I was just trying to be generous and not hold Sealbleater to a literal 20 mins. I would be impressed if this work could be achieved in 2 days, never mind 20 minutes.

Agama · 09-08-2018, 12:54 PM

End of feeding: sorry Sealbleater, no more fish!

sealbeater · 09-08-2018, 03:10 PM

Quote:

Originally Posted by John F

Let's go back to your reply,

Nowhere in there do you say it was hyperbole.

I didn't think it was necessary. I see I was mistaken.

Quote:

Originally Posted by John F

So you say it was level of effort, and you don't know *exactly*. When you make an estimation "of effort" using "time", you should try to give some indication of how much you would be off.

Ok...say...an hour.

Quote:

Originally Posted by John F

If it is a pretty close estimation, then it would take 20 man-minutes; if you are off by a factor of 2, then 40 man-minutes; a man-day would be off by a factor of 72. Don't blame other people for challenging you for your Humpty Dumpty definitions of "effort" and "time".

LOL. Whatever you have to tell yourself champ.

sealbeater · 09-08-2018, 03:11 PM

Quote:

Originally Posted by DuckieTigger

No, they are not either/or. Even the PDF that contains text has full page images. You simply create them by printing the PDF into individual images for each page. OCR has a better chance to succeed than possibly horribly garbled text inside that won't tell you where the header is, for example.

Yes, they are. I have never seen a PDF that contains both txt and full page images. Please, attach one so we can view it.

OCR is the last thing you want, not the first.

sealbeater · 09-08-2018, 03:11 PM

Quote:

Originally Posted by Agama

End of feeding: sorry Sealbleater, no more fish!

It's sad that humans always have to try to dehumanize others.

j.p.s · 09-08-2018, 05:51 PM

Quote:

Originally Posted by DuckieTigger

No, they are not either/or. Even the PDF that contains text has full page images. You simply create them by printing the PDF into individual images for each page. OCR has a better chance to succeed than possibly horribly garbled text inside that won't tell you where the header is, for example.

PDF is based on the PostScript programming language.

PDF documents that have text that can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel based image. The source of the text layer may be generated from the source text or from OCR of a pixel based image. Lots of strange errors that are not in the rendered page are evidence that the text layer is OCR based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF to text application could use the location information as formatting hints and not just extract the raw text.

There is no requirement that a text layer be present and there is no requirement that a PDF document have any pixel images at all or a single text character, and it can have any mixture of them.

Pixel images in a PDF can usually be extracted and might be JPEG, JPEG2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector based and can be rendered quite large with high quality and might require very little storage space.
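
To make the "location information as formatting hints" idea concrete, here is a minimal sketch. It assumes the words and their page coordinates have already been pulled out of the text layer by some tool; the `Word` type, the tolerance values, and the sample coordinates are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word, in points (hypothetical units)
    y: float  # vertical position of the word, in points


def group_into_lines(words, y_tolerance=2.0):
    """Group words whose vertical positions are close into the same line."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order each line left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def render(lines, indent_threshold=10.0):
    """Join lines into text, treating an indented line as a new paragraph."""
    margin = min(line[0].x for line in lines)
    out = []
    for line in lines:
        if out and line[0].x - margin >= indent_threshold:
            out.append("")  # blank line marks a paragraph break
        out.append(" ".join(w.text for w in line))
    return "\n".join(out)


# Made-up sample: two lines of one paragraph, then an indented line.
sample = [
    Word("quick", 110, 100.0), Word("The", 90, 100.5),
    Word("fox", 72, 112.0), Word("jumps", 95, 112.0),
    Word("Next", 90, 124.0), Word("para", 115, 124.0),
]
text = render(group_into_lines(sample))
```

A real converter would need far more than this (columns, headers, hyphenation), so treat it as a sketch of the principle only.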

j.p.s · 09-08-2018, 06:06 PM

Quote:

Originally Posted by sealbeater

Yes, they are. I have never seen a PDF that contains both txt and full page images. Please, attach one so we can view it.

OCR is the last thing you want, not the first.

A large fraction of the PDF books at archive.org consist of high resolution (usually 300 dpi or 600 dpi) full page images with a text layer riddled with OCR errors. They are too large to attach, but anyone can download and view them. They are examples of books with hundreds of pages in which every page is a full page image, and most pages have text that can be extracted with the usual PDF to text tools and can be highlighted, copied, and pasted.

I have no idea of the total number of such books, but there are quite a few, and archive.org is not the only source of such books.
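
As a rough sketch of how a converter might tell these cases apart, assume some earlier step has already measured, for each page, how many characters the text layer holds and what fraction of the page area is covered by a pixel image. Both inputs, the 0.9 cutoff, and the labels are assumptions made up for this example, not the output of any particular tool.

```python
def classify_page(text_chars: int, image_coverage: float) -> str:
    """Label a page from two hypothetical measurements.

    text_chars: number of characters found in the page's text layer.
    image_coverage: fraction (0.0 to 1.0) of the page covered by a pixel image.
    """
    has_text = text_chars > 0
    full_page_image = image_coverage >= 0.9  # assumed cutoff
    if has_text and full_page_image:
        return "full page image with text layer"
    if full_page_image:
        return "full page image only"
    if has_text:
        return "text, possibly with illustrations"
    return "no text and no full page image"


# Made-up measurements for three pages.
pages = [(1800, 1.0), (0, 1.0), (2100, 0.2)]
labels = [classify_page(chars, coverage) for chars, coverage in pages]
```

The point is only that "text" and "full page image" are independent properties of a page, so all four combinations have to be handled.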

sealbeater · 09-08-2018, 06:39 PM

Quote:

Originally Posted by j.p.s

A large fraction of the PDF books at archive.org consist of high resolution (usually 300 dpi or 600 dpi) full page images with a text layer riddled with OCR errors.

I'll take a look but I've never yet run into one that had both.

DuckieTigger · 09-08-2018, 08:18 PM

Quote:

Originally Posted by j.p.s

PDF is based on the PostScript programming language.

PDF documents that have text that can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel based image. The source of the text layer may be generated from the source text or from OCR of a pixel based image. Lots of strange errors that are not in the rendered page are evidence that the text layer is OCR based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF to text application could use the location information as formatting hints and not just extract the raw text.

There is no requirement that a text layer be present and there is no requirement that a PDF document have any pixel images at all or a single text character, and it can have any mixture of them.

Pixel images in a PDF can usually be extracted and might be JPEG, JPEG2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector based and can be rendered quite large with high quality and might require very little storage space.

Aye, and I didn't say that there is an embedded image file for the full page. Just that they have it. What I meant is that all the information is inside. I even mentioned that to extract the full page images you can simply print the PDF file. Printing will render out a bitmap at the specified resolution which can be redirected and converted into the correct input format for the OCR software.
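
The render-then-OCR route described here could be scripted roughly as follows. This is a sketch that assumes Poppler's `pdftoppm` and the `tesseract` command-line program are installed and on the PATH; the function names, the 300 dpi default, and the `page` file prefix are choices made for this example.

```python
import subprocess
from pathlib import Path


def render_command(pdf, out_prefix, dpi=300):
    """Build the pdftoppm call that renders every page to a PNG bitmap."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_command(image, out_base, lang="eng"):
    """Build the tesseract call; tesseract writes its text to out_base + '.txt'."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pdf(pdf, workdir, dpi=300):
    """Render each page to an image, OCR it, and return the combined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, workdir / "page", dpi), check=True)
    texts = []
    # pdftoppm names its output page-1.png, page-2.png, ... (zero padded).
    for image in sorted(workdir.glob("page-*.png")):
        base = image.with_suffix("")
        subprocess.run(ocr_command(image, base), check=True)
        texts.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("book.pdf", "work")` would run the whole chain. The OCR output would still need the same cleanup as any other extracted text.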

murraypaul · 09-09-2018, 04:20 PM

[deleted]

Rand Brittain · 09-15-2018, 09:19 PM

What do people recommend these days to do smart extraction of the text of a non-scanned PDF into HTML or EPUB?

DNSB · 09-15-2018, 09:25 PM

Quote:

Originally Posted by Rand Brittain

What do people recommend these days to do smart extraction of the text of a non-scanned PDF into HTML or EPUB?

Personally, I've been using calibre with heuristics enabled and then editing the resulting epub in Sigil.
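
For anyone who prefers to script the calibre step, a minimal sketch follows. It assumes calibre's `ebook-convert` command-line tool is installed and on the PATH; `--enable-heuristics` is, as far as I know, the switch for heuristic processing, and the function names are made up for this example. The Sigil cleanup afterwards stays a manual job.

```python
import shutil
import subprocess


def calibre_command(pdf, epub):
    """Build an ebook-convert call with heuristic processing turned on."""
    return ["ebook-convert", str(pdf), str(epub), "--enable-heuristics"]


def convert(pdf, epub):
    """Run the conversion if calibre's command-line tool can be found."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is calibre installed?")
    subprocess.run(calibre_command(pdf, epub), check=True)
```

Running `convert("book.pdf", "book.epub")` produces the EPUB to open in Sigil.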

Difflugia · 09-17-2018, 02:31 PM

Quote:

Originally Posted by sealbeater

Yes, they are. I have never seen a PDF that contains both txt and full page images. Please, attach one so we can view it.

OCR is the last thing you want, not the first.

I've attached two-page excerpts from three commercial PDF books that I've bought. You can decide whether or not they invalidate what you've said. In case anyone cares, I used The PDF Toolkit to extract pages from the larger documents.

I'll note that PDF fonts are not fixed. For example, the first page of the "Text only.pdf" file that I linked contains the Greek phrase, ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font includes different glyphs for certain kerned pairs ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy-paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste ability to such a document by adding the embedded text layer, even though there's already a "text" layer used to render the page.
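
A small sketch of how a script might cope with both problems, using only Python's standard `unicodedata` module. It folds ligature characters such as "ﬀ" back into plain letters, and it flags extracted text whose letters are mostly in the wrong script as a candidate for OCR. The function names and the 0.5 threshold are assumptions for this example, and the check only works when you already know which script to expect.

```python
import unicodedata


def normalize_ligatures(text):
    """Fold compatibility characters such as the 'ff' ligature into plain letters.

    NFKC also folds other compatibility characters (superscripts, for
    instance), which may or may not be wanted.
    """
    return unicodedata.normalize("NFKC", text)


def script_ratio(text, script):
    """Fraction of the letters in text whose Unicode name starts with script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


def looks_remapped(text, expected_script, threshold=0.5):
    """Guess that the font is remapped if too few letters are in the expected script."""
    return script_ratio(text, expected_script) < threshold


rendered = "ὁ υἱὸς τοῦ ἀνθρώπου"            # what the page shows
extracted = 'o" yi"oÁq toyÄ a! nurwpoy'      # what copy/paste returns
needs_ocr = looks_remapped(extracted, "GREEK")
```

A remapped English font would slip past this check, since its garbage is still Latin letters; a dictionary lookup would be needed there.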
