Old 09-08-2018, 05:36 AM   #76
John F
Grand Sorcerer
Posts: 7,149
Karma: 63458771
Join Date: Feb 2009
Device: Kobo Glo HD
Quote:
Originally Posted by sealbeater View Post
Read the thread. As another person did, you are making the mistake of mixing my original words with my responses to others' words.

If I say it would take me 20 minutes, I've already stated in the thread that that was hyperbole representing the effort needed.

When someone says:



I naturally responded that I don't have a couple of days' worth of time in response to this.


Reading comprehension. It would serve you well.
Let's go back to your reply,

Quote:
Originally Posted by sealbeater View Post
The suggestion to drop all of my current projects, including my day job, was the respondent's, not mine.

As for scripting it out in 20 minutes, I don't know *exactly* how long it would take for me to do it but I don't think it would be overly difficult. Take the 20 minute timeframe as an indicator of the relative difficulty of the problem, not an estimation of exact time spent.

Most of it would be generating the epub file and how best to output the pdf for cleaning.
Nowhere in there do you say it was hyperbole. So you say it was a level of effort, and you don't know *exactly*. When you make an estimate "of effort" using "time", you should try to give some indication of how far off you might be. If it's a pretty close estimate, then it would take 20 man-minutes; if you are off by a factor of 2, then 40 man-minutes; a man-day would be off by a factor of 72. Don't blame other people for challenging you over your Humpty Dumpty definitions of "effort" and "time".
Old 09-08-2018, 06:16 AM   #77
DuckieTigger
Wizard
Posts: 4,742
Karma: 246906703
Join Date: Dec 2011
Location: USA
Device: Oasis 3, Oasis 2, PW3, PW1, KT
Quote:
Originally Posted by sealbeater View Post
No assumptions being made; PDFs are either one or the other. I don't disagree that you would have to do a two-stage run on the PDF to get the best automatic result. However, I've never seen a PDF that had both.
No, they are not either/or. Even a PDF that contains text has full-page images: you simply create them by printing the PDF into individual images, one for each page. OCR on those has a better chance of succeeding than working from possibly horribly garbled embedded text that won't tell you where the header is, for example.

Last edited by DuckieTigger; 09-08-2018 at 06:19 AM.
Old 09-08-2018, 06:22 AM   #78
Agama
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
When I said a couple of days I was just trying to be generous and not hold Sealbleater to a literal 20 mins. I would be impressed if this work could be achieved in 2 days, never mind 20 minutes.
Old 09-08-2018, 12:54 PM   #79
Agama
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
End of feeding: sorry Sealbleater, no more fish!

Last edited by Agama; 09-08-2018 at 01:27 PM.
Old 09-08-2018, 03:10 PM   #80
sealbeater
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
Quote:
Originally Posted by John F View Post
Let's go back to your reply,


Nowhere in there do you say it was hyperbole.
I didn't think it was necessary. I see I was mistaken.

Quote:
Originally Posted by John F View Post
So you say it was level of effort, and you don't know *exactly*. When you make an estimation "of effort" using "time", you should try to give some indication of how much you would be off.
Ok...say...an hour.


Quote:
Originally Posted by John F View Post
If it's a pretty close estimate, then it would take 20 man-minutes; if you are off by a factor of 2, then 40 man-minutes; a man-day would be off by a factor of 72. Don't blame other people for challenging you over your Humpty Dumpty definitions of "effort" and "time".
LOL. Whatever you have to tell yourself champ.
Old 09-08-2018, 03:11 PM   #81
sealbeater
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
Quote:
Originally Posted by DuckieTigger View Post
No, they are not either/or. Even a PDF that contains text has full-page images: you simply create them by printing the PDF into individual images, one for each page. OCR on those has a better chance of succeeding than working from possibly horribly garbled embedded text that won't tell you where the header is, for example.
Yes, they are. I have never seen a PDF that contains both text and full-page images. Please attach one so we can view it.

OCR is the last thing you want, not the first.
Old 09-08-2018, 03:11 PM   #82
sealbeater
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
Quote:
Originally Posted by Agama View Post
End of feeding: sorry Sealbleater, no more fish!
It's sad that humans always have to try to dehumanize others.
Old 09-08-2018, 05:51 PM   #83
j.p.s
Grand Sorcerer
Posts: 5,275
Karma: 98804578
Join Date: Apr 2011
Device: pb360
Quote:
Originally Posted by DuckieTigger View Post
No, they are not either/or. Even a PDF that contains text has full-page images: you simply create them by printing the PDF into individual images, one for each page. OCR on those has a better chance of succeeding than working from possibly horribly garbled embedded text that won't tell you where the header is, for example.
PDF is based on the PostScript programming language.

PDF documents that have text that can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel-based image. The text layer may be generated from the source text or from OCR of a pixel-based image. Lots of strange errors that are not in the rendered page are evidence that the text layer is OCR-based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF-to-text application could use the location information as formatting hints and not just extract the raw text.

There is no requirement that a text layer be present, and there is no requirement that a PDF document have any pixel images at all or a single text character; it can have any mixture of them.

Pixel images in a PDF can usually be extracted and might be JPEG, JPEG2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector-based and can be rendered quite large with high quality and might require very little storage space.
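
To make the "location information as formatting hints" idea concrete, here is a minimal sketch in Python. It assumes some extractor has already handed back each word with its left edge and its distance from the top of the page; the `Word` record and both thresholds are made up for illustration and are not any particular tool's API. It rebuilds lines from the positions and starts a new paragraph wherever the vertical gap is clearly larger than the usual line spacing.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    # Hypothetical record: a real extractor would supply these values.
    text: str
    x0: float   # left edge of the word
    top: float  # distance from the top of the page


def words_to_paragraphs(words, line_tol=2.0, para_gap=1.5):
    """Group words into lines by vertical position, then split
    paragraphs where the gap exceeds para_gap times the median gap."""
    if not words:
        return []
    ordered = sorted(words, key=lambda w: (w.top, w.x0))

    # Words whose tops are within line_tol of the line's first word
    # are treated as belonging to the same line.
    lines = []
    for w in ordered:
        if lines and abs(w.top - lines[-1][0].top) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    for line in lines:
        line.sort(key=lambda w: w.x0)  # restore reading order

    # Vertical distance between consecutive lines.
    gaps = [b[0].top - a[0].top for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0.0

    paragraphs = [[lines[0]]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > para_gap * typical:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return [" ".join(w.text for line in p for w in line) for p in paragraphs]
```

The same positions could in principle hint at headers, indents, or columns, but that needs per-document tuning, so treat this as a starting point only.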
Old 09-08-2018, 06:06 PM   #84
j.p.s
Grand Sorcerer
Posts: 5,275
Karma: 98804578
Join Date: Apr 2011
Device: pb360
Quote:
Originally Posted by sealbeater View Post
Yes, they are. I have never seen a PDF that contains both text and full-page images. Please attach one so we can view it.

OCR is the last thing you want, not the first.
A large fraction of the PDF books at archive.org consist of high-resolution (usually 300 dpi or 600 dpi) full-page images with a text layer riddled with OCR errors. They are too large to attach, but anyone can download and view them. They are examples of books with hundreds of pages in which every page is a full-page image, and most pages have text that can be extracted with the usual PDF-to-text tools and can be highlighted, copied, and pasted.

I have no idea of the total number of such books, but there are quite a few, and archive.org is not the only source of such books.
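
For anyone who wants to check a book for themselves, the question can be reduced to two numbers per page: how many characters the text layer yields, and what fraction of the page area is covered by a pixel image. The sketch below is Python and assumes you have already collected those two numbers with whatever PDF tool you prefer; the function names, the category labels, and the thresholds are illustrative assumptions, not anything standard.

```python
from collections import Counter


def classify_page(text_chars, image_coverage, min_chars=20, full_page=0.9):
    """Label one page from its extractable character count and the
    fraction (0.0 to 1.0) of the page covered by a pixel image.
    Both thresholds are guesses and should be tuned."""
    has_text = text_chars >= min_chars
    has_full_image = image_coverage >= full_page
    if has_text and has_full_image:
        return "image + text layer"
    if has_full_image:
        return "image only"
    if has_text:
        return "text only"
    return "neither"


def classify_document(pages):
    """pages is an iterable of (text_chars, image_coverage) pairs."""
    return Counter(classify_page(chars, cov) for chars, cov in pages)
```

A scanned book with an OCR text layer, like the ones described above, should come out almost entirely as "image + text layer", with perhaps a few "image only" pages for blank or illustrated leaves.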
Old 09-08-2018, 06:39 PM   #85
sealbeater
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
Quote:
Originally Posted by j.p.s View Post
A large fraction of the PDF books at archive.org consist of high resolution (usually 300 dpi or 600 dpi) full page images with a text layer riddled with OCR errors.
I'll take a look but I've never yet run into one that had both.
Old 09-08-2018, 08:18 PM   #86
DuckieTigger
Wizard
Posts: 4,742
Karma: 246906703
Join Date: Dec 2011
Location: USA
Device: Oasis 3, Oasis 2, PW3, PW1, KT
Quote:
Originally Posted by j.p.s View Post
PDF is based on the PostScript programming language.

PDF documents that have text that can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel-based image. The text layer may be generated from the source text or from OCR of a pixel-based image. Lots of strange errors that are not in the rendered page are evidence that the text layer is OCR-based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF-to-text application could use the location information as formatting hints and not just extract the raw text.

There is no requirement that a text layer be present, and there is no requirement that a PDF document have any pixel images at all or a single text character; it can have any mixture of them.

Pixel images in a PDF can usually be extracted and might be JPEG, JPEG2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector-based and can be rendered quite large with high quality and might require very little storage space.
Aye, and I didn't say that there is an embedded image file for the full page. Just that they have it. What I meant is that all the information is inside. I even mentioned that to extract the full page images you can simply print the PDF file. Printing will render out a bitmap at the specified resolution which can be redirected and converted into the correct input format for the OCR software.
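
One way the render-then-OCR route could be scripted, as a rough Python sketch: it assumes Poppler's `pdftoppm` and the `tesseract` command-line tool are installed, and uses them instead of a literal print driver. The helper names, the `page` file prefix, and the 300 dpi default are my own choices for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def render_command(pdf, out_prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <out_prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_command(image, out_base, lang="eng"):
    # tesseract writes the recognized text to <out_base>.txt
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pdf(pdf, workdir, dpi=300, lang="eng"):
    """Render every page to a bitmap, OCR each bitmap, return the text."""
    if not (shutil.which("pdftoppm") and shutil.which("tesseract")):
        raise RuntimeError("pdftoppm and tesseract must be on PATH")
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, workdir / "page", dpi), check=True)

    texts = []
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_command(image, image.with_suffix(""), lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n\f\n".join(texts)  # form feed between pages
```

Usage would be something like `ocr_pdf("book.pdf", "ocr_work")`. The output still needs the same cleanup as any OCR text: headers, footers, hyphenation, and so on.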
Old 09-09-2018, 04:20 PM   #87
murraypaul
Interested Bystander
Posts: 3,725
Karma: 19728152
Join Date: Jun 2008
Device: Note 4, Kobo One
[deleted]

Last edited by murraypaul; 09-09-2018 at 08:09 PM.
Old 09-15-2018, 09:19 PM   #88
Rand Brittain
Bookmaker
Posts: 416
Karma: 2143650
Join Date: Sep 2010
Device: Cybook Opus
What do people recommend these days to do smart extraction of the text of a non-scanned PDF into HTML or EPUB?
Old 09-15-2018, 09:25 PM   #89
DNSB
Bibliophagist
Posts: 35,219
Karma: 145277352
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by Rand Brittain View Post
What do people recommend these days to do smart extraction of the text of a non-scanned PDF into HTML or EPUB?
Personally, I've been using calibre with heuristics enabled and then editing the resulting epub in Sigil.
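
For batch work, the same conversion can probably be driven from a script. The Python sketch below wraps calibre's `ebook-convert` command-line tool with its `--enable-heuristics` option; it is an assumption that this matches the GUI workflow described above, and the wrapper function names are made up.

```python
import subprocess


def calibre_command(src, dst, heuristics=True):
    """Build the ebook-convert command line; the output format is
    taken from the extension of dst (for example .epub)."""
    cmd = ["ebook-convert", str(src), str(dst)]
    if heuristics:
        cmd.append("--enable-heuristics")
    return cmd


def convert(src, dst, heuristics=True):
    # Requires calibre to be installed and ebook-convert on PATH.
    subprocess.run(calibre_command(src, dst, heuristics), check=True)
```

The resulting EPUB would still be opened in Sigil for the manual cleanup pass.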
Old 09-17-2018, 02:31 PM   #90
Difflugia
Testate Amoeba
Posts: 3,049
Karma: 27300000
Join Date: Sep 2012
Device: Many Android devices, Kindle 2, Toshiba e755 PocketPC
Quote:
Originally Posted by sealbeater View Post
Yes, they are. I have never seen a PDF that contains both text and full-page images. Please attach one so we can view it.

OCR is the last thing you want, not the first.
I've attached two-page excerpts from three commercial PDF books that I've bought. You can decide whether or not they invalidate what you've said. In case anyone cares, I used The PDF Toolkit to extract pages from the larger documents.

I'll note that PDF fonts are not fixed. For example, the first page of the "Text only.pdf" file that I linked contains the Greek phrase, ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font includes different glyphs for certain kerned pairs ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy-paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste ability to such a document by adding the embedded text layer, even though there's already a "text" layer used to render the page.
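
A rough way to decide automatically when extracted text is too mangled to trust is sketched below in Python, using only the standard library. It assumes you know which script the passage should be in (Greek, for the phrase above) and measures what fraction of the extracted letters actually belong to that script. The function names and the 0.5 cutoff are illustrative assumptions. The ligature helper only covers the easy case where the extractor returns a Unicode ligature code point such as U+FB00; a fully remapped font needs OCR or a corrected text layer.

```python
import unicodedata


def normalize_ligatures(text):
    """Expand compatibility characters such as the ff ligature (U+FB00)."""
    return unicodedata.normalize("NFKC", text)


def script_fraction(text, script="GREEK"):
    """Fraction of alphabetic characters whose Unicode name starts
    with the given script name; 0.0 if there are no letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters
               if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


def should_fall_back_to_ocr(extracted, script="GREEK", min_fraction=0.5):
    # Too few letters in the expected script suggests a remapped font.
    return script_fraction(extracted, script) < min_fraction
```

Run on the two strings above, the real Greek phrase scores as entirely Greek, while the copy-pasted version contains no Greek letters at all and would be flagged for OCR.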
Attached Files
File Type: pdf Text only.pdf (36.8 KB, 160 views)
File Type: pdf Images only.pdf (199.6 KB, 163 views)
File Type: pdf Images and text.pdf (224.4 KB, 142 views)
All times are GMT -4.