09-08-2018, 05:36 AM | #76 | ||
Grand Sorcerer
Posts: 7,168
Karma: 63764653
Join Date: Feb 2009
Device: Kobo Glo HD
|
Quote:
Quote:
|
||
09-08-2018, 06:16 AM | #77 |
Wizard
Posts: 4,742
Karma: 246906703
Join Date: Dec 2011
Location: USA
Device: Oasis 3, Oasis 2, PW3, PW1, KT
|
No, they are not either/or. Even a PDF that contains text has full-page images: you simply create them by printing the PDF into an individual image for each page. OCR on those images has a better chance of succeeding than extracting possibly horribly garbled embedded text that won't tell you where the header is, for example.
Last edited by DuckieTigger; 09-08-2018 at 06:19 AM. |
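For anyone who wants to try the "print each page to an image, then OCR it" route described above, here is a minimal sketch. It assumes Poppler's `pdftoppm` and the `tesseract` command-line tools are installed; the post does not name any tools, so that choice, the helper names, and the 300 dpi default are all assumptions made for illustration.

```python
import glob
import subprocess


def render_command(pdf_path, out_prefix, dpi=300):
    """Command line that rasterizes every page of the PDF to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_command(image_path, out_base, lang="eng"):
    """Command line that OCRs one page image into <out_base>.txt."""
    return ["tesseract", image_path, out_base, "-l", lang]


def ocr_pdf(pdf_path, out_prefix="page", dpi=300, lang="eng"):
    """Render all pages, OCR each one, and return the text file names."""
    subprocess.run(render_command(pdf_path, out_prefix, dpi), check=True)
    text_files = []
    # pdftoppm names its output <prefix>-<page number>.png
    for image in sorted(glob.glob(glob.escape(out_prefix) + "-*.png")):
        base = image[: -len(".png")]
        subprocess.run(ocr_command(image, base, lang), check=True)
        text_files.append(base + ".txt")
    return text_files
```

Calling `ocr_pdf("book.pdf")` (a hypothetical file name) would leave one PNG and one text file per page in the current directory. Whether the OCR output beats the embedded text depends entirely on the document.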
09-08-2018, 06:22 AM | #78 |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
When I said a couple of days I was just trying to be generous and not hold Sealbleater to a literal 20 mins. I would be impressed if this work could be achieved in 2 days, never mind 20 minutes.
|
09-08-2018, 12:54 PM | #79 |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
End of feeding: sorry Sealbleater, no more fish!
Last edited by Agama; 09-08-2018 at 01:27 PM. |
09-08-2018, 03:10 PM | #80 | ||
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
|
Quote:
Quote:
LOL. Whatever you have to tell yourself champ. |
||
09-08-2018, 03:11 PM | #81 | |
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
|
Quote:
OCR is the last thing you want, not the first. |
|
09-08-2018, 03:11 PM | #82 |
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
|
|
09-08-2018, 05:51 PM | #83 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
PDF documents whose text can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel-based image. The text layer may be generated from the source text or from OCR of a pixel-based image. Lots of strange errors that are not in the rendered page are evidence that the text layer is OCR-based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF-to-text application could use the location information as formatting hints and not just extract the raw text. There is no requirement that a text layer be present, and there is no requirement that a PDF document have any pixel images at all or a single text character; it can have any mixture of them. Pixel images in a PDF can usually be extracted and might be JPEG, JPEG2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector-based and can be rendered quite large with high quality and might require very little storage space. |
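To make the "location information as formatting hints" idea concrete, here is a rough sketch. It assumes some extractor has already produced words with page coordinates and font sizes; the `Word` record, the tolerance, and the 1.3 heading ratio are hypothetical choices for illustration, not any particular tool's output format.

```python
from dataclasses import dataclass
from html import escape
from statistics import median


@dataclass
class Word:
    text: str
    x: float     # left edge, in points
    y: float     # baseline, measured down from the top of the page
    size: float  # font size, in points


def group_into_lines(words, y_tolerance=2.0):
    """Group words whose baselines nearly coincide, then order each line by x."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    for line in lines:
        line.sort(key=lambda w: w.x)
    return lines


def to_markup(words, heading_ratio=1.3):
    """Mark lines set noticeably larger than the median size as headings."""
    body_size = median(w.size for w in words)
    markup = []
    for line in group_into_lines(words):
        text = escape(" ".join(w.text for w in line))
        tag = "h2" if max(w.size for w in line) >= body_size * heading_ratio else "p"
        markup.append(f"<{tag}>{text}</{tag}>")
    return markup


# Words deliberately listed out of reading order, as a text layer might store them.
words = [
    Word("late.", 110, 120, 11), Word("One", 150, 80.5, 18),
    Word("It", 72, 120, 11), Word("Chapter", 72, 80, 18),
    Word("came.", 115, 134, 11), Word("was", 85, 120, 11),
    Word("Nobody", 72, 134, 11),
]
print("\n".join(to_markup(words)))
```

Real pages need more than this (columns, running headers, hyphenation), but the sketch shows how position and size alone can recover reading order and a heading.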
|
09-08-2018, 06:06 PM | #84 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
I have no idea of the total number of such books, but there are quite a few, and archive.org is not the only source of such books. |
|
09-08-2018, 06:39 PM | #85 |
Banned
Posts: 666
Karma: 1752814
Join Date: Jan 2008
Device: Sony Reader PRS-505 : Onyx Boox Max : Sony PRS-900 : Onyx Kepler Pro
|
|
09-08-2018, 08:18 PM | #86 | |
Wizard
Posts: 4,742
Karma: 246906703
Join Date: Dec 2011
Location: USA
Device: Oasis 3, Oasis 2, PW3, PW1, KT
|
Quote:
|
|
09-09-2018, 04:20 PM | #87 |
Interested Bystander
Posts: 3,725
Karma: 19728152
Join Date: Jun 2008
Device: Note 4, Kobo One
|
[deleted]
Last edited by murraypaul; 09-09-2018 at 08:09 PM. |
09-15-2018, 09:19 PM | #88 |
Bookmaker
Posts: 416
Karma: 2143650
Join Date: Sep 2010
Device: Cybook Opus
|
What do people recommend these days to do smart extraction of the text of a non-scanned PDF into HTML or EPUB?
|
09-15-2018, 09:25 PM | #89 |
Bibliophagist
Posts: 35,401
Karma: 145435140
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
|
09-17-2018, 02:31 PM | #90 | |
Testate Amoeba
Posts: 3,049
Karma: 27300000
Join Date: Sep 2012
Device: Many Android devices, Kindle 2, Toshiba e755 PocketPC
|
Quote:
I'll note that PDF fonts are not fixed. For example, the first page of the "Text only.pdf" file that I linked contains the Greek phrase, ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font includes different glyphs for certain kerned pairs ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy-paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste ability to such a document by adding an embedded text layer, even though there's already a "text" layer used to render the page. |
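A toy model of the remapping problem, for illustration only: the page stores glyph codes, and copy/paste is only as good as the table that maps those codes back to Unicode (in a real PDF that role is played by the font's ToUnicode map). The glyph codes, the table, and both function names below are invented for this sketch.

```python
import unicodedata

# Hypothetical subset font: code 2 is a single glyph drawn as the "ff" ligature.
GLYPH_TO_UNICODE = {1: "o", 2: "\ufb00", 3: "i", 4: "c", 5: "e"}


def naive_extract(codes):
    """Pretend glyph codes are character codes, as an extractor with no table must."""
    return "".join(chr(0x40 + code) for code in codes)


def mapped_extract(codes, table=GLYPH_TO_UNICODE):
    """Map codes through the table, then expand ligatures into plain letters."""
    text = "".join(table.get(code, "\ufffd") for code in codes)
    return unicodedata.normalize("NFKC", text)


codes = [1, 2, 3, 4, 5]       # the glyphs that draw the word "office"
print(naive_extract(codes))   # garbage that still renders correctly on the page
print(mapped_extract(codes))  # readable text once the mapping is known
```

If the table is missing or deliberately wrong, nothing in the file says what the glyphs mean, which is why OCR of the rendered page can win in those cases.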
|
|