I have a 350mb PDF. How do I best convert this? - Page 2

Ghitulescu · 02-11-2019, 04:34 PM

I am also curious why it works this way.
Yes, I have read the FAQ, the Sticky and some of the related threads, yet I found no clear answer why:

... I too have a PDF that came out of a scanner.
In Reader I can select the text, and it's fairly accurate. So the PDF file also contains text, not only images.
Yet calibre outputs a bunch of images, one per page.

So, again, why does calibre not use the "hidden text"?
Yes, I know it's not best to use PDFs... but what to do when the only source is one of them?!

DNSB · 02-12-2019, 12:41 AM

Quote:

Originally Posted by Ghitulescu

I am also curious why it works this way.
Yes, I have read the FAQ, the Sticky and some of the related threads, yet I found no clear answer why:

... I too have a PDF that came out of a scanner.
In Reader I can select the text, and it's fairly accurate. So the PDF file also contains text, not only images.
Yet calibre outputs a bunch of images, one per page.

So, again, why does calibre not use the "hidden text"?
Yes, I know it's not best to use PDFs... but what to do when the only source is one of them?!

I have one commercially produced scan of a book from the 1870s.* The images are not the greatest, but they are what the original book looked like. The text plane is useful for searching, but when I take a close look at it, it has a multitude of OCR errors which make it painful to read.

* The book was originally a two-volume set that includes household tips (use sulphuric acid on your windows to prevent frost), photography, including creating your own wet plates, and much else. If I were stranded on a desert island, that is a set of books that would be handy, though by today's standards many of the items would be considered extreme safety hazards.

Ghitulescu · 02-12-2019, 02:34 AM

Quote:

Originally Posted by DNSB

I have one commercially produced scan of a book from the 1870s*
...
The text plane is useful for searching but when I take a close look at it, it has a multitude of OCR errors which make it painful to read.

Yes, it's usually this way. Old books used heavily "serifed"/decorative fonts that are rather difficult for simple/cheap software to OCR. Yes, it's annoying to have to replace every "rn" or "in" with "m", every "li" with "h", and so on. But that text exists.
Still, I would really like to know why calibre does not see that text, or does not want to use it.
In my case it's a PhD thesis that was typewritten, and the typeface (a sort of Courier) is a piece of cake to OCR (and it was OCRed during the scanning).

Maybe this was not clear: my question is not about the file being large (such files have to be, because they contain images as well, or images only), but about the PDF->EPUB conversion. I did not want to open a new thread for a problem that was "solved" in this manner: "DO NOT use PDFs!"
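
The letter confusions described in this post can be undone semi-automatically once the text layer has been extracted. The following is only a minimal sketch: the confusion pairs are my reading of the post ("rn" or "in" seen where "m" was printed, "li" where "h" was printed), the function names are hypothetical, and the word list is something you would have to supply yourself. A word is changed only when it is not in the word list and exactly one corrected variant is.

```python
import re

# Confusion pairs as read from the post: (what the OCR produced, what was printed).
CONFUSIONS = [("rn", "m"), ("in", "m"), ("li", "h")]


def candidates(word):
    """Yield variants of word with one confusion pair undone at one position."""
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def fix_word(word, vocabulary):
    """Correct word only if it is unknown and exactly one variant is known."""
    if word.lower() in vocabulary:
        return word
    known = {c for c in candidates(word) if c.lower() in vocabulary}
    return known.pop() if len(known) == 1 else word


def fix_text(text, vocabulary):
    """Apply fix_word to every run of ASCII letters in text."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_word(m.group(0), vocabulary), text)
```

Checking against a word list is what keeps legitimate words such as "burn" from being turned into something else; anything ambiguous is left alone for manual review.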

kovidgoyal · 02-12-2019, 03:42 AM

It's not calibre that decides whether or not to use the text; it's pdftohtml from the poppler project, which calibre uses for initial content extraction from PDF files.
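
To see, outside calibre, whether poppler's tools can read a text layer at all, one option is to run pdftotext (another poppler utility) on the first few pages. This is a minimal sketch, not what calibre or pdftohtml do internally: the function names and the 200-character threshold are assumptions made for illustration, and it requires pdftotext to be installed and on the PATH.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=1, last_page=3):
    """Build a pdftotext command that writes the given page range to stdout."""
    # -f / -l select the first and last page; "-" as the output file means stdout.
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"]


def has_usable_text(extracted, min_chars=200):
    """Heuristic (an assumption): enough non-whitespace characters were extracted."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def probe_text_layer(pdf_path):
    """Return True/False for the heuristic, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return has_usable_text(result.stdout)
```

If `probe_text_layer("scan.pdf")` returns True, the text layer is there and readable by poppler; the file name here is just a placeholder.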

stumped · 02-12-2019, 06:09 AM

Quote:

Originally Posted by DNSB

..

* The book was originally a two-volume set that includes household tips (use sulphuric acid on your windows to prevent frost), photography, including creating your own wet plates, and much else. If I were stranded on a desert island, that is a set of books that would be handy, though by today's standards many of the items would be considered extreme safety hazards.

But not many desert islands have handy supplies of sulphuric acid, or even windows to apply it to?

Ghitulescu · 02-12-2019, 11:14 AM

Quote:

Originally Posted by stumped

But not many desert islands have handy supplies of sulphuric acid, or even windows to apply it to?

Then I suggest you read L'Île mystérieuse (The Mysterious Island) by Jules Verne, in particular Chapter XVII.

Pajamaman · 02-12-2019, 09:28 PM

You could split up the file into smaller files. At least it would read quicker. Adobe Acrobat Pro does it, but is expensive. Surprisingly, it seems Chrome will do it.
https://superuser.com/questions/6847...ile-in-windows

BetterRed · 02-12-2019, 09:48 PM

Quote:

Originally Posted by Pajamaman

You could split up the file into smaller files. At least it would read quicker. Adobe Acrobat Pro does it, but is expensive. Surprisingly, it seems Chrome will do it.
https://superuser.com/questions/6847...ile-in-windows

Good idea - I've used a freebie called PDFsam Basic to good effect to split and extract chapters from PDFs

BR
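
For anyone who prefers a script to the tools mentioned above, splitting a PDF into fixed-size chunks can be sketched in Python. This is a rough sketch under assumptions: it relies on the third-party pypdf package (not something named in this thread), and the function names, the 50-page chunk size and the output naming scheme are all illustrative.

```python
def chunk_ranges(total_pages, pages_per_chunk):
    """Return (start, stop) page-index pairs, stop exclusive, covering all pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def split_pdf(src_path, out_prefix, pages_per_chunk=50):
    """Write src_path as several smaller PDFs and return their file names."""
    # Third-party dependency, imported here so chunk_ranges works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    written = []
    ranges = chunk_ranges(len(reader.pages), pages_per_chunk)
    for number, (start, stop) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_path = f"{out_prefix}-{number:03d}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        written.append(out_path)
    return written
```

Calling `split_pdf("big.pdf", "part")` would write part-001.pdf, part-002.pdf and so on; both names are placeholders. Splitting by chapter rather than by fixed page count would need the chapter page numbers supplied by hand.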

