What pdf format can be converted to epub?

KMalsi · 10-06-2021, 02:55 AM

I scanned a paperback book (after getting permission from the author) into a searchable pdf. I've attached one page of the pdf here. It looks like an image but the text is searchable.

I assumed this was all that was needed to convert the pdf into a reflowable epub. I did the conversion, but the epub output looked exactly the same as the pdf! I've attached the epub here as well.

How do I convert such a searchable pdf into an editable epub with xhtml files and images?

BetterRed · 10-06-2021, 07:29 AM

I was able to open the PDF with a current version of Word, save it as DOCX, and convert it to EPUB with calibre. The attached ZIP has the DOCX and EPUB.

Word has been my default tool for converting 'simple' PDFs for a while. If there are a lot of tables and images with wrap-around text, etc. (e.g. coffee table cook books), it's not so good. If it barfs because the PDF is too big, grab one of the free PDF split utilities and chop the file into two or more chunks on chapter boundaries. I use one called PDFSam.
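For anyone who would rather script the split than use a GUI tool, here is a minimal Python sketch of the "chop on chapter boundaries" idea. It is not what PDFSam does internally. The function names and the page numbers are made up for illustration, and `split_pdf` assumes the third-party pypdf package is installed.

```python
def chunk_ranges(chapter_starts, total_pages):
    """Return inclusive (first, last) page ranges, one per chunk.

    chapter_starts: 1-based page numbers where a new chunk should begin.
    total_pages: number of pages in the PDF.
    """
    starts = sorted(set(chapter_starts))
    if not starts or starts[0] != 1:
        starts.insert(0, 1)  # the first chunk always begins on page 1
    if starts[-1] > total_pages:
        raise ValueError("chapter start is beyond the last page")
    # Each chunk ends on the page before the next chunk starts.
    ends = [s - 1 for s in starts[1:]] + [total_pages]
    return list(zip(starts, ends))


def split_pdf(src, chapter_starts, prefix="chunk"):
    """Write one PDF per chunk (assumes pypdf is installed)."""
    from pypdf import PdfReader, PdfWriter  # imported lazily

    reader = PdfReader(src)
    ranges = chunk_ranges(chapter_starts, len(reader.pages))
    for n, (first, last) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(first - 1, last):  # pypdf pages are 0-based
            writer.add_page(reader.pages[index])
        writer.write(f"{prefix}_{n}.pdf")


# Hypothetical book: 380 pages, chunks starting at pages 1, 120 and 245.
print(chunk_ranges([1, 120, 245], 380))
```

Each resulting chunk can then be opened in Word on its own and the DOCX files stitched back together afterwards.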

If you don't have access to a recent version of Word, try LibreOffice Writer to do the conversion to DOCX.

BR
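The DOCX to EPUB step described above can also be run from a script using calibre's `ebook-convert` command-line tool instead of the GUI. This is only a sketch: it assumes calibre is installed and `ebook-convert` is on the PATH, the file names are placeholders, and it passes no conversion options.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, epub_path=None):
    """Build the ebook-convert argument list for a DOCX -> EPUB conversion."""
    src = Path(docx_path)
    # Default output: same name as the input, with an .epub extension.
    dst = Path(epub_path) if epub_path else src.with_suffix(".epub")
    return ["ebook-convert", str(src), str(dst)]


def convert(docx_path, epub_path=None):
    """Run the conversion; raises if calibre's CLI is not on the PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is calibre installed?")
    subprocess.run(build_convert_command(docx_path, epub_path), check=True)


# Placeholder file name, shown without running the conversion.
print(build_convert_command("book.docx"))
```

The PDF to DOCX step still happens in Word or LibreOffice Writer first; this only covers the second half of the workflow.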

KMalsi · 10-06-2021, 07:40 AM

Thanks! I read here that people seemed to be able to convert directly from pdf, so I thought there were some settings in Calibre that I had missed. I do have MS Word and am able to convert the pdf to docx, although a lot of editing was needed to get it right. Now I know how it's done.

theducks · 10-06-2021, 10:57 AM

PDF is a LAYOUT/paste-up format, originally designed so that every user's printout looks the same.

HOW it was made affects the quality of conversion. PDF is not a linear file like an EPUB, where every item is stored in order of use, start to finish.
WHAT it contains (pictures, charts...) also affects the quality of conversion.

So, back to your question: an EPUB created FROM HTML has a better chance of converting back, because the source was linear (and probably has no ligatures).
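On the ligature point: text extracted from a PDF sometimes contains single-character ligatures such as "ﬁ" (U+FB01) where the reader expects "fi", which can break searching and spell-checking in the EPUB. A small Python sketch of one way to expand the common Latin ones after extraction is below. The mapping table and function name are illustrative, not something any of the tools mentioned here are known to do.

```python
# Common Latin ligature code points and their plain-letter equivalents.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text):
    """Replace single-character ligatures with ordinary letters."""
    return text.translate(LIGATURES)


print(expand_ligatures("\ufb01nal \ufb02ow"))
```

`unicodedata.normalize("NFKC", text)` would expand these too, but it also rewrites other compatibility characters, so an explicit table is the more conservative choice.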

retiredbiker · 10-06-2021, 11:14 AM

I have run across a number of pdfs that Calibre would not convert, even though they were searchable, that is, contained some sort of text. Calibre uses pdftohtml to extract the text. For the ones I've found, running pdftohtml from the command line failed, but pdftotext worked. I guess Word can find some text Calibre can't.

A pdf can contain just about anything. As theducks said, it depends on how it was made.
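A quick way to check whether a given PDF has text that a tool can actually extract is to run `pdftotext` on it and see how much comes out. The Python sketch below assumes the Poppler `pdftotext` tool is installed and on the PATH. The 100-character threshold is an arbitrary guess and the file name is a placeholder.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the command; "-" as the output file means write to stdout."""
    return ["pdftotext", str(pdf_path), "-"]


def extract_text(pdf_path):
    """Return the text pdftotext extracts (assumes pdftotext is installed)."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_usable_text(text, min_chars=100):
    """Rough test: are there at least min_chars non-whitespace characters?"""
    return sum(not c.isspace() for c in text) >= min_chars


# Placeholder file name, shown without running the extraction.
print(pdftotext_command("scan.pdf"))
```

If `has_usable_text(extract_text("scan.pdf"))` comes back false, the "searchable" layer is probably missing or unreadable to this tool, and a different route such as Word may do better.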

KMalsi · 10-06-2021, 11:41 AM

I’ll need to read up and learn more about these.

I paid someone on Fiverr to convert this pdf to epub for me. When he sent the epub over, I was able to see the xhtml files, the css style sheet, and all the jpg images when I loaded it into Calibre’s ebook editor. I learned html two decades ago but can still remember some of it, so I was able to fine-tune the epub.

I then asked him how he managed to convert the pdf to epub, and he told me that he first converted the pdf to Word, then extracted the images and converted the Word document to xhtml in Calibre. So he used BR’s method.

BetterRed · 10-06-2021, 04:36 PM

@KMalsi - there are a couple of add-ins for Word that can help tidy up PDF artefacts:

MobileRead: Toxaris's eBook Tools MS Word add-in.

I also use the Translator Tools add-in. It has features that are not translator-specific, but it's not free.

BR

KMalsi · 10-06-2021, 08:14 PM

Thanks BR!
