converting PDF to ePub tips

michaelbr · 08-03-2023, 02:35 AM

I understand that PDF is not very friendly when converting to ePub. I've tried a few online converters (it seems most use Calibre) and the results require a lot of work. A browser seems to open a PDF file easily while keeping the formatting, so I'm wondering if anyone has tried converting PDF to ePub using the following method:
1 - open the PDF file in a browser (Brave in my case)
2 - copy the text from the browser and paste it into an ePub editing app (Sigil)
3 - copy/paste links/photos manually
Step 3 requires a lot of work if there are a lot of links/photos. Does anyone know any tools/tips to make it automatic, or at least easier than doing it manually?
PS: OS is MX Linux

DNSB · 08-03-2023, 12:27 PM

Michael, please note that posting multiple copies of the same message is not permitted on MobileRead. Posting the same message in this forum and the Sigil forum is unlikely to get you any extra help and is annoying to those who find themselves starting to reply before realizing that they already replied.

Sarmat89 · 08-03-2023, 07:23 PM

What I found very useful was using LibreOffice Draw to convert PDF to a flat ODG document, which can be easily converted to a markup language like XML.
This is the only sure way to work around paragraphs, fonts, and page breaks. Automatic tools often fail with those.
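
A minimal sketch of the "converted to a markup language" step, assuming the PDF was saved from Draw as a flat XML drawing and that its text sits in ODF `text:p` elements. The function name and the sample document are made up for illustration; a real file will carry much more markup around the text.

```python
import xml.etree.ElementTree as ET

# ODF namespace for text elements (text:p, text:span, ...)
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(flat_odg_xml: str) -> list[str]:
    """Return the visible text of every text:p element, in document order."""
    root = ET.fromstring(flat_odg_xml)
    paragraphs = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        # itertext() also picks up text inside nested text:span elements.
        # Spacing elements such as text:s carry no text and are ignored here.
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


# Tiny hand-written sample standing in for a real flat ODG file
SAMPLE = """<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:drawing><draw:page>
    <draw:frame><draw:text-box>
      <text:p><text:span>Chapter </text:span><text:span>One</text:span></text:p>
    </draw:text-box></draw:frame>
    <draw:frame><draw:text-box>
      <text:p>It was a dark night.</text:p>
    </draw:text-box></draw:frame>
  </draw:page></office:drawing></office:body>
</office:document>"""

if __name__ == "__main__":
    for line in extract_paragraphs(SAMPLE):
        print(line)
```

Each text box usually comes out as its own paragraph, so lines that belong together still have to be joined by hand or by a later clean-up pass.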

michaelbr · 08-04-2023, 01:25 PM

Quote:

Originally Posted by DNSB

Michael, please note that posting multiple copies of the same message is not permitted on MobileRead. Posting the same message in this forum and the Sigil forum is unlikely to get you any extra help and is annoying to those who find themselves starting to reply before realizing that they already replied.

Sorry, won't happen again.

michaelbr · 08-04-2023, 01:26 PM

Quote:

Originally Posted by Sarmat89

What I found very useful was using LibreOffice Draw to convert PDF to a flat ODG document, which can be easily converted to a markup language like XML.
This is the only sure way to work around paragraphs, fonts, and page breaks. Automatic tools often fail with those.

Thanks for this tip, will give it a try.

Pajamaman · 08-04-2023, 02:10 PM

Quote:

Originally Posted by michaelbr

Thanks for this tip, will give it a try.

I often find ABBYY FineReader to be the best way. Convert to HTML, then ePub.

AlanHK · 08-06-2023, 06:27 AM

You need to be aware that there are several different kinds of PDFs. One is made by scanning text.
So these are sets of images of text. Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer which allows you to search for text and copy chunks of it.

The other kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the max it will still be smooth.

With either of these, you can get the text cursor and select all, copy and paste in Word (or equivalent). From there, save as "web page, filtered" and you get HTML, which you can import to Sigil, or Calibre.
You will then have a lot more work to clean it up.
If it was OCR text, there will be lots of errors. Spellcheck will help find many. You also need to review the page breaks, which insert spurious paragraph breaks. Then rationalise the CSS.

If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML; some versions also do ePub.

And there is a command-line app, "pdftohtml", which does as it says. See https://poppler.freedesktop.org/

No matter how you do it, you need to invest hours at least to clean up and check.
Otherwise you will get garbage; see the awful epubs automatically created by OCR at Internet Archive.
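
As a rough illustration of the "spurious paragraph breaks" clean-up, here is a sketch of one possible heuristic: join a paragraph to the previous one when the previous one does not end in sentence punctuation and the new one starts with a lowercase letter. The function name and the rule itself are assumptions for illustration, not something any of the tools above do for you, and it will miss or mis-join some cases, so the result still needs proofreading.

```python
import re

# Previous paragraph "looks finished" if it ends in sentence punctuation,
# optionally followed by a closing quote or bracket.
SENTENCE_END = re.compile(r'[.!?:;]["\')\u201d\u2019]*$')


def merge_broken_paragraphs(paragraphs):
    """Join paragraphs that were probably split by a page break."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty paragraphs left over from the export
        continues_previous = (
            merged
            and not SENTENCE_END.search(merged[-1])
            and para[0].islower()
        )
        if continues_previous:
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged


if __name__ == "__main__":
    pages = ["The cat sat on the", "mat and purred.", "Next paragraph."]
    print(merge_broken_paragraphs(pages))
```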

JSWolf · 08-10-2023, 05:58 PM

The only way you can be 100% sure that the converted PDF is error-free is to do a 100% A/B comparison of the PDF and the resulting ePub.

Do you want to do this? Do you have to convert PDF > ePub?

michaelbr · 08-12-2023, 02:24 AM

Quote:

Originally Posted by Pajamaman

I often find ABBYY FineReader to be the best way. Convert to HTML, then ePub.

It seems ABBYY is only for Windows; I gave up Windows a few years back.

michaelbr · 08-12-2023, 02:28 AM

Quote:

Originally Posted by AlanHK

You need to be aware that there are several different kinds of PDFs. One is made by scanning text.
So these are sets of images of text. Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer which allows you to search for text and copy chunks of it.

The other kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the max it will still be smooth.

With either of these, you can get the text cursor and select all, copy and paste in Word (or equivalent). From there, save as "web page, filtered" and you get HTML, which you can import to Sigil, or Calibre.
You will then have a lot more work to clean it up.
If it was OCR text, there will be lots of errors. Spellcheck will help find many. You also need to review the page breaks, which insert spurious paragraph breaks. Then rationalise the CSS.

If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML; some versions also do ePub.

And there is a command-line app, "pdftohtml", which does as it says. See https://poppler.freedesktop.org/

No matter how you do it, you need to invest hours at least to clean up and check.
Otherwise you will get garbage; see the awful epubs automatically created by OCR at Internet Archive.

Thanks for this detailed explanation and tips, will check them out.

michaelbr · 08-12-2023, 02:30 AM

Quote:

Originally Posted by JSWolf

The only way you can be 100% sure that the converted PDF is error-free is to do a 100% A/B comparison of the PDF and the resulting ePub.

Do you want to do this? Do you have to convert PDF > ePub?

Are you offering your service? Or is there a tool to do that?

Quoth · 08-12-2023, 04:39 AM

Probably neither.

DNSB · 08-12-2023, 04:37 PM

Quote:

Originally Posted by michaelbr

Are you offering your service? Or is there a tool to do that?

The tool used to do that is your Mark 1 eyeball. I suspect that if you want someone else to use their eyeballs on your behalf, it will get expensive.

You will have to bring up both the ePub and PDF on your screen and check line by line for differences.
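
A script cannot replace that check, but it can point the eyeball at likely trouble spots. The sketch below assumes you have already extracted plain text from both files (how you get the text out of the ePub is left open) and uses Python's standard `difflib` to list the words that differ. The function names are made up for illustration, and anything the two texts get wrong in the same way will not show up.

```python
import difflib
import re


def normalize(text):
    """Collapse all whitespace so line wrapping does not count as a difference."""
    return re.sub(r"\s+", " ", text).strip()


def changed_words(pdf_text, epub_text):
    """Return (operation, words_in_pdf, words_in_epub) for every differing run."""
    a = normalize(pdf_text).split(" ")
    b = normalize(epub_text).split(" ")
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    pdf = "The modern reader\nturns the page."
    epub = "The modem reader turns the page."
    for op, before, after in changed_words(pdf, epub):
        print(op, repr(before), "->", repr(after))
```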

retiredbiker · 08-12-2023, 09:08 PM

Quote:

Originally Posted by michaelbr

It seems ABBYY is only for Windows; I gave up Windows a few years back.

I use Ubuntu.

For OCR, try OCRFeeder as a front end to tesseract. Tesseract is very accurate given a good image. I do a page at a time, defining the text area manually. It can then handle multiple columns, advertisements, "continued on page 99" and so on. OCRFeeder is very good at connecting the lines into correct paragraphs, dealing with end-of-line hyphens, and so on.

Might seem slow, but this is actually the quick part of the process... you will have to proofread and correct no matter what.

pdftopng will turn PDF pages into PNG images; OCRing those works better than OCRing the PDF itself.

ImageMagick can tame image files that are too large and slow down tesseract. Scan Tailor Advanced and Unpaper may be useful; I find them black magic, but I use them if needed.
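
For anyone who prefers to script the render, shrink and OCR steps instead of going through OCRFeeder, here is a rough sketch that only builds the command lines. The 300 dpi and 50% defaults, the helper names and the file names are illustrative assumptions; the ImageMagick command is written as `convert`, which newer installs may expose as `magick` instead.

```python
import shutil
import subprocess


def render_cmd(pdf, png_root, first, last, dpi=300):
    """pdftopng: render pages first..last of the PDF to PNG files named from png_root."""
    return ["pdftopng", "-r", str(dpi), "-f", str(first), "-l", str(last), pdf, png_root]


def shrink_cmd(src, dst, percent=50):
    """ImageMagick: scale an oversized page image down before OCR."""
    return ["convert", src, "-resize", f"{percent}%", dst]


def ocr_cmd(image, out_base, lang="eng"):
    """Tesseract: OCR one image; the text is written to out_base.txt."""
    return ["tesseract", image, out_base, "-l", lang]


def run(cmd):
    """Run one command, failing early if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so nothing is touched by accident.
    print(render_cmd("book.pdf", "pages/page", 1, 3))
    print(shrink_cmd("pages/in.png", "pages/out.png"))
    print(ocr_cmd("pages/out.png", "pages/out"))
```

Defining the text area by hand, as described above, is not covered here; this only handles whole pages.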

If you want to try and use existing text, pdftohtml will sometimes fail while pdftotext will work. No idea why. If you use pdftotext, try the -layout option and get ready for a lot of regex to tame the spacing.
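
To give an idea of what "a lot of regex" can look like, here is a minimal sketch. It assumes pdftotext is installed, and the clean-up rules are only a starting point: they collapse the padding that -layout adds, rejoin words hyphenated across line ends, and treat blank lines and page separators as paragraph breaks. The function names are made up, and the hyphen rule will also glue together some words that really should keep their hyphen.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext with -layout; '-' as the output file sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def tame_spacing(text):
    """Turn layout-preserved text into a list of single-line paragraphs."""
    # Pages are separated by form feeds; treat each one as a paragraph break
    # (this is one source of spurious breaks that need reviewing later).
    text = text.replace("\f", "\n\n")
    text = re.sub(r"[ \t]+", " ", text)                   # collapse padding
    text = re.sub(r" *\n *", "\n", text)                  # trim around line ends
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)    # rejoin "jum-" + "ped"
    paragraphs = re.split(r"\n{2,}", text.strip())
    return [p.replace("\n", " ") for p in paragraphs if p]


if __name__ == "__main__":
    raw = "   The   quick brown fox jum-\n   ped over   the lazy dog.\n\n\n   Second   paragraph.\f"
    print(tame_spacing(raw))
```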

retiredbiker · 08-12-2023, 09:45 PM

Quote:

Originally Posted by michaelbr

3 - copy/paste links/photos manually
Step 3 requires a lot of work if there are a lot of links/photos. Does anyone know any tools/tips to make it automatic, or at least easier than doing it manually?
PS: OS is MX Linux

If you use LibreOffice Writer to edit the text, do the links for footnotes, endnotes and so on there. They will convert nicely into epub using the Calibre conversion. I find it much easier to make the links in Writer rather than in the epub editor.

Dealing with epub images is a pain and depends on your audience...what do they use for readers? The most basic advice for physical ereaders like Kindle or Kobo is to declare the width as a percent and the height auto in the CSS...don't use absolute units. Then surround the <img...> line with a <p> or <div> to make it center/right/left. Like this:

<p class="center">
<img alt="" class="widepic" src="../Images/c02.jpg"/>
</p>

where "widepic" says in the CSS:

.widepic {
    height: auto;
    width: 98%;
}

And "center" is just that:

.center {
    text-align: center;
    text-indent: 0;
    margin: .5em 0 .5em 0;
}

So you can pull the images into the Writer version, and apply something like this at the epub editor stage. I can usually replace what the conversion does with this using some simple regex. This is almost always readable on various readers or apps.
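
A sketch of what such a regex pass could look like in Python. What the conversion actually emits varies, so this assumes the common case of `<img>` tags carrying fixed width, height or style attributes with double-quoted values. It keeps only `src` and `alt` and adds the "widepic" class from above; the surrounding `<p>` or `<div>` is left alone and still needs its "center" class. The function names are made up for illustration.

```python
import re

# Any <img ...> tag; breaks if an attribute value itself contains ">"
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# name="value" pairs (double-quoted values only)
ATTR = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def normalize_img(match):
    """Rebuild one <img> tag with only alt, class and src."""
    attrs = dict(ATTR.findall(match.group(0)))
    src = attrs.get("src", "")
    alt = attrs.get("alt", "")
    return f'<img alt="{alt}" class="widepic" src="{src}"/>'


def normalize_images(xhtml):
    """Apply normalize_img to every <img> tag in a chunk of XHTML."""
    return IMG_TAG.sub(normalize_img, xhtml)


if __name__ == "__main__":
    before = '<p><img src="../Images/c02.jpg" width="600" height="400" style="width:600px"/></p>'
    print(normalize_images(before))
```

Run it on a copy of the files first; a regex does not understand XHTML, so check the result in the editor afterwards.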
