Old 08-03-2023, 02:35 AM   #1
michaelbr
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
converting PDF to ePub tips

I understand that PDF is not very friendly to convert to ePub. I've tried a few online converters (most seem to use Calibre) and the result requires a lot of work. A browser seems to open a PDF file easily while keeping the formatting, so I'm wondering if anyone has tried converting PDF to ePub using the following method:
1 - open the PDF file in a browser (Brave in my case)
2 - copy the text from the browser and paste it into an ePub editing app (Sigil)
3 - copy/paste links/photos manually
Step 3 requires a lot of work if there are a lot of links/photos. Does anyone know any tools/tips to make it automatic, or at least easier than doing it manually?
PS: OS is MX Linux
Old 08-03-2023, 12:27 PM   #2
DNSB
Bibliophagist
Posts: 36,391
Karma: 145735554
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Michael, please note that posting multiple copies of the same message is not permitted on MobileRead. Posting the same message in this forum and the Sigil forum is unlikely to get you any extra help and is annoying to those who find themselves starting to reply before realizing that they already replied.
Old 08-03-2023, 07:23 PM   #3
Sarmat89
Evangelist
Posts: 483
Karma: 2267928
Join Date: Nov 2015
Device: none
What I found very useful was using LibreOffice Draw to convert PDF to a flat ODG document, which can be easily converted to a markup language like XML.
This is the only sure way to work around paragraphs, fonts, and page breaks. Automatic tools often fail with those.
Old 08-04-2023, 01:25 PM   #4
michaelbr
Quote:
Originally Posted by DNSB View Post
Michael, please note that posting multiple copies of the same message is not permitted on MobileRead. Posting the same message in this forum and the Sigil forum is unlikely to get you any extra help and is annoying to those who find themselves starting to reply before realizing that they already replied.
Sorry, won't happen again.
Old 08-04-2023, 01:26 PM   #5
michaelbr
Quote:
Originally Posted by Sarmat89 View Post
What I found very useful was using LibreOffice Draw to convert PDF to a flat ODG document, which can be easily converted to a markup language like XML.
This is the only sure way to work around paragraphs, fonts, and page breaks. Automatic tools often fail with those.
Thanks for this tip, will give it a try.
Old 08-04-2023, 02:10 PM   #6
Pajamaman
Wizard
Posts: 2,827
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
Quote:
Originally Posted by michaelbr View Post
Thanks for this tip, will give it a try.

I often find ABBYY FineReader to be the best way. Convert to HTML, then ePub.
Old 08-06-2023, 06:27 AM   #7
AlanHK
Guru
Posts: 668
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
You need to be aware that there are several different kinds of PDFs. One is made by scanning text, so these are sets of images of text. Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer, which allows you to search for text and copy chunks of it.

The other kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the max it will still be smooth.

With either of these, you can get the text cursor, select all, then copy and paste into Word (or equivalent). From there, save as "web page, filtered" and you get HTML, which you can import into Sigil or Calibre.
You will then have a lot more work to clean it up.
If it was OCR text, there will be lots of errors. Spellcheck will help find many. You also need to review the page breaks, which insert spurious paragraph breaks. Then rationalise the CSS.

If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML; some versions also export ePub.

There is also a command line app, "pdftohtml", which does as it says. See https://poppler.freedesktop.org/

No matter how you do it, you need to invest hours at least to clean up and check.
Otherwise you will get garbage; see the awful ePubs automatically created by OCR at Internet Archive.
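
The spurious paragraph breaks at page breaks can be partly caught by a script before the manual pass. This is only a rough sketch in Python, not part of any of the tools named above. It assumes each paragraph is already one string, and it guesses that a paragraph was cut in two when it ends without closing punctuation and the next one starts with a lowercase letter. The names `SENTENCE_END` and `merge_split_paragraphs` are made up for illustration.

```python
# Characters that normally close a paragraph (assumption: adjust to the book).
SENTENCE_END = ".!?:;\"')]\u201d"


def merge_split_paragraphs(paragraphs):
    """Join paragraphs that look like one paragraph cut in two by a page break."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty paragraphs left over from the conversion
        if (
            merged
            and not merged[-1].endswith(tuple(SENTENCE_END))
            and para[0].islower()
        ):
            # Previous paragraph ends mid-sentence and this one continues it.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```

The heuristic will miss breaks that fall exactly at the end of a sentence and will wrongly join a heading to a paragraph that starts in lowercase, so the result still needs to be read through.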
Old 08-10-2023, 05:58 PM   #8
JSWolf
Resident Curmudgeon
Posts: 74,505
Karma: 129668758
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
The only way you can 100% be sure that the converted PDF is error free is to do a 100% A/B comparison of the PDF and the resulting ePub.

Do you want to do this? Do you have to convert PDF > ePub?
Old 08-12-2023, 02:24 AM   #9
michaelbr
Quote:
Originally Posted by Pajamaman View Post
I often find ABBYY FineReader to be the best way. Convert to HTML, then ePub.
It seems ABBYY is only for Windows; I gave up Windows a few years back.
Old 08-12-2023, 02:28 AM   #10
michaelbr
Quote:
Originally Posted by AlanHK View Post
You need to be aware that there are several different kinds of PDFs. One is made by scanning text, so these are sets of images of text. Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer, which allows you to search for text and copy chunks of it.

The other kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the max it will still be smooth.

With either of these, you can get the text cursor, select all, then copy and paste into Word (or equivalent). From there, save as "web page, filtered" and you get HTML, which you can import into Sigil or Calibre.
You will then have a lot more work to clean it up.
If it was OCR text, there will be lots of errors. Spellcheck will help find many. You also need to review the page breaks, which insert spurious paragraph breaks. Then rationalise the CSS.

If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML; some versions also export ePub.

There is also a command line app, "pdftohtml", which does as it says. See https://poppler.freedesktop.org/

No matter how you do it, you need to invest hours at least to clean up and check.
Otherwise you will get garbage; see the awful ePubs automatically created by OCR at Internet Archive.
Thanks for this detailed explanation and tips, will check them out.
Old 08-12-2023, 02:30 AM   #11
michaelbr
Quote:
Originally Posted by JSWolf View Post
The only way you can 100% be sure that the converted PDF is error free is to do a 100% A/B comparison of the PDF and the resulting ePub.

Do you want to do this? Do you have to convert PDF > ePub?
Are you offering your service? Or is there a tool to do that?
Old 08-12-2023, 04:39 AM   #12
Quoth
the rook, bossing Never.
Posts: 11,493
Karma: 87454321
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Probably neither.
Old 08-12-2023, 04:37 PM   #13
DNSB
Quote:
Originally Posted by michaelbr View Post
Are you offering your service? Or is there a tool to do that?
The tool used to do that is your Mark 1 eyeball. I suspect that if you want someone else to use their eyeballs on your behalf, it will get expensive.

You will have to bring up both the ePub and PDF on your screen and check line by line for differences.
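
A text diff will not replace the eyeball check, but it can point at where to look first. The following is a minimal sketch in Python using the standard `difflib` module. It assumes you have already extracted plain text from both the PDF and the ePub into two strings (how you do that is up to you), and the function names `normalize` and `changed_spans` are made up for illustration.

```python
import difflib
import re


def normalize(text):
    """Collapse all whitespace so differences in line wrapping are ignored."""
    return re.sub(r"\s+", " ", text).strip()


def changed_spans(pdf_text, epub_text):
    """Return (pdf_words, epub_words) pairs for each span where the texts differ."""
    a = normalize(pdf_text).split(" ")
    b = normalize(epub_text).split(" ")
    # Compare word by word; autojunk is off so common words are not skipped.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

If the PDF text itself came from OCR, both sides can share the same error and the diff will show nothing, so this only narrows the search; it does not prove the ePub is clean.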
Old 08-12-2023, 09:08 PM   #14
retiredbiker
Addict
Posts: 389
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
Quote:
Originally Posted by michaelbr View Post
It seems ABBYY is only for Windows; I gave up Windows a few years back.
I use Ubuntu.

For OCR, try OCRFeeder as a front end to tesseract. Tesseract is very accurate given a good image. I do a page at a time, defining the text area manually. It can then handle multiple columns, advertisements, "continued on page 99" and so on. OCRFeeder is very good at connecting the lines into correct paragraphs, dealing with end-of-line hyphens, and so on.

Might seem slow, but this is actually the quick part of the process...you will have to proofread and correct no matter what.

pdftopng will get images out of PDFs; that works better than OCRing the PDF itself.

ImageMagick can tame image files that are too large and slow down tesseract. Scan Tailor Advanced and Unpaper may be useful; I find them black magic, but I use them if needed.

If you want to try and use existing text, pdftohtml will sometimes fail while pdftotext will work. No idea why. If you use pdftotext, try the -layout option and get ready for a lot of regex to tame the spacing.
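
As a rough idea of the kind of regex involved, here is a small Python sketch for text extracted with layout preserved. It is only an illustration under assumptions: the helper name `tidy_layout_text` is made up, and the rules (join end-of-line hyphens, collapse runs of spaces, trim lines, squeeze blank lines) are a starting point, not a complete cleanup.

```python
import re


def tidy_layout_text(text):
    """Clean up text extracted with layout preserved (hypothetical helper)."""
    # Join words hyphenated across a line break: "conver-\n   sion" -> "conversion".
    # Caveat: this also joins real compounds split at the hyphen ("well-\nknown").
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Strip leading and trailing spaces on every line.
    text = re.sub(r"(?m)^[ \t]+|[ \t]+$", "", text)
    # Squeeze three or more newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Collapsing the spacing throws away column alignment, so for multi-column pages or tables you would want to split the columns first and only then run something like this.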
Old 08-12-2023, 09:45 PM   #15
retiredbiker
Quote:
Originally Posted by michaelbr View Post
3 - copy/paste links/photos manually
Step 3 requires a lot of work if there are a lot of links/photos. Does anyone know any tools/tips to make it automatic, or at least easier than doing it manually?
PS: OS is MX Linux
If you use LibreOffice Writer to edit the text, do the links for footnotes, endnotes and so on there. They will convert nicely into epub using the Calibre conversion. I find it much easier to make the links in Writer rather than in the epub editor.

Dealing with epub images is a pain and depends on your audience...what do they use for readers? The most basic advice for physical ereaders like Kindle or Kobo is to declare the width as a percent and the height auto in the CSS...don't use absolute units. Then surround the <img...> line with a <p> or <div> to make it center/right/left. Like this:

<p class="center">
  <img alt="" class="widepic" src="../Images/c02.jpg"/>
</p>

where "widepic" says in the CSS:

.widepic {
  height: auto;
  width: 98%;
}

And "center" is just that:

.center {
  text-align: center;
  text-indent: 0;
  margin: .5em 0 .5em 0;
}

So you can pull the images into the Writer version, and apply something like this at the epub editor stage. I can usually replace what the conversion does with this using some simple regex. The result is almost always readable on various readers or apps.
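
The regex step depends entirely on what the conversion emitted, so treat the following Python as a sketch under assumptions rather than the exact expressions used. It assumes each image sits alone inside its own <p> or <div>, and it rewrites that wrapper to the "center"/"widepic" pattern shown above. The function name `restyle_images` is made up; images that share a paragraph with text are left untouched.

```python
import re

# A <p> or <div> whose only content is a single <img> tag (assumed pattern).
WRAPPED_IMG = re.compile(
    r"<(p|div)\b[^>]*>\s*(<img\b[^>]*>)\s*</\1>", re.IGNORECASE
)
SRC = re.compile(r'\bsrc\s*=\s*"([^"]*)"', re.IGNORECASE)
ALT = re.compile(r'\balt\s*=\s*"([^"]*)"', re.IGNORECASE)


def _restyle(match):
    img = match.group(2)
    src = SRC.search(img)
    if src is None:
        return match.group(0)  # no src attribute: leave the markup untouched
    alt = ALT.search(img)
    alt_text = alt.group(1) if alt else ""
    # Drop converter-generated classes and sizes; keep only src and alt.
    return (
        '<p class="center">\n'
        f'<img alt="{alt_text}" class="widepic" src="{src.group(1)}"/>\n'
        "</p>"
    )


def restyle_images(xhtml):
    """Rewrite stand-alone image wrappers to the center/widepic pattern."""
    return WRAPPED_IMG.sub(_restyle, xhtml)
```

Run it on a copy of each XHTML file and check the result in the epub editor, since a regex like this knows nothing about nesting and will not cope with unusual markup.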
Tags
epub, pdf conversion, tip, tool tips