08-03-2023, 02:35 AM | #1 |
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
|
converting PDF to ePub tips
I understand that PDF is not very friendly when converting to ePub. I've tried a few online converters (most seem to use Calibre) and the result requires a lot of work. A browser can easily open a PDF file and keep the formatting, so I'm wondering if anyone has tried converting PDF to ePub using the following method:
1 - open the PDF file in a browser (Brave in my case)
2 - copy the text from the browser and paste it into an ePub editing app (Sigil)
3 - copy/paste the links/photos manually
Step 3 requires a lot of work if there are a lot of links/photos. Does anyone know any tools/tips to make it automatic, or at least easier than doing it manually?
ps: OS is MX Linux |
08-03-2023, 12:27 PM | #2 |
Bibliophagist
Posts: 36,391
Karma: 145735554
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Michael, please note that posting multiple copies of the same message is not permitted on MobileRead. Posting the same message in this forum and the Sigil forum is unlikely to get you any extra help and is annoying to those who find themselves starting to reply before realizing that they already replied.
|
08-03-2023, 07:23 PM | #3 |
Evangelist
Posts: 483
Karma: 2267928
Join Date: Nov 2015
Device: none
|
What I found very useful was using LibreOffice Draw to convert PDF to a flat ODG document, which can be easily converted to a markup language like XML.
This is the only sure way to work around paragraphs, fonts, and page breaks. Automatic tools often fail with those. |
08-04-2023, 01:25 PM | #4 | |
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
|
Quote:
|
|
08-04-2023, 01:26 PM | #5 | |
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
|
Quote:
|
|
08-04-2023, 02:10 PM | #6 |
Wizard
Posts: 2,827
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
|
|
08-06-2023, 06:27 AM | #7 |
Guru
Posts: 668
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
|
You need to be aware that there are several different kinds of PDFs. One kind is made by scanning text, so these are sets of images of text. Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer, which allows you to search for text and copy chunks of it.
Another kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the maximum it will still be smooth.
With either of these, you can get the text cursor, select all, copy, and paste into Word (or equivalent). From there, save as "web page, filtered" and you get HTML, which you can import into Sigil or Calibre. You will then have a lot more work to clean it up. If it was OCR text, there will be lots of errors. Spellcheck will help find many. You also need to review the pagebreaks, which insert spurious paragraph breaks. Then rationalise the CSS.
If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML; some versions also do epub. And there is a command line app, "pdftohtml", which does as it says. See https://poppler.freedesktop.org/
No matter how you do it, you need to invest hours at least to clean up and check. Otherwise you will get garbage; see the awful epubs automatically created by OCR at Internet Archive. |
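For the spurious paragraph breaks that pagebreaks insert, here is a minimal sketch of one heuristic, not a definitive fix. It assumes you already have the text as a list of paragraph strings (however you extracted them), and it treats a break as spurious when the previous paragraph does not end in sentence punctuation and the next one starts with a lowercase letter. The function names are made up for illustration.

```python
# Closing quotes/brackets that may legitimately follow the final full stop.
CLOSERS = "\"'”’)]"


def ends_sentence(text):
    """True if the text ends with sentence punctuation, ignoring closing quotes."""
    stripped = text.rstrip(CLOSERS)
    return stripped.endswith((".", "!", "?", ":"))


def merge_split_paragraphs(paragraphs):
    """Join a paragraph to the previous one when the break looks spurious."""
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # drop empty paragraphs
        if merged and not ends_sentence(merged[-1]) and text[0].islower():
            # Previous paragraph stops mid-sentence and this one continues it.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


if __name__ == "__main__":
    sample = [
        "The page ended in the middle of a",
        "sentence, so the converter started a new paragraph.",
        "This one is a real paragraph.",
    ]
    for p in merge_split_paragraphs(sample):
        print(p)
```

It will miss breaks where the continuation starts with a capital letter (a name, for example), so it only reduces the manual review, it does not replace it.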
08-10-2023, 05:58 PM | #8 |
Resident Curmudgeon
Posts: 74,505
Karma: 129668758
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
The only way you can 100% be sure that the converted PDF is error free is to do a 100% A/B comparison of the PDF and the resulting ePub.
Do you want to do this? Do you have to convert PDF > ePub? |
08-12-2023, 02:24 AM | #9 |
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
|
|
08-12-2023, 02:28 AM | #10 | |
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
|
Quote:
|
|
08-12-2023, 02:30 AM | #11 |
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
|
|
08-12-2023, 04:39 AM | #12 |
the rook, bossing Never.
Posts: 11,493
Karma: 87454321
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Probably neither.
|
08-12-2023, 04:37 PM | #13 |
Bibliophagist
Posts: 36,391
Karma: 145735554
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
The tool used to do that is your Mark 1 eyeball. I suspect that if you want someone else to use their eyeballs on your behalf, it will get expensive.
You will have to bring up both the ePub and PDF on your screen and check line by line for differences. |
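A word-level diff will not replace the eyeball, but it might narrow down where to look. A rough sketch, assuming you have already extracted plain text from both the PDF and the ePub into two strings (the extraction itself is not shown, and the function names here are made up for illustration):

```python
import difflib


def normalize(text):
    """Split into words so that line wrapping and spacing do not count as differences."""
    return text.split()


def word_differences(pdf_text, epub_text):
    """Return (tag, pdf_words, epub_words) for every span where the two texts differ."""
    a = normalize(pdf_text)
    b = normalize(epub_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    pdf_text = "The modern reader\nlikes books."
    epub_text = "The modem reader likes books."
    for tag, old, new in word_differences(pdf_text, epub_text):
        print(tag, repr(old), "->", repr(new))
```

This only catches differences in the text itself. Formatting, missing images, and broken links still need a visual check.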
08-12-2023, 09:08 PM | #14 | |
Addict
Posts: 389
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
Quote:
For OCR, try OCRFeeder as a front end to tesseract. Tesseract is very accurate given a good image. I do a page at a time, defining the text area manually. It can then handle multiple columns, advertisements, "continued on page 99" and so on. OCRFeeder is very good at connecting the lines into correct paragraphs, dealing with end-of-line hyphens, and so on. It might seem slow, but this is actually the quick part of the process...you will have to proofread and correct no matter what.
pdftopng will get images out of PDFs, which works better than OCRing the PDF itself. ImageMagick can tame image files that are too large and slow down tesseract. ScanTailor Advanced and Unpaper may be useful; I find them black magic, but I use them if needed.
If you want to try to use the existing text, pdftohtml will sometimes fail while pdftotext will work. No idea why. If you use pdftotext, try the -layout option and get ready for a lot of regex to tame the spacing. |
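To give an idea of the kind of regex cleanup meant here, a minimal sketch for text saved from pdftotext in layout mode. It assumes paragraphs are separated by blank lines, which is not true of every PDF, and the function name is made up for illustration.

```python
import re


def tidy_layout_text(raw):
    """Turn layout-style text into a list of single-line paragraphs."""
    # Rejoin words hyphenated across a line break: "conver-\n   sion" -> "conversion".
    # Caveat: this also fuses real compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", raw)

    paragraphs = []
    # A blank line (possibly containing spaces) is taken as a paragraph boundary.
    for block in re.split(r"\n[ \t]*\n+", text):
        # Collapse padding spaces, line breaks and form feeds into single spaces.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs


if __name__ == "__main__":
    raw = "A line of   text with a hyphen-\n   ated word.\n\n   Second    paragraph."
    for p in tidy_layout_text(raw):
        print(p)
```

Multi-column pages and running headers will need their own patterns on top of this.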
|
08-12-2023, 09:45 PM | #15 | |
Addict
Posts: 389
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
Quote:
Dealing with epub images is a pain and depends on your audience...what do they use for readers? The most basic advice for physical ereaders like Kindle or Kobo is to declare the width as a percent and the height as auto in the CSS...don't use absolute units. Then surround the <img...> line with a <p> or <div> to make it center/right/left. Like this:
    <p class="center">
      <img alt="" class="widepic" src="../Images/c02.jpg"/>
    </p>
where "widepic" says in the CSS:
    .widepic { height: auto; width: 98%; }
And "center" is just that:
    .center { text-align: center; text-indent: 0; margin: .5em 0 .5em 0; }
So you can pull the images into the Writer version, and apply something like this at the epub editor stage. I can usually replace what the conversion does with this using some simple regex. This is almost always readable on various readers or apps. |
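One possible shape for that regex step, as a hedged sketch rather than the exact replacement described above. It rewrites every <img> tag to carry only alt, class="widepic" and src, dropping fixed width/height/style attributes. It assumes double-quoted attribute values with no ">" inside them, and it does not add the surrounding <p class="center"> wrapper. The function names are made up for illustration.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r'\bsrc\s*=\s*"([^"]*)"', re.IGNORECASE)
ALT_ATTR = re.compile(r'\balt\s*=\s*"([^"]*)"', re.IGNORECASE)


def normalize_img(match):
    """Rebuild one <img> tag, keeping only its alt text and src."""
    tag = match.group(0)
    src = SRC_ATTR.search(tag)
    if src is None:
        return tag  # no src found: leave the tag untouched
    alt = ALT_ATTR.search(tag)
    alt_text = alt.group(1) if alt else ""
    return '<img alt="{}" class="widepic" src="{}"/>'.format(alt_text, src.group(1))


def normalize_images(xhtml):
    """Apply normalize_img to every <img> tag in a chunk of XHTML."""
    return IMG_TAG.sub(normalize_img, xhtml)


if __name__ == "__main__":
    before = '<p class="center"><img src="../Images/c02.jpg" width="600" height="400"/></p>'
    print(normalize_images(before))
```

Regex on markup is fragile, so run it on a copy of the book and check the result in the editor's preview before keeping it.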
|
Tags |
epub, pdf conversion, tip, tool tips |
|