MobileRead Forums - View Single Post

Tex2002ans · 08-03-2023, 09:23 AM

Quote:

Originally Posted by michaelbr

I understand that PDF is not very friendly when converting to ePub, [...]

For PDF->EPUB, see my recent posts in:

2023: "From print to ePub - how I did it."

Quote:

Originally Posted by michaelbr

[...] and the result requires a lot of work.

Yep. It always will.

PDF is meant as a final output format, NOT as any sort of easy-to-use input format.

You might be able to create an "okay" rough ebook very quickly... especially if it's very basic like a Fiction book with only text + a few chapter headings.

But if you are dealing with anything complicated, or a crappier scan... converting PDF requires much more elbow grease.

Quote:

Originally Posted by michaelbr

1 - open PDF file in browser (Brave in my case)
2 - copy the text from browser and paste into an ePub editing app (Sigil)

No. I wouldn't touch this method with a ten-foot pole.

PDFs (especially scanned ones) have two layers. Think of them as a front/back:

Image Layer
Text Layer

The image layer is the visible stuff you can see—like the yellowed paper, dust, etc., when a book is scanned in.

The text layer is the hidden stuff—this is where the OCR + text is placed. (This is how you can search through a document for certain words, highlight sections of the book, etc.)

Quite often, what you see with your own two eyes IS NOT what the PDF's text actually says underneath.

- - -

For example, just see the Archive.org book files I linked in this post:

2020: "Optimize PDFs from archive.org for E-Ink devices"
- Compare the original Archive.org PDF/EPUB to my very-quick EPUB.

You can see the ENORMOUS difference in quality+formatting just from a few tweaks I did.

- - -

Side Note: Even if the PDF were purely digital, like a document you saved straight out of LibreOffice/Word... I probably wouldn't trust a copy/paste out of it.

Depending on how someone saved the PDF, there's a ton of nasty stuff that can occur, like:

Hard hyphens
Ligatures
Headers/Footers
Broken paragraph breaks
Paragraphs out of order

All of these get carried over, still broken, when you do a simple copy from the PDF -> paste into a program.

This is why Calibre has all those PDF settings to try to detect/undo a lot of that garbage "automatically"... but even that can't help you in many cases.
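To give a feel for what that kind of cleanup involves, here's a rough Python sketch that attacks four of the problems listed above (headers/footers, ligatures, hard hyphens, broken paragraph breaks). This is NOT what Calibre does internally. The function name `clean_pdf_text`, the `min_repeat` threshold, and the 60-character cutoff are all made up for illustration, and it assumes you already have the pasted text as one string per page:

```python
import re
from collections import Counter

# Typographic ligatures and the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_pdf_text(pages, min_repeat=3):
    """Rough cleanup of text copied out of a PDF (one string per page).

    Hypothetical helper: the heuristics and thresholds are guesses,
    not a tested recipe.
    """
    # 1. Headers/footers: find short lines that show up on many pages.
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    repeated = {ln for ln, n in counts.items()
                if n >= min_repeat and len(ln) < 60}

    lines = []
    for page in pages:
        for ln in page.splitlines():
            s = ln.strip()
            # Drop running headers and bare page numbers.
            # (This also eats any real line that is only digits.)
            if s in repeated or re.fullmatch(r"\d+", s):
                continue
            lines.append(s)
    text = "\n".join(lines)

    # 2. Ligatures: swap the single glyph back to plain letters.
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)

    # 3. Hard hyphens: "con-\nverted" -> "converted".
    #    (This also wrongly joins real hyphenated words split at a line end.)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # 4. Broken paragraph breaks: collapse runs of blank lines, then treat
    #    a single newline as a soft wrap and a blank line as a real break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


pages = [
    "My Book Title\nThe first of\ufb01ce was con-\nverted badly.\n1",
    "My Book Title\nIt continued\nhere.\n\nNew paragraph.\n2",
    "My Book Title\nThe end.\n3",
]
print(clean_pdf_text(pages))
```

Note the comments on steps 1 and 3: every one of these "fixes" can also damage good text (a real hyphenated word, a line that is just a year). Nothing here can fix paragraphs that come out in the wrong order, either. That's exactly why you still have to proofread the result by hand.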

Quote:

Originally Posted by michaelbr

the step 3 requires a lot of work if there are lot of links/photos, anyone knows any tools/tips to make it automatic or easier than manually?

Try to get the book/document in any other format, if possible. Like if you are working with an author, get the original DOCX file.

If the PDF is the only thing that exists, then you can convert it as a last resort, but know that there's no quick, easy, one-button way to do it.

With PDF, there's no avoiding it... You'll need to put in the work to get a perfect ebook out the other end!