Is there a pdf->epub option that keeps the design better

NovelFan · 11-10-2023, 02:07 PM

I noticed that there are the ePub versions and there are the PDF versions of books.

And if I convert the PDF version into the ePub version using calibre, then formatting is mostly gone except for a few bold and italic parts, also fonts are ignored, it's always the generic font. Also images are tiny and not optimized or out of place.

Compare that to an originally designed epub, how much better it looks if I have the pure ePub version.

But I have some documents as PDF.

Is there a more intelligent converter that retains design much more?

Karellen · 11-10-2023, 02:16 PM

There's not much you can do to improve pdf conversions.

This might explain why... https://www.mobileread.com/forums/sh...d.php?t=118605

NovelFan · 11-10-2023, 02:22 PM

Quote:

Originally Posted by Karellen

There's not much you can do to improve pdf conversions.

This might explain why... https://www.mobileread.com/forums/sh...d.php?t=118605

A gui to guide calibre could solve all problems. You simply mark parts as pagination, as headings, and it automatically marks all similar ones like the automatic table detection in tabula, which also works with pdf.

It just needs some guidance, after all, most authors stick to a pattern in their book, shown by formatting and frames and all you need to do is define what is what to then use similarity search for you to proof read before conversion quickly.

Indeed, as I wrote, the open-source Tabula could be of great help here.

Tex2002ans · 11-10-2023, 02:53 PM

Quote:

Originally Posted by NovelFan

And if I convert the PDF version into the ePub version using calibre, then formatting is mostly gone except for a few bold and italic parts, also fonts are ignored, it's always the generic font. Also images are tiny and not optimized or out of place.

PDF is the absolute WORST format to convert.

PDF is meant as an output-only format—not as an input into anything else.

As you can see, you get LOTS of pain and junk carried over if you try to "one-button push" convert PDFs.

To convert PDF into a proper ebook requires lots of elbow grease.

Quote:

Originally Posted by NovelFan

Is there a more intelligent converter that retains design much more?

Yes, you need to use an actual OCR program... like ABBYY Finereader.

Quote:

Originally Posted by NovelFan

A gui to guide calibre could solve all problems. You simply mark parts as pagination, as headings, and it automatically marks all similar ones like the automatic table detection in tabula, which also works with pdf.

It just needs some guidance, after all, most authors stick to a pattern in their book, shown by formatting and frames and all you need to do is define what is what to then use similarity search for you to proof read before conversion quickly.

That's exactly what Finereader (or some of the other OCR tools) does.

It automatically marks:

Sections
- Headers/Footers
- Tables/Images
- Footnotes
- [...]
Formatting
- Bold/Italics/Smallcaps
- Superscript/Subscript
- [...]
"Unsure Characters"
- In blue highlight, so you can focus on the potential error spots.

Then, it allows you to:

See a side-by-side+magnified comparison of original vs. OCRed text.
- Allows you to quickly compare and make your corrections.

If you want even more knowledge...

I've extensively explained PDF->ebook workflows over the past 12 years. Most recently a few months ago in:

2023: "From print to ePub - how I did it."

NovelFan · 11-10-2023, 03:12 PM

The books already have a text layer, I don't need OCR.

PS:
"I convert ebooks professionally."
How does one make money with that?

DNSB · 11-10-2023, 04:55 PM

Quote:

Originally Posted by NovelFan

The books already have a text layer, I don't need OCR.

PS:
"I convert ebooks professionally."
How does one make money with that?

Umm... in a lot of cases, the text layer is done by OCR to allow searching. If you extract the text layer, in >90% of the PDFs I've looked at, it is total crap and will require way too much work for me to do unpaid. Even with PDFs that are text based, the conversion tends to leave a lot of artifacts which need to be manually cleaned up. Items such as kerned letter pairs and ligatures tend to have a habit of disappearing with some conversions (suddenly pallet becomes pa et for instance).
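A minimal sketch of one small part of that cleanup, assuming the ligatures survived extraction as Unicode ligature characters (fi, fl, ff, ffi, ffl as single code points). It cannot bring back letters that were dropped outright, as in the "pa et" case. The function names are made up for illustration.

```python
import unicodedata

# Unicode "presentation form" ligatures: ff, fi, fl, ffi, ffl.
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04"


def count_ligatures(text: str) -> int:
    """Count ligature code points, to judge how affected a text layer is."""
    return sum(text.count(ch) for ch in LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature code point with its plain-letter equivalent.

    Only the listed ligatures are normalized, so other characters
    (superscripts, fractions, ...) are left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )
```

Running `count_ligatures` first gives a rough idea of whether this step is worth doing for a given PDF.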

j.p.s · 11-10-2023, 05:14 PM

Quote:

Originally Posted by NovelFan

The books already have a text layer, I don't need OCR.

But the text layer has little or no formatting or semantic information.

Quote:

PS:
"I convert ebooks professionally."
How does one make money with that?

Because it requires skill, knowledge, and a lot of work that some are willing to pay for, or need to pay for, in order for a book to be produced.

You think what you want is easy. Why don't you just go ahead and do it yourself?

Quoth · 11-11-2023, 06:18 AM

Export or copy/paste the text layer to Word/LO Writer and edit, then proof.
What Tex2002ans, Karellen, j.p.s. and DNSB write.

I actually convert a PROPERLY Styled docx to epub in Calibre without ANY editing of CSS (except images CSS after final proof of text) and then proof read / annotate on a Kobo eink.

PDFs are only a source for old PD that's only been scanned and OCRed by someone else. Madness for anything else, except piracy.

Tex2002ans · 11-11-2023, 08:16 AM

Quote:

Originally Posted by NovelFan

The books already have a text layer, I don't need OCR.

No. If you take a closer look, it's extremely likely to be:

Missing all formatting information.
- All the italics, bold, etc. So you'll only get the raw plaintext itself.
- + Many paragraph breaks, especially if they cross pages.
An OCR that happened automatically, full of typos/errors.
- As one example, see Archive.org's fully-automated "EPUBs" vs. an EPUB I created quickly using better tools/settings.

Quote:

Originally Posted by DNSB

Umm... in a lot of cases, the text layer is done by OCR to allow searching. If you extract the text layer, in >90% of the PDFs I've looked at, it is total crap and will require way too much work for me to do unpaid. Even with PDFs that are text based, the conversion tends to leave a lot of artifacts which need to be manually cleaned up. [...]

Yes, exactly. Also, they may have been run with old/obsolete OCR tools, so you'd get MUCH MORE ACCURATE text if you run it through some of the latest tools.

For example, they might have run the PDF through:

OCR PROGRAM V1 from 2008

But you can run it on:

OCR PROGRAM V12 from 2023

Much more accurate OCR means MUCH less time fixing up all the errors and junk in your exported file.

Quote:

Originally Posted by NovelFan

PS:
"I convert ebooks professionally."
How does one make money with that?

Working with authors/publishers:

Digitizing the backlog
Proper conversions of newly published books.
Cleaning/Updating old/junky conversions.
- Especially if authors get KQNs (Kindle Quality Notices).
Bringing text up to the latest standards + best practices.
- Like proper Tables code for Text-to-Speech!

I also go above and beyond:

Proofreading/Copyediting
- So however Editors make money...
Typesetting/Typography
- Creating high-quality PDFs/documents.
Maintaining the backlogs.
- + Making sure the HTML can go on their websites as articles.
Training
- So you can more efficiently use the tools you already have.

For a little more info on the general reasons why you might want a pro converting or looking over your book... I wrote these comments last year:

JSWolf · 11-11-2023, 01:07 PM

Quote:

Originally Posted by NovelFan

I noticed that there are the ePub versions and there are the PDF versions of books.

And if I convert the PDF version into the ePub version using calibre, then formatting is mostly gone except for a few bold and italic parts, also fonts are ignored, it's always the generic font. Also images are tiny and not optimized or out of place.

Compare that to an originally designed epub, how much better it looks if I have the pure ePub version.

But I have some documents as PDF.

Is there a more intelligent converter that retains design much more?

You pick a program and convert the PDF > ePub. Then you take the PDF and the ePub and A/B compare everything: all spaces, all punctuation, all text. Then when you have that sorted, you use either Sigil or calibre's editor to make sure the code and formatting look good. It's a lot of work. Regex fixes for PDF > ePub do not catch everything and could botch things. This is a manual job.
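A rough sketch of how the side-by-side check could be narrowed down, assuming you already have the PDF text and the ePub text as two plain strings (getting them out is not shown). It uses Python's standard `difflib`; the function names are hypothetical. It only points at the spots that differ, the actual fixing is still manual.

```python
import difflib
import re


def tokenize(text: str) -> list[str]:
    """Split into words, single punctuation marks, and whitespace runs."""
    return re.findall(r"\w+|[^\w\s]|\s+", text)


def list_differences(pdf_text: str, epub_text: str) -> list[tuple[str, str, str]]:
    """Return (kind, pdf_fragment, epub_fragment) for every mismatch."""
    a, b = tokenize(pdf_text), tokenize(epub_text)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    for kind, old, new in list_differences("The pallet fell.", "The pa et fell."):
        print(f"{kind}: {old!r} -> {new!r}")
```

Whitespace and punctuation are kept as their own tokens so that a lost space or a changed quote mark shows up as a difference too.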

jackie_w · 11-11-2023, 02:43 PM

Quote:

Originally Posted by Tex2002ans

No. If you take a closer look, it's extremely likely to be:

Missing all formatting information.
- All the italics, bold, etc. So you'll only get the raw plaintext itself.
- + Many paragraph breaks, especially if they cross pages.
An OCR that happened automatically, full of typos/errors.

I'm not sure I agree with these generalisations. Admittedly it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML but I was finally able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky but the text layer in my PDFs was excellent, i.e. no indication that the text was the result of auto-OCR.

However, the effort involved was huge and the best partial solution I could come up with was to create a series of self-programmed interactive "assistant" utilities to semi-automate the process. Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.

Having experienced the challenges involved first-hand, my conclusion was that I don't think it's possible to create a magic one-click solution that would work for all PDFs. I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only.

I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, e.g. font used and (x,y) position on page. Unfortunately the drawback to this was that I had to create my own logic for rearranging the text snippets into correct reading order and identifying paragraph starts/ends. The font used can help identify chapter headings, italic/bold, dropcaps, small-caps. The (x,y) position can help identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.
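To make the idea concrete, here is a minimal sketch, not the actual utilities described above. The XML shape (`page`, `text`, `top`, `left`, `font`) and the 50-unit margin are assumptions invented for the example; real extractors use their own element and attribute names. It uses (x,y) position to restore reading order and drop headers/footers, and the font name to tag italics.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML shape, invented for this sketch; real extractors differ.
SAMPLE = """<document>
  <page number="17" height="800">
    <text top="30" left="300" font="Times-Roman">My Novel</text>
    <text top="120" left="72" font="Times-Roman">second line of text.</text>
    <text top="100" left="90" font="Times-Roman">It was a </text>
    <text top="100" left="160" font="Times-Italic">very</text>
    <text top="100" left="200" font="Times-Roman"> dark night,</text>
    <text top="770" left="300" font="Times-Roman">17</text>
  </page>
</document>"""

MARGIN = 50  # assumed height of the header/footer bands, in page units


def extract_lines(xml_text):
    """Return body lines in reading order, with italic snippets tagged."""
    out = []
    for page in ET.fromstring(xml_text).iter("page"):
        height = float(page.get("height"))
        rows = {}
        for node in page.iter("text"):
            top, left = float(node.get("top")), float(node.get("left"))
            if top < MARGIN or top > height - MARGIN:
                continue  # running header, footer, or page number
            text = node.text or ""
            if "Italic" in node.get("font", ""):
                text = f"<i>{text}</i>"
            rows.setdefault(top, []).append((left, text))
        # Top-to-bottom, then left-to-right within each row.
        for top in sorted(rows):
            out.append("".join(text for _left, text in sorted(rows[top])))
    return out


if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line)
```

In the sample, the first body line starts further right (left=90) than the next one (left=72); an indent like that is the kind of hint that could mark a paragraph start, though the sketch does not act on it.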

P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.

Tex2002ans · 11-11-2023, 04:26 PM

Quote:

Originally Posted by jackie_w

I'm not sure I agree with these generalisations. Admittedly it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML but I was finally able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky but the text layer in my PDFs was excellent, i.e. no indication that the text was the result of auto-OCR.

If you're lucky enough to be working with fresh, digital documents... then they were perhaps spit directly out of Word/InDesign with the proper checkboxes pushed!

Nowadays, there's a bigger push for:

Tagged PDFs

which are a HUGE step in the right direction.

(This attaches important information—Heading/Paragraph + Bold/Italic + Headers/Footers/PageNumbers—into the PDF too, so tools like Text-to-Speech can step through the document and navigate correctly.)

Theoretically, I suspect this sort of PDF->EPUB conversion would be easier... but you'd still probably be better off going through a known toolchain, instead of trying to unravel who-knows-what-unique-garbage-is-buried-in-that-PDF.

- - -

But again, PDF is a final OUTPUT format... it's an absolutely trash INPUT format—so should only be used as a very last resort.

And, as always, if possible, it's always best to go back to the original source document (DOCX/ODT, RTF, TXT, ...) and convert from there.

- - -

Quote:

Originally Posted by jackie_w

Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.

Yep, each PDF would be like a completely new black box. :P

Similar happens when people try to do EPUB->EPUB, unraveling all the spaghetti of HTML+CSS someone created.

Most of the time, it's faster/easier to just go back to the drawing board and restart your conversion from scratch.

You could see some of that described in this post, where I explained to RbnJrg how I'd handle "surgically correcting" 20 ebooks in the same series:

2021: "Adding a limited Automate Feature To Sigil"

- - -

Side Note: Since 2021, KevinH has implemented many of those theoretical features into Sigil!

Advanced cleanup tools, but EXTREMELY powerful ways to mass fix HTML+CSS much more quickly.

- - -

Quote:

Originally Posted by jackie_w

I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only.

lol. And that's almost always where I dabble—Non-Fiction + all the hard stuff.

Right now, I'm working on an ebook with 1400 Endnotes!*

And the 2nd ebook has ~190 Figures! (Ugh, that amount of alt text generation... kill me now... lol.)

- - -

* Side Note: Speaking of Endnotes...

Does anybody here know how to wrestle Microsoft Word into outputting:

Endnotes per chapter.
An Endnotes section being placed NOT at the very very end of the document, but near-the-end.
- So imagine it'll be at page 290/300, with 10 pages of author/publisher backmatter afterwards

- - -

Quote:

Originally Posted by jackie_w

I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, [...] identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.

Hmmm... isn't that already what Calibre is doing when it's using its heuristics?

Then, you just have a spaghetti nest of Calibre-converted classes to mess with, but that would be infinitely easier than this custom PDF-exporting+parsing+converting stuff. (See the "surgical" thread/methods above.)

Quote:

Originally Posted by jackie_w

P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.

lol. Of course. And then once you tell them how much $$$ your crazy method would take...

Instead, you can just:

Toss PDF right into Finereader.
- It takes care of all that Formatting + Font Size + guessing for you.
It spits out relatively damn good DOCX/EPUB with consistent output.
- So you know its quirks + can automate a lot of the fixes.

And boom, with so much less work, now, you can layer the book's unique quirks on top of THAT BASE document.

Easier to take that clean-but-not-quite-correct text and:

Manually correct the OCR errors.
+ Layer your Headings / Chapter Breaks / Tables on top of that.

than to:

Create some custom PDF-conversion thing for every book.
Spend hours manually mapping all the fonts/bold/italics/whatever.
Disentangling every "treasure" of wrenches in that PDF that weren't found anywhere else...
- Like every word being wrapped in its own <span>s + miles and miles of near-duplicate-but-slightly-different CSS.
[...]

Learn from me + Hitch—the pros—lol.

Trying to convert from PDF like that is a dark, dark, path!

DNSB · 11-11-2023, 06:19 PM

The last time I converted more than one PDF in a batch was a paying gig where an author had gotten rights to her books back but the only copies she had were PDFs the publisher had sent her years back. I did the conversion to docx with basic cleanup. She then pulled them into Word and did the edits and more cleanup before sending back to me for checking formatting before republishing them.

I suspect I undercharged her but my wife loved her books.

jackie_w · 11-11-2023, 08:11 PM

Quote:

Originally Posted by Tex2002ans

Hmmm... isn't that already what Calibre is doing when it's using its heuristics?

Possibly, yes, I think calibre's pdftohtml Poppler utility does have an XML output option, but it's not the utility I ended up using.

IIRC, when I originally experimented with calibre's PDF to EPUB conversion I had difficulties with the first couple of PDFs I tried. One of them failed to retain italics, the other did detect italics but failed to retain scenebreaks. For both of them, trying to remove PDF headers/footers via the convert-search/replace option was a PITA. Maybe I was just unlucky with my choice of PDFs but all 3 of those problems were showstoppers for me so I didn't pursue one-click PDF conversion any further. This was over 10 years ago, maybe it's better now

... but based on the OP's first post, maybe not.
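For the header/footer problem specifically, a small sketch of one alternative to search/replace, assuming the text is already split into pages and each page into lines (that step is not shown, and the function name is made up). It drops first/last lines that repeat across pages, plus bare page numbers at the page edges. A body line that consists only of digits and happens to sit at the top or bottom of a page would be dropped too, so the result still needs checking.

```python
from collections import Counter


def strip_running_headers(pages, min_share=0.5):
    """Remove repeated edge lines and bare page numbers from each page.

    pages: list of pages, each page a list of text lines.
    min_share: fraction of pages an edge line must appear on to count
               as a running header/footer (an assumed threshold).
    """
    edge_counts = Counter()
    for lines in pages:
        if lines:
            for line in {lines[0].strip(), lines[-1].strip()}:
                edge_counts[line] += 1
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in edge_counts.items() if n >= threshold}

    def is_furniture(line):
        line = line.strip()
        return line in repeated or line.isdigit()

    cleaned = []
    for lines in pages:
        kept = list(lines)
        while kept and is_furniture(kept[0]):
            kept.pop(0)
        while kept and is_furniture(kept[-1]):
            kept.pop()
        cleaned.append(kept)
    return cleaned
```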

Quote:

Originally Posted by Tex2002ans

Trying to convert from PDF like that is a dark, dark, path!

Yes, but it is a well-trodden path which is already laid

No point re-inventing the wheel if it works well enough for the occasional new PDF as I only convert for personal use or sometimes as a favour for a friend. FWIW the main reason for my original post was to say that "where there's a will there's a way" but hoping for a one-size-fits-all "magic button" is likely to end in disappointment.

Tex2002ans · 11-12-2023, 02:07 AM

Quote:

Originally Posted by DNSB

The last time I converted more than one PDF in a batch was a paying gig where an author had gotten rights to her books back but the only copies she had were PDFs the publisher had sent her years back. I did the conversion to docx with basic cleanup. She then pulled them into Word and did the edits and more cleanup before sending back to me for checking formatting before republishing them.

I suspect I undercharged her but my wife loved her books.

Nice.

Yes, a lot of the work I do is also where the original files are completely lost. Think 1990s or 2000s digital publishing. The author/publisher might not even HAVE the original source files anymore... so the PDF (or physical book) is the only file left.

People/organizations are very bad at backing up important files.

For example, see video games from before 2000:

Ars Technica: "Saving video gaming’s source code treasures before it’s too late" (January 5, 2021)

Quote:

Originally Posted by jackie_w

FWIW the main reason for my original post was to say that "where there's a will there's a way" but hoping for a one-size-fits-all "magic button" is likely to end in disappointment.

Yep. Full agree on that.
