Old 04-06-2019, 11:59 AM   #1
Ubiquity
Member
Posts: 17
Karma: 10
Join Date: Apr 2019
Device: Android phone
Most accurate method to convert PDF -> ePub

Hello, could you please suggest the best way to convert a retail PDF ebook (i.e. not scanned pages stored as bitmaps, but true selectable text) to ePub? I tried Calibre's built-in converter, but the result is not satisfactory: the resulting ebook contains nonsense characters in places and seems to be missing parts of the text. I also tried ABBYY PDF Transformer, but the result was even worse. I need a readable ebook that reflows the text instead of keeping fixed pages and preserves the original text formatting and structure as closely as possible.
Old 04-06-2019, 05:20 PM   #2
BetterRed
null operator (he/him)
Posts: 20,570
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
In my experience there's no single most accurate method; one invariably has to do some post conversion editing.

Some folks are prepared to spend time tweaking and trying different converters to find the best, yet still imperfect, settings for individual PDFs.

Others use one or two methods to do 'rough and ready' conversions and then deal with issues such as broken paragraphs, botched ligatures, etc. in the output, using saved searches, EPUB editor plugins, add-ons for Word, and so on. I'm one of these.
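
As an illustration of the ligature cleanup mentioned above, here is a minimal Python sketch. It is not taken from anyone's workflow in this thread, and it only covers the case where the ligature code points (such as U+FB01) survive extraction intact; ligatures that were turned into garbage characters still need a hand-built search and replace. The function name `expand_ligatures` is made up for this example.

```python
# Presentation-form ligatures that PDF text extraction often leaves behind.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# Build the translation table once; keys are single characters,
# values may be longer strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature code points with their plain-letter spelling."""
    return text.translate(_TABLE)
```

Running this over the extracted text before any other cleanup keeps later searches (and spell checking) from tripping over words like "efficient" that contain a single ligature glyph.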

BR
Old 04-10-2019, 08:41 AM   #3
willus
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
My page on the topic.
Old 04-10-2019, 07:52 PM   #4
Pablo
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
Quote:
Originally Posted by willus View Post
My page on the topic.
Interesting page! Thank you.
Old 04-10-2019, 07:56 PM   #5
Pablo
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
My usual workflow:

1. Crop with Briss to do away with headers and footers, if necessary.
2. Convert with Mobipocket Creator.
3. Load the HTML with Sigil.
4. Edit during endless hours.

On willus' page you can find a link to the Briss download page. The Mobipocket Creator site is down for good, but I suppose you can get the installer somewhere else if you look hard enough.

I have yet to try loading a PDF into Word to see how it converts, as willus suggests.

Problems normally found when converting pdf files:

1. Broken paragraphs, especially at page breaks.
2. Lost scene changes.
3. Lost formatting: italics, bold, font size changes.
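
For the first problem in the list above, broken paragraphs, a rough heuristic can be sketched in Python. This is only an assumption about what works for ordinary English prose, not a tool anyone in this thread uses: a paragraph that does not end in closing punctuation, followed by one that starts with a lowercase letter, is treated as a single paragraph split by a page break. The name `rejoin_paragraphs` and the set of end characters are invented for the example.

```python
# Characters that plausibly end a paragraph (assumption: English prose).
END_CHARS = ".!?…:\"'”’)]"


def rejoin_paragraphs(paragraphs):
    """Merge paragraphs that look like they were split mid-sentence."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty fragments
        previous_open = bool(merged) and merged[-1][-1] not in END_CHARS
        if previous_open and para[0].islower():
            # Previous paragraph has no closing punctuation and this one
            # starts in lowercase: treat it as a continuation.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```

It will miss breaks that fall before a capitalised word (a name, for instance), so the result still needs checking against the PDF.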

Good luck!

Last edited by Pablo; 04-10-2019 at 08:04 PM.
Old 04-11-2019, 06:58 PM   #6
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
One of the great things about converting to Word, either by importing into Word or by using Acrobat (which can now export to Word), is that you are left with a file that is easy to edit and whose styles become CSS automatically when it is converted to ePub. Word itself is page-oriented, so it is easy to compare against the PDF, and yet Word reflows text easily as well.

Dale
Old 04-14-2019, 02:17 PM   #7
Hitch
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by DaleDe View Post
One of the great things about converting to Word, either by importing into Word or by using Acrobat (which can now export to Word), is that you are left with a file that is easy to edit and whose styles become CSS automatically when it is converted to ePub. Word itself is page-oriented, so it is easy to compare against the PDF, and yet Word reflows text easily as well.

Dale
BUT...if that really worked, commercial firms like mine would do that. And we don't.

To this day, we still use ABBYY FineReader, and "convert" the PDF via scanning & OCR. Now...that's a lot of time, effort and money to spend, if "save as Word" genuinely worked worth a damn.

The cruft that's created underneath by ANY "save as Word" feature, or by one of the ubiquitous websites that all pretty much use Calibre's API, is mind-boggling.

In short, IMHO, there's no "good" way to convert a PDF to ePUB. It's a lot of steps and a lot of work.

Hitch
Old 04-14-2019, 03:11 PM   #8
Tarana
Wizard
Posts: 3,980
Karma: 38840460
Join Date: Sep 2012
Location: Minneapolis
Device: PWSE, Voyage, K3, HDX, KBasic 7 & 8, Nook Glo3, Echos, Nanos
I just have to say that I'm really surprised that ABBYY Transformer did a crappy job. I use that all the time to convert pdf files created from scanned books (I can't read paper books any longer) which are nearly all novels. Is what you are converting a textbook or something with lots of graphics? Cookbooks often don't convert well, for instance with all those fractions and formatting complexities.
Old 04-24-2019, 02:54 PM   #9
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
As far as I know, there is only one real method. OCR (preferably with ABBYY), edit the outcome to the fullest, either in a word processor or as ePUB, and correct *ALL* OCR errors, including missing commas and the like. Proofread the book at least 3 times and then you will have caught most errors.

Now, my Word add-in can help you a lot in catching OCR errors, but for sure not all (unless the source is superb, which it never is).
Old 04-25-2019, 09:19 AM   #10
Hitch
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by Toxaris View Post
As far as I know, there is only one real method. OCR (preferably with ABBYY), edit the outcome to the fullest, either in a word processor or as ePUB, and correct *ALL* OCR errors, including missing commas and the like. Proofread the book at least 3 times and then you will have caught most errors.

Now, my Word add-in can help you a lot in catching OCR errors, but for sure not all (unless the source is superb, which it never is).
And that is exactly right. We've NEVER found a better way, and God knows, I wish we could. It's onerous to have to do that for every bloody book that shows up at my shop, in PDF, but...there is no faster, easier, and MORE ACCURATE way than that. To get all three, you're stuck with scanning & OCR.

Hitch
Old 04-25-2019, 10:44 AM   #11
JSWolf
Resident Curmudgeon
Posts: 73,983
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
There is only one accurate way to convert a novel-sized PDF > ePub. That is to convert it however you want. Here comes the accurate part. You have to compare the PDF to the ePub. You have to compare every space, every punctuation mark, every word, everything to make sure the ePub matches the PDF. That's the most accurate way of converting a PDF. There is no automated way to do it. You have to do the comparing no matter how you convert.
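
A script cannot do that comparison for you, but it can point at where two texts disagree. The following Python sketch assumes you have already extracted plain text from both the PDF and the ePub by some other means (that step is not shown), and uses the standard-library `difflib` module to list the word runs that differ. The function name `word_differences` is made up for this example. Because it splits on whitespace, it will not catch spacing differences, and it cannot tell which side is correct, so every reported difference still has to be checked by eye.

```python
import difflib


def word_differences(pdf_text, epub_text):
    """Return (tag, pdf_words, epub_words) for every run of words that differs."""
    a = pdf_text.split()
    b = epub_text.split()
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent items in long sequences (common words such as "the").
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs
```

Comparing chapter by chapter rather than the whole book at once keeps the run time and the list of differences manageable.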
Old 04-25-2019, 04:51 PM   #12
Hitch
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by JSWolf View Post
There is only one accurate way to convert a novel-sized PDF > ePub. That is to convert it however you want. Here comes the accurate part. You have to compare the PDF to the ePub. You have to compare every space, every punctuation mark, every word, everything to make sure the ePub matches the PDF. That's the most accurate way of converting a PDF. There is no automated way to do it. You have to do the comparing no matter how you convert.
Wolfie:

Yes, pretty much, although we do our comparing BEFORE we make the ePUB. SSDD, though. The comparison has to be done, you are absolutely right about that.

Hitch
Old 04-29-2019, 01:28 PM   #13
lumpynose
Wizard
Posts: 1,086
Karma: 6719822
Join Date: Jul 2012
Device: Palm Pilot M105
Quote:
Originally Posted by willus View Post
My page on the topic.
Thanks for the link to Sumatra. I used to install an alternative to Adobe's Reader but never found one I really liked.

The current size of Adobe Reader DC on Windows 64 is 306 MB. I had no idea it was so bloated. (Not that it really matters these days with today's drive sizes, but let's get serious.)
Old 04-30-2019, 12:19 PM   #14
Hitch
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
I wanted to add an interesting experience here, that I just dealt with, yesterday.

A longish story as short as possible--I was contacted by a prospective client, who had typed his Opus on a Brother dedicated WP, back in the mid-'90s. He had it in print, so he had that scanned, a few years back. Of course, it turned out that the PDF is an image-only PDF, no text layer.

I'd started to tell the client that we'd have to OCR it again, yadda, but...on a flyer, I ran it through OCR, in Acrobat Pro. Then, I exported the new PDF, which now had a text layer, to Word.

And I'll be damned, but the resulting file is NOT horrible. I mean, with a modicum of cleanup--not beyond the regular person--it could be entirely usable. I was pretty gobsmacked because the source PDF was not wonderful. It wasn't the worst I've ever seen (a scanned copy of a multiply-faxed document--that was the worst), but it wasn't crisp, either, and the pages were not wildly straight. But it worked, and the resulting Word file was not bad at all.

So...there are, sometimes, shortcuts to the ABBYY scan/OCR process that can work. I would not have ever thought that they existed; in a decade, I've never seen it work before, but it did this time. I would then suggest that you at least try the shortcut methods, to see if you can pull one out of the hat, too. It's worth the 5-10 minutes of time, compared to what the longer routes take.

Hitch
Old 07-11-2020, 07:14 PM   #15
democrite
Evangelist
Posts: 425
Karma: 77256
Join Date: Sep 2011
Device: none
So I wonder who has tried more or less every program to export PDF to Word/HTML and compared results? Of course, corrections are a lot of work, but I am looking to minimize certain things:

1. diacritics. Some publishers use combining marks, and so far I've found only Acrobat handles them correctly. The best of the commercial PDF -> Word converters I've tried insert spaces in such words.

2. hyphenation - PDF2Office is the only app I know of that can attempt to remove hyphenation with a dictionary.

3. paragraph spacing. Let's say some paragraph style has some spacing above, such as 1 or half a line. In every export I can remember, the resulting margin settings vary: e.g., a paragraph that has half a line of space above might get a top margin of 5pt, 6pt, or anything in between or close. I can end up with a zillion unique paragraph styles, making it difficult to fix. Even Acrobat's HTML and Word exports give different results, with HTML using whole numbers in inline CSS, making it somewhat easier to correct.

Struggling with this now, as I have a reference work I'm trying to export, formatted similarly to a dictionary, with a set amount of space above each entry. Yet there are also many other paragraphs with the same amount of space above, so using some regex such as "new entry begins with bold-italic" isn't reliable. Acrobat too can sometimes make errors and give such paragraphs no top margin.

4. columns. I haven't tried this in a while, but perhaps some apps might not reliably separate left and right margins, mixing them together by line from top to bottom, left to right. In the PDF I'm currently trying to convert, the page headers carry the entry name and page number, and I'd like to use regex to turn them into EPUB 3 page numbers. Yet sometimes the left part of the header is inserted correctly at the top of the page while the right part, with the page number, is inserted at the top of the right margin, making it useless.
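
For the paragraph-spacing problem in point 3, one possible cleanup is to snap every exported top margin to the nearest value from a short list of intended spacings, so that 5pt, 5.5pt and 6pt all collapse into one style. The Python sketch below assumes the export is HTML with inline CSS using `margin-top` in points, as described for Acrobat's HTML export; the allowed values (0, 6 and 12pt) and the name `snap_margin_top` are assumptions made up for the example and would have to match the actual book.

```python
import re

# Matches inline declarations such as "margin-top: 5.5pt".
_MARGIN = re.compile(r"(margin-top\s*:\s*)(\d+(?:\.\d+)?)pt")


def snap_margin_top(html, allowed=(0.0, 6.0, 12.0)):
    """Replace each margin-top value with the nearest allowed value."""
    def repl(match):
        value = float(match.group(2))
        nearest = min(allowed, key=lambda a: abs(a - value))
        # ":g" drops a trailing ".0", so 6.0 is written as "6".
        return f"{match.group(1)}{nearest:g}pt"
    return _MARGIN.sub(repl, html)
```

This does not fix paragraphs that were wrongly exported with no top margin at all; those would snap to 0 and still have to be found by hand.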

I haven't recently tried various other commercial PDF apps such as Nitro, Phantom, etc. Maybe those might have other issues. I suppose with each PDF, one perhaps must try them all and see which is the best in each case.
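
On the hyphenation point (2), dictionary-based removal can also be approximated after export. This Python sketch is not how PDF2Office does it, which I don't know; it is a minimal illustration that assumes the exported text still has a line break after each end-of-line hyphen, and that you supply your own set of known lowercase words. The name `dehyphenate` is invented for the example.

```python
import re

# A word fragment, a hyphen, a line break, then the rest of the word.
_HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text, known_words):
    """Join words split across lines when the joined form is a known word."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined  # e.g. "conver-" + "sion" -> "conversion"
        # Not a known word: keep the hyphen, drop only the line break.
        return match.group(1) + "-" + match.group(2)
    return _HYPHEN_BREAK.sub(repl, text)
```

Genuinely hyphenated compounds that happen to break at the hyphen are kept as long as the joined form is not in the word list, so the quality depends entirely on the dictionary used.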
All times are GMT -4.