04-06-2019, 11:59 AM | #1 |
Member
Posts: 17
Karma: 10
Join Date: Apr 2019
Device: Android phone
|
Most accurate method to convert PDF -> ePub
Hello, could you please suggest the best way to convert a retail PDF ebook (i.e. not scanned pages as bitmaps, but true selectable text) to ePub? I tried Calibre's internal converter, but it doesn't produce a satisfactory result (the resulting ebook contains nonsense characters in places and seems to be missing parts of the text). I also tried ABBYY PDF Transformer, but the result is even worse. Of course, I need a readable ebook that reflows across pages and preserves the original text formatting/structure as closely as possible.
|
04-06-2019, 05:20 PM | #2 |
null operator (he/him)
Posts: 20,570
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
In my experience there's no single most accurate method; one invariably has to do some post conversion editing.
Some folks are prepared to spend time tweaking and trying different converters to find the best, yet still imperfect, settings for individual PDFs. Others use one or two methods to do 'rough and ready' conversions and then deal with the issues in the output, such as broken paragraphs, botched ligatures, etc., using saved searches, epub editor plugins, add-ons for Word, etc. I'm one of these. BR |
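As an illustration of the "botched ligatures" cleanup mentioned above, here is a minimal Python sketch. It assumes the ligatures survived extraction as Unicode presentation-form characters (U+FB00 to U+FB06); if the converter produced arbitrary nonsense characters instead, a per-book replacement table would be needed. The function name `expand_ligatures` is made up for this example.

```python
# Unicode presentation-form ligatures that PDF text extraction often leaves behind.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long-s + t ligature
    "\ufb06": "st",
}

# Build the translation table once; each single ligature maps to plain letters.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature code points with their plain-letter equivalents."""
    return text.translate(_LIGATURE_TABLE)
```

The same effect can usually be had with a saved search-and-replace in an epub editor; the script form is only handy when many files need the same treatment.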
04-10-2019, 07:56 PM | #5 |
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
|
My usual workflow:
1. Crop with Briss to do away with headers and footers, if necessary.
2. Convert with Mobipocket Creator.
3. Load the HTML with Sigil.
4. Edit for endless hours.
On willus' page you can find links to the Briss download page. The Mobipocket Creator site is down for good, but I suppose you can get the installer somewhere else if you look hard enough. I have yet to try loading a PDF with Word to see how it converts, as willus suggests.
Problems normally found when converting PDF files:
1. Broken paragraphs, especially at page breaks.
2. Lost scene changes.
3. Lost formatting: italics, bold, font size changes.
Good luck!
Last edited by Pablo; 04-10-2019 at 08:04 PM. |
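For the "broken paragraphs" problem listed above, a rough heuristic can take care of the obvious cases before the manual editing pass. The sketch below is an assumption, not part of any of the tools named: it takes a list of paragraph strings and merges a paragraph into the previous one when the previous one does not end in sentence punctuation and the next one starts with a lowercase letter. It will miss some breaks and can wrongly join others (e.g. verse), so the result still needs checking.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes or brackets.
_ENDS_SENTENCE = re.compile(r"[.!?:\u2026][\"'\u201d\u2019)\]]*$")


def join_broken_paragraphs(paragraphs):
    """Merge paragraphs that look like they were split in mid-sentence."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty paragraphs left over from page breaks
        previous_unfinished = bool(merged) and not _ENDS_SENTENCE.search(merged[-1])
        if previous_unfinished and para[0].islower():
            # Previous paragraph stops mid-sentence and this one continues it.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```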
04-11-2019, 06:58 PM | #6 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
One of the great things about converting to Word, either by importing into Word or by using Acrobat (which can now convert to Word), is that you are left with a file that is easy to edit, and its styles are automatically turned into CSS when it is converted to ePub. Word itself is page-oriented, so it is easy to compare against the PDF, and yet the text in Word reflows easily as well.
Dale |
04-14-2019, 02:17 PM | #7 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
To this day, we still use ABBYY FineReader, and "convert" the PDF via scanning & OCR. Now...that's a lot of time, effort and money to spend, if "save as Word" genuinely worked worth a damn. The cruft underneath that's created, using ANY "save as Word" or one of the ubiquitous websites, that all pretty much use Calibre's API, is mind-boggling. In short, IMHO, there's no "good" way to convert a PDF to ePUB. It's a lot of steps and a lot of work. Hitch |
|
04-14-2019, 03:11 PM | #8 |
Wizard
Posts: 3,980
Karma: 38840460
Join Date: Sep 2012
Location: Minneapolis
Device: PWSE, Voyage, K3, HDX, KBasic 7 & 8, Nook Glo3, Echos, Nanos
|
I just have to say that I'm really surprised that ABBYY Transformer did a crappy job. I use that all the time to convert pdf files created from scanned books (I can't read paper books any longer) which are nearly all novels. Is what you are converting a textbook or something with lots of graphics? Cookbooks often don't convert well, for instance with all those fractions and formatting complexities.
|
04-24-2019, 02:54 PM | #9 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
As far as I know, there is only one real method: OCR (preferably with ABBYY), then edit the outcome to the fullest, either in a word processor or as ePUB, and correct *ALL* OCR errors, including missing commas and the like. Proofread the book at least 3 times and then you will have caught most errors.
Now, my Word add-in can help you a lot in catching OCR errors, but for sure not all (unless the source is superb, which it never is). |
04-25-2019, 09:19 AM | #10 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Hitch |
|
04-25-2019, 10:44 AM | #11 |
Resident Curmudgeon
Posts: 73,983
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
There is only one accurate way to convert a novel-sized PDF > ePub. That is to convert it however you want. Here comes the accurate part. You have to compare the PDF to the ePub. You have to compare every space, every punctuation mark, every word, everything to make sure the ePub matches the PDF. That's the most accurate way of converting a PDF. There is no automated way to do it. You have to do the comparing no matter how you convert.
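A script cannot replace that comparison, but it can point at where the two texts disagree. The Python sketch below is only an aid under stated assumptions: the plain text of the PDF and of the ePub has already been extracted by some other tool, and whitespace differences are ignored because the texts are split into words (so the "every space" part is still manual). The function name `word_differences` is invented for this example.

```python
import difflib


def word_differences(pdf_text, epub_text):
    """Return (tag, pdf_words, epub_words) for every span where the texts differ."""
    pdf_words = pdf_text.split()
    epub_words = epub_text.split()
    # autojunk=False keeps very common words from being ignored on long texts.
    matcher = difflib.SequenceMatcher(a=pdf_words, b=epub_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, " ".join(pdf_words[i1:i2]), " ".join(epub_words[j1:j2]))
            )
    return differences
```

Each reported span is a place to look at in both files. On a novel-sized text the comparison can be slow, so running it chapter by chapter is probably more practical.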
|
04-25-2019, 04:51 PM | #12 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Yes, pretty much, although we do our comparing BEFORE we make the ePUB. SSDD, though. The comparison has to be done, you are absolutely right about that. Hitch |
|
04-29-2019, 01:28 PM | #13 | |
Wizard
Posts: 1,086
Karma: 6719822
Join Date: Jul 2012
Device: Palm Pilot M105
|
Quote:
The current size of Adobe Reader DC on Windows 64 is 306 MB. I had no idea it was so bloated. (Not that it really matters these days with today's drive sizes, but let's get serious.) |
|
04-30-2019, 12:19 PM | #14 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
I wanted to add an interesting experience here, that I just dealt with, yesterday.
A longish story as short as possible--I was contacted by a prospective client, who had typed his Opus on a Brother dedicated WP, back in the mid-'90s. Had it in print, so had that scanned, a few years back. Of course, it turned out that the PDF is an image-only PDF, no text layer.
I'd started to tell the client that we'd have to OCR it again, yadda, but...on a flyer, I ran it through OCR, in Acrobat Pro. Then, I exported the new PDF, which now had a text layer, to Word. And I'll be damned, but the resulting file is NOT horrible. I mean, with a modicum of cleanup--not beyond the regular person--it could be entirely usable.
I was pretty gobsmacked because the source PDF was not wonderful. It wasn't the worst I've ever seen (a scanned copy of a multiply-faxed document--that was the worst), but it wasn't crisp, either, and the pages were not wildly straight. But it worked, and the resulting Word file was not bad at all.
So...there are, sometimes, shortcuts to the Abbyy scan/OCR process that can work. I would not have ever thought that they existed; in a decade, I've never seen it work before, but it did this time. I would then suggest that you at least try the shortcut methods, to see if you can pull one out of the hat, too. It's worth the 5-10 minutes of time, compared to what the longer routes take.
Hitch |
07-11-2020, 07:14 PM | #15 |
Evangelist
Posts: 425
Karma: 77256
Join Date: Sep 2011
Device: none
|
So I wonder who has tried more or less every program to export PDF to Word/HTML and compared results? Of course, corrections are a lot of work, but I am looking to minimize certain things:
1. Diacritics. Some publishers use combining marks, and so far I've found only Acrobat handles them correctly. The best of the commercial PDF -> Word converters I've tried insert spaces in such words.
2. Hyphenation. PDF2Office is the only app I know of that can attempt to remove hyphenation with a dictionary.
3. Paragraph spacing. Let's say some paragraph style has some spacing above, such as 1 or half a line. As far as I can remember, the export results so far all vary in their margin settings, e.g., a paragraph that has half a line of space above might have a top-margin of 5pt, 6pt or anything in between or close. I can end up with a zillion unique paragraph styles, making it difficult to fix. Even Acrobat's HTML and Word exports give different results, with HTML using whole numbers in inline CSS, making it somewhat easier to correct. I'm struggling with this now, as I have a reference work I'm trying to export, formatted similarly to a dictionary, with a set amount of space above each entry. Yet there are also many other paragraphs with the same amount of space above, so using a regex such as "a new entry begins with bold-italic" isn't reliable. Acrobat too can sometimes make errors and give such paragraphs no top margin.
4. Columns. I haven't tried this in a while, but perhaps some apps might not reliably separate left and right margins, mixing them together by line from top to bottom, left to right. In the PDF I'm currently trying to convert, I'd like to use regex to turn the top page headers (entry name and page number) into EPUB 3 page numbers, yet sometimes the left part of the header is inserted correctly at the top of the page while the right part, with the page number, is inserted at the top of the right margin, making it useless.
I haven't recently tried various other commercial PDF apps such as Nitro, Phantom, etc. Maybe those might have other issues. I suppose with each PDF, one perhaps must try them all and see which is the best in each case. |
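Some of the cleanup in points 1 to 3 above can be scripted after export. The Python sketch below is a set of hedged helpers, not what Acrobat or PDF2Office actually do: `compose_diacritics` applies Unicode NFC, which only helps when the combining mark directly follows its base letter (a stray space inserted by the converter has to be removed first); `dehyphenate` joins a word split across a line break only if the joined form is in a word list you supply; and `snap_margin` rounds a measured top margin to the nearest of a few intended values, so 5pt and 5.4pt both become 6pt. All names and the 1pt tolerance are assumptions for illustration.

```python
import re
import unicodedata


def compose_diacritics(text):
    """Fold base letter + combining mark into one precomposed character (NFC)."""
    return unicodedata.normalize("NFC", text)


def dehyphenate(text, known_words):
    """Join 'exam-<newline>ple' into 'example' only if the result is a known word."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined
        # Not in the word list: keep the hyphen but drop the line break.
        return match.group(1) + "-" + match.group(2)

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", fix, text)


def snap_margin(value_pt, allowed_pt, tolerance_pt=1.0):
    """Round a margin to the nearest allowed value if it is within the tolerance."""
    nearest = min(allowed_pt, key=lambda allowed: abs(allowed - value_pt))
    return nearest if abs(nearest - value_pt) <= tolerance_pt else value_pt
```

Snapping margins this way cuts down the number of distinct paragraph styles, but it cannot restore a margin that the export dropped entirely; those paragraphs still have to be found by hand.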
|