Desperately seeking.... advice on epub conversion?


Direct Ebooks
10-28-2009, 02:42 PM
Hi Guys.

I am interested in converting PDF files into ebooks. I am currently outsourcing this work, which is proving costly, although to be fair the quality is excellent.

I would like to be able to do this work myself, but I'm unsure what the best method is, and I don't want to start down one path only to find out another is better. I'm asking for your advice and experiences.

I want to produce professional-quality ePubs, fully indexed etc. I'm very familiar with all the technical issues, but should I use one of these automated packages or start at the beginning with XHTML/CSS?
Or is Adobe InDesign the best overall tool?

I'd appreciate all your comments and feedback

HarryT
10-28-2009, 02:45 PM
Moved to the appropriate forum section.

charleski
10-29-2009, 01:27 AM
I've tried various methods, but the one that has proved most successful for me is to edit the text in Word (making proper use of styles etc) and then use Atlantis to generate the ePub file. You could use Atlantis for everything, but I find Word 2007 easier to use for editing (mostly because I'm used to it and I have it anyway). The $35 for Atlantis is very reasonable for what it offers. The advantage to this is that I can edit the text in a word processor and don't have to guess what it will look like or fiddle with xml to set things up properly.

I have access to InDesign CS4 and have tried it for ePubs, and frankly it's inferior to Atlantis: you need to split the book into separate documents yourself to ensure that no file goes over the mobileADE 300k limit, which is just the sort of extra hassle I can do without. InDesign does offer more flexibility with ToC generation, has options for image manipulation and makes it easy to embed fonts in the ePub, but none of these justify the extra effort involved unless you have special needs.

I'm sure you can get excellent results converting rtf files in calibre as well, which has the benefit of being free. For best results you'll probably want to tweak the css settings and XPath options for the ToC etc. There are also a few free add-ons to Word floating around that might be worth checking out, though they tend to enforce their own particular notions and can be fiddly (hard to moan when they're free though).
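For anyone who prefers to script the calibre route, here is a rough Python sketch that builds a command line for calibre's `ebook-convert` tool. It assumes a calibre install with `ebook-convert` on the PATH; the file names, the default XPath and the function names `build_convert_command` and `convert` are only illustrations, not anything from this thread.

```python
import subprocess

def build_convert_command(src, dest, extra_css=None, level1_toc="//h:h1"):
    """Build an ebook-convert command line (hypothetical helper).

    level1_toc is an XPath expression selecting top-level ToC entries;
    extra_css is a path to a CSS file or a string of raw CSS.
    """
    cmd = ["ebook-convert", src, dest, "--level1-toc", level1_toc]
    if extra_css:
        cmd += ["--extra-css", extra_css]
    return cmd

def convert(src, dest, **options):
    """Run the conversion; requires calibre to be installed."""
    subprocess.run(build_convert_command(src, dest, **options), check=True)

# Example (not run here): convert("book.rtf", "book.epub", extra_css="book.css")
```

You would still want to open the result in a reader and check the ToC and styling by eye.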

One thing you need to realise is that PDFs are a real pain to convert. Very few are fully tagged, meaning that you need to scan through the text to correct broken paragraphs and incorrectly inserted line breaks or hyphens (I use a Word wildcard search for paragraphs, Find: ([!."\?\!\)])^13 Replace:\1 though you still need to check each instance). Each document will offer its own variation of the particular problems you can run into. I'm afraid there is no 1-click solution. Converting a PDF can easily take a couple of hours, or much more, depending on how much you need to reconform the text. A lot depends on how much variation there is in your text and how much of that you want to preserve in the finished item. There are various options for saving the PDF as a docx file for editing. I happen to use Nuance PDF Converter, which generally does a decent job of stripping out headers and footers, though it can still trip up at times.
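For those working on plain text rather than in Word, the same idea as the wildcard search can be sketched in Python. This is only an approximation under a few assumptions of mine: paragraphs are separated by blank lines, a line break after anything other than . ! ? " ) is a suspect mid-paragraph break, and the break is replaced with a space. The name `join_broken_paragraphs` is made up for the example.

```python
import re

# A character that does not normally end a paragraph, followed by a single
# line break (a blank line after it is assumed to be a real paragraph gap).
BROKEN_BREAK = re.compile(r'([^.!?")\n])\n(?!\n)')

def join_broken_paragraphs(text: str) -> str:
    """Replace suspect mid-paragraph line breaks with a space."""
    return BROKEN_BREAK.sub(r"\1 ", text)

sample = "The quick brown\nfox jumped.\nNext para starts\nhere.\n\nTitle\n\nBody"
print(join_broken_paragraphs(sample))
```

As with the Word search, every change still needs checking by eye: headings, verse and list items will be joined incorrectly, and words hyphenated at line ends are not handled here.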

nomesque
10-29-2009, 01:32 AM
Do you have the source files for the PDFs?

Direct Ebooks
10-29-2009, 06:02 AM
Hi Guys.

Thanks for the advice.
Publishers present me with PDF/Quark files of their books and I then outsource their conversion to ePub.
The biggest challenge, I feel, will be converting the PDFs back to a Word/text file.
Are any of the auto programs any good for this?

Thanks again

WillAdams
10-29-2009, 08:11 AM
Don't start from the .pdfs --- instead use the Quark source.

Dump to XPress Tags or .html or some other sort of tagged format, then massage that, adding back in anything which wasn't in the main text flow (or get a specialized XTension/utility such as textractor).

PDFs convert the formatting into localized text changes and positional information which is difficult to extract. If you must use a .pdf as a source, use a utility such as Marcel Weiher's TextLightning.app which will analyze that positional information and then allow you to use global search-replace techniques to convert the local-formatting into proper styles.

William

wallcraft
10-29-2009, 10:21 AM
The only book on a similar subject I am aware of is Kindle Formatting: The Complete Guide (http://www.mobileread.com/forums/showthread.php?t=44141). This is probably worth buying for anyone intending to format multiple ebooks. I don't remember if it discusses starting with PDFs though. Since this is somewhat Kindle-specific, there probably is room for a similar "ePub Formatting" book.

Chang
10-30-2009, 06:48 AM
I'm facing the same problems as Direct Ebooks. I have PDF documents as source files and I need an easy way to edit them. One option is MS Word, from which I can easily take the text into InDesign and create an e-book. The problem is the conversion from PDF to DOC. I checked TextLightning.app, which WillAdams mentioned, but it's Mac-only and I have Win XP. I searched Google for "pdf to doc converter", but most of the programs I found are shareware. I also looked for open source software on http://www.sourceforge.net but didn't find any. So far I have only found http://www.somepdf.com/downloads.html, which is free. I tried it, but I'm facing new problems with it.

As you can see in "pdf_sample.gif", there are two hyphens that just tell the reader the word continues on the next line. If I copy and paste those words manually into Notepad, the hyphens disappear and the words display correctly, but the line feeds are wrong, as you can see in "notepad_sample.gif".

When I use the Some PDF tool to convert PDF to DOC, it leaves all the hyphens in, so the words display incorrectly, as you can see in "word_sample.gif". I would have to check every hyphen manually, because some of them are necessary, so I can't just use find & replace to erase them all. The line feed also sometimes adds an extra space between words, so some split words have a hyphen followed by a space. That means I really need to check each case manually to see whether there is a hyphen alone or a hyphen plus a space.

The problem is that I must check either every hyphen or every line feed manually. Both are very laborious for books with hundreds of pages. I'm using MS Word to set up a few styles, then exporting that DOC file to InDesign to create an e-book. Can you recommend any programs to ease this process, or offer any other suggestions to make it easier?
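One way to cut down the manual checking might be a small script that only removes a line-end hyphen when the joined word also appears unbroken somewhere else in the same text, and lists everything else for review. The Python below is a rough sketch of that heuristic, not a tested tool: the function name `dehyphenate` is made up, it only handles plain ASCII letters, and it only looks at hyphens that sit directly before a line break (optionally followed by spaces).

```python
import re

WORD = re.compile(r"[A-Za-z]+")
# Word fragment, hyphen, optional spaces, line break, optional spaces, next fragment.
SPLIT = re.compile(r"([A-Za-z]+)-[ \t]*\n[ \t]*([A-Za-z]+)")

def dehyphenate(text: str) -> tuple[str, list[str]]:
    """Join words split across lines; return the new text and cases to review."""
    # Vocabulary of every unbroken word found in the document itself.
    vocab = {w.lower() for w in WORD.findall(text)}
    review = []

    def fix(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in vocab:
            return left + right          # seen elsewhere unhyphenated: drop the hyphen
        review.append(f"{left}-{right}")  # unsure: keep the hyphen, flag for checking
        return f"{left}-{right}"

    return SPLIT.sub(fix, text), review

sample = ("We continue to read. The words con-\ntinue on the next line, "
          "and a well-\nknown case stays.")
fixed, to_check = dehyphenate(sample)
print(fixed)
print(to_check)
```

The review list is the point of the design: a word that occurs only once in the book will always be flagged, so the script narrows the manual work rather than replacing it.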

JSWolf
10-30-2009, 11:59 AM
What worked for me the last time I did it was to use Adobe Acrobat Professional to convert the PDF. Then you have to take the converted file and the PDF and carefully compare them. That is the only way to do it without ending up with a file full of errors.

But why not start with the source that was used to create the PDF?

charleski
10-30-2009, 03:58 PM
Chang: going by the output you provide I wouldn't bother trying to get SomePDF to work properly. If it can't even handle hyphens correctly it's not worth using.

If you're looking for a free program, have you tried Mobipocket Creator? You can use that to convert a PDF to HTML, and from some brief tests it seems that it respects tags reasonably well. Tagged paragraphs that are not separated with a blank line are simply given a break tag at the end, which shows up as a manual line-break in Word, but a simple search-replace is all that's needed to convert those back into paragraph marks. It also doesn't get confused by hyphens (as long as they're soft hyphens, which any decent PDF-creation program should use for words that are split at line breaks).
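If you would rather fix the break tags in the HTML itself instead of in Word, something like the following Python sketch could do it. It assumes the break tags sit inside `<p>` elements and that any soft hyphens appear as the literal U+00AD character. I have not checked this against real Mobipocket Creator output, and `clean_html_fragment` is just a name for the example.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible hyphenation hint, safe to drop in reflowable text
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def clean_html_fragment(html: str) -> str:
    """Drop soft hyphens and turn each <br> into a paragraph boundary."""
    html = html.replace(SOFT_HYPHEN, "")
    return BR_TAG.sub("</p>\n<p>", html)

print(clean_html_fragment("<p>One.<br>Two con\u00adtinue.<BR />Three.</p>"))
```

A regex is enough for a uniform search-replace like this, but break tags outside a paragraph would produce unbalanced tags, so the output should be validated afterwards.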

I wouldn't worry about ragged line-ends such as the ones you show in notepad_sample.gif. You're creating reflowable text and the reader will handle the line lengths when it lays out the eBook.
As I said before, a lot depends on whether the PDF was properly tagged when it was initially created. If it wasn't, then there's no magic program to help and nothing for it but to go through the text and correct it by hand.

Timoleon
11-02-2009, 04:03 PM
For a free solution you might try Book Designer. You'll need to clean up the output a bit, but after you do, save it to a lit file and then convert that over to an epub file using Calibre.

JSWolf
11-03-2009, 11:19 AM
Use the Quark file and export it to HTML if possible and go from there. Using the PDF is going to be no end of hassle.