MobileRead Forums - View Single Post - best way to convert PDF to ePUB

vastav · 08-09-2010, 05:33 PM

Quote:

Originally Posted by PatNY

I am getting the best results by far using Acrobat as an intermediate step.

Over the last week I converted a couple pdf books to epub format and the biggest problem was getting the paragraph breaks to end up right. I initially tried a straight Calibre conversion but paragraph breaks were all over the place and incorrect -- even after fiddling for quite some time with the line un-wrapping value.

Then I read this thread and the suggestions by chaley and greenapple to use Acrobat were right on the money. I tried other suggestions such as Mobipocket Creator and the pdf2epub.com converter but both resulted in body text where paragraphs all ran together in one long block!

With Acrobat, converting either to RTF or HTML gave me an almost perfect result with the body text. I convert a pdf both ways in Acrobat, then import both rtf and html into Calibre and see which conversion to epub gives the best result in the body text. In one instance it was RTF and in the other it was HTML.

After deciding which gave the best base conversion (RTF or HTML) I then imported the file into MS Word to designate chapter headings and generate a TOC. (I find it easier to do in Word than in Sigil.) Then I import into Calibre, convert to ePub, and do last minute tidying up in Sigil. Sounds like a long process, and it is, but it's much less labor intensive and problematic than trying to clean up the bad paragraph breaks left by other conversion methods.

I realize not all have or can afford Acrobat, but if you look on eBay you can sometimes find older versions on sale for a good price. There may also be some free or cheaper pdf applications that can do as clean a job as Acrobat on pdf-to-rtf/html conversions. I already had Acrobat but never realized it could be so helpful in ebook conversions.

--Pat

I hope you tried the solution at http://www.pdf2epub.com and not the similar-sounding offering by dnaml. The ePub output will have the same paragraph breaks as those you find in the RTF or HTML export from Acrobat. Here's why: the conversion plugins (for formats such as RTF, HTML, XML, and plain text) that come packaged with Acrobat use the tags in the PDF to drive the conversion process. If the PDF is not tagged, the first step in the conversion process is to generate tags using the tag recognition technology that comes with Acrobat. Once the tags are generated, a piece of content marked as a paragraph will be exported as a paragraph by all conversion filters, including the ePub plugin that I supply.
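To illustrate why every filter ends up with the same paragraph breaks, here is a rough Python sketch. It is not how Acrobat or my plugin is implemented; the `tagged` list, `to_html`, and `to_plain_text` are made-up names for illustration. The only thing taken from Tagged PDF is the idea of structure types such as P and H1. Both exporters walk the same tagged items, so they cannot disagree about where a paragraph ends.

```python
from html import escape

# Hypothetical tagged content: (structure type, text) pairs.
# "H1" and "P" are Tagged PDF structure types; the sample data is invented.
tagged = [
    ("H1", "Chapter 1"),
    ("P", "First paragraph, which wrapped over several lines in the PDF."),
    ("P", "Second paragraph."),
]

def to_html(items):
    """Emit one HTML element per tagged item."""
    out = []
    for tag, text in items:
        name = tag.lower()  # H1 -> h1, P -> p
        out.append(f"<{name}>{escape(text)}</{name}>")
    return "\n".join(out)

def to_plain_text(items):
    """Emit one text block per tagged item, separated by blank lines."""
    return "\n\n".join(text for _, text in items)

html_out = to_html(tagged)
text_out = to_plain_text(tagged)
```

The line wrapping inside the PDF never enters the picture here: the tags, not the visual line breaks, decide where each block starts and ends.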

At its origin, Tagged PDF was primarily influenced by the HTML 4.01 and CSS 1.0 specifications. The Tagged PDF spec has some omissions as well as additions compared with those two standards. I am not sure about the current state of RTF, but the RTF 1.6 specification (which is what Acrobat 7 exports) differs in places from Tagged PDF's styling attributes. That is why I mentioned that when you go from PDF > RTF > ePub, you will likely encounter some loss, depending on how your PDF is constructed.

For the TOC, if you use the plugin I supply, all bookmarks in the PDF automatically get converted to a TOC in the ePub. If you have a PDF which is tagged by the authoring application, you can simply create the bookmarks in Acrobat by choosing "New bookmarks from Structure" from the top drop-down available in the bookmarks tab in Acrobat. If you have a PDF which is not tagged (you can check by opening View > Navigation Panels > Tags), you should create the bookmarks manually in Acrobat before running the conversion filter for HTML/RTF/ePub, so that the bookmarks are exported in a valid manner in the exported file.
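As a rough picture of what turning bookmarks into a TOC involves, here is a minimal Python sketch. It assumes the bookmarks have already been read out as a flat list of (level, title) pairs with levels starting at 1; `nest_bookmarks` and the sample data are hypothetical and are not the plugin's actual code.

```python
def nest_bookmarks(bookmarks):
    """Turn a flat list of (level, title) pairs into a nested TOC tree."""
    root = {"title": None, "children": []}
    stack = [(0, root)]  # (level, node) for the currently open branches
    for level, title in bookmarks:
        node = {"title": title, "children": []}
        # Close any branch at the same or a deeper level than this bookmark.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]

# Invented sample data, in document order.
bookmarks = [
    (1, "Part I"),
    (2, "Chapter 1"),
    (2, "Chapter 2"),
    (1, "Part II"),
]
toc = nest_bookmarks(bookmarks)
```

Here `toc` holds two top-level entries, with the two chapters nested under "Part I". This is why well-formed bookmarks matter before running the filter: the nesting of the TOC comes straight from the bookmark levels.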

If you have Acrobat on your system, I would suggest using the ePub plugin available on my site rather than the web-based solution. The help documentation provides details on using the plugin. If you like the RTF/HTML export from Acrobat, there is a good chance that you will also like the ePub export. I will be happy to help resolve any issues that you may find.