Thread: PDF to EPUB
Old 03-02-2014, 09:08 AM   #7
sumguy
Connoisseur
Posts: 57
Karma: 1186
Join Date: Jun 2012
Device: none
Quote:
Originally Posted by eschwartz
That's really doing it by hand vs doing automated processes to save time. Not calibre's fault... It can fix a lot of mistakes, but it will never be perfect. There's a lot of options to control how to make the attempt to derive meaning, and different PDFs will yield different results.
My experience is that Mobipocket Creator is as close to "automated" as it gets (it's the same technology that Amazon uses to convert when you e-mail a PDF to your Kindle). My needs are just to convert PDFs to EPUBs that are "good enough" to read on my reader, not to create perfectly polished publications, so of course other people's requirements may be different.

Armed with enough Word macros or Calibre regexes, you can accomplish similar results in terms of removing headers and footers, page numbers, etc., probably with finer control. But these often need to be customized for each particular book, which is also a lot of "by hand" work. Mobipocket Creator does a surprisingly good job of doing all that automatically, at least compared to any other software out there. The downside is that you don't have any control over the rules it uses, so if there are mistakes you need to use something else to buff them out. Of course that can also be done with macros or regexes, so you still save a huge amount of time by letting Creator do the first pass. I'm only talking about "standard" books here; anything with a complicated layout is going to be a big task no matter what...
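To give an idea of the kind of per-book regex cleanup I mean, here is a rough Python sketch. It assumes the converted text has each page number and each running header on a line of its own; the function name and the header text "MY BOOK TITLE" are made up for the example, and a real book would need its own patterns.

```python
import re

def strip_page_furniture(text, running_header):
    """Remove lines that hold only a page number or only the running header."""
    # A line made up solely of 1-4 digits (hypothetical page-number format).
    page_number = re.compile(r"^[ \t]*\d{1,4}[ \t]*\n", re.MULTILINE)
    # A line made up solely of the running header text, escaped so it is matched literally.
    header = re.compile(
        r"^[ \t]*" + re.escape(running_header) + r"[ \t]*\n", re.MULTILINE
    )
    text = page_number.sub("", text)
    return header.sub("", text)

# Hypothetical sample: body text interrupted by a page number and a header.
sample = "It was a dark night.\n12\nMY BOOK TITLE\nThe rain fell.\n"
cleaned = strip_page_furniture(sample, "MY BOOK TITLE")
print(cleaned)
```

The catch is exactly the one described above: the header text, the page number format, and where they sit all change from book to book, so the patterns have to be adjusted each time.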

Calibre is awesome at many things, but PDF conversion isn't one of its strong points. What I find most annoying is the text unwrapping, and that certainly is Calibre's fault. The algorithm it uses is quite simplistic: if a line is shorter than xx% of the page width, it's considered a paragraph break; if it's longer, it's not. So in a typical book, you end up with hundreds of incorrect paragraph breaks: spurious breaks that shouldn't be there, and paragraphs stuck together that shouldn't be. At that point it becomes extremely difficult to fix automatically with macros or regexes, because the original information is already lost. With Creator, that's rarely a problem. I don't know how they do it, but it's miles ahead of Calibre, and even stitches things together correctly across page breaks, footnotes, and so on, which is a lifesaver. No, it's not perfect, and it does tend to make mistakes with long lists like an index or list of footnotes. But again - miles ahead of Calibre in terms of doing a "good enough" job in the least amount of time.
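Here is a small Python sketch of a length-threshold unwrapper of the kind described above. This is my own illustration of the heuristic, not Calibre's actual code: the function name, the 0.8 threshold, and the use of the longest line's character count as a stand-in for page width are all assumptions.

```python
def unwrap(lines, threshold=0.8):
    """Join lines into paragraphs using only a line-length threshold."""
    # Use the longest line as a stand-in for the page width (an assumption).
    width = max((len(line) for line in lines), default=0)
    paragraphs, current = [], []
    for line in lines:
        current.append(line.strip())
        # A line shorter than the threshold is treated as the end of a paragraph.
        if len(line) < threshold * width:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Flush whatever is left when the text ends on a long line.
        paragraphs.append(" ".join(current))
    return paragraphs

lines = [
    "A short line,",          # mid-paragraph, short only because the next word is long
    "extraordinarily long",
    "words end paragraph.",   # a real paragraph end that happens to fill the line
    "Next paragraph here.",
]
result = unwrap(lines)
print(result)
```

Both kinds of mistake show up even in this tiny sample: the first line gets split off as its own paragraph, and the real paragraph break after "words end paragraph." disappears. Once the text has been joined like that, a regex can't tell where the break used to be.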

By the way, BetterRed, you don't necessarily need to go through the whole process of making the PRC (mobi) in Creator, and then converting that to something else. As soon as you import a PDF, it makes a folder with an HTML file and the associated image files. You can just quit out of Creator then, even without saving, and grab the HTML folder. From there you can use whatever method you like, e.g. import into Word, use an HTML editor, or import directly into Sigil or the Calibre book editor.

Last edited by sumguy; 03-02-2014 at 09:20 AM.