Old 02-21-2014, 06:49 AM   #1
exaltedwombat
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
PDF to EPUB

Big project coming up, producing EPUB from publication-ready PDF. No chance of accessing the source files, INDD or whatever.

Current workflow:

Trim off chapter headers, page numbers etc. in Acrobat (followed by Remove hidden information to REALLY remove the data). Often amenable to a degree of automation.

Convert in Calibre. Generally a good job done on the actual text, but despite the best attempts of Heuristics there will still be a lot of spurious paragraph breaks to check and edit.

Collate footnotes (there are LOTS of footnotes) to a section at the end of each chapter and construct hyperlinks to and from them. Any ideas for making this easier?

Reset illustrations (not too many of these).

Include the original Index section, but with a note "please consider this a wordlist for use with your reader's Search function".

Charge according to the work done!

Any tips and tricks anyone can offer? Thanks.
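On the footnote question above, one option is to script the linking. What follows is a minimal Python sketch, not a finished tool. It assumes the chapter body marks references as [n] and that the footnote texts have already been collected into a dict keyed by number (as strings). The function name, the id scheme and the "footnotes" class are made up for illustration.

```python
import re
from html import escape


def link_footnotes(body_html, notes, chapter_id="ch1"):
    """Turn [n] markers into links and append a notes section with back-links.

    body_html  -- chapter markup containing markers such as [12] (assumed format)
    notes      -- dict mapping the marker number (as a string) to the note text
    chapter_id -- hypothetical prefix that keeps ids unique across chapters
    """
    def marker(match):
        n = match.group(1)
        if n not in notes:
            # Unknown marker: leave it alone so it can be checked by hand.
            return match.group(0)
        return (f'<a id="{chapter_id}-ref{n}" href="#{chapter_id}-fn{n}">'
                f'<sup>{n}</sup></a>')

    linked = re.sub(r"\[(\d+)\]", marker, body_html)

    # One paragraph per note, each linking back to its reference in the text.
    items = [
        f'<p id="{chapter_id}-fn{n}">'
        f'<a href="#{chapter_id}-ref{n}">{n}</a>. {escape(text)}</p>'
        for n, text in notes.items()
    ]
    section = '<div class="footnotes">\n' + "\n".join(items) + "\n</div>"
    return linked + "\n" + section
```

Markers with no matching note are left untouched, which makes the leftovers easy to find with a search for "[" afterwards.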
Old 02-25-2014, 12:04 PM   #2
sumguy
Connoisseur
Posts: 57
Karma: 1186
Join Date: Jun 2012
Device: none
Calibre's PDF conversion is awful; in particular, the "Heuristics" option for unwrapping text, which is based only on line length, is basically unusable. Try Mobipocket Creator instead: it does a much better job and can be used with Sigil to make EPUBs. It's really worthwhile to learn Sigil rather than struggling with Calibre to author EPUBs.

[edit: I just learned that Calibre has a new book editor module, that's meant to provide a replacement for Sigil, which isn't being developed anymore. I haven't tried it yet, but it could be a good alternative.]

My workflow is to import the PDF into Mobipocket Creator and then just quit without doing anything else. Grab the resulting HTML file and images and import them into Sigil. Clean it up by hand and/or with regular expressions, add a table of contents and cover, etc. The results are much better than Calibre's, though there is often still a lot of hand editing to do. Mobipocket does make some irritating mistakes, particularly with links and footnotes.

Last edited by sumguy; 03-02-2014 at 08:07 AM.
Old 02-25-2014, 04:38 PM   #3
eschwartz
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Originally Posted by sumguy
Calibre's PDF conversion is awful; in particular, the "Heuristics" option for unwrapping text, which is based only on line length, is basically unusable. Try Mobipocket Creator instead: it does a much better job and can be used with Sigil to make EPUBs. It's really worthwhile to learn Sigil rather than struggling with Calibre to author EPUBs.

My workflow is to import the PDF into Mobipocket Creator and then just quit without doing anything else. Grab the resulting HTML file and images and import them into Sigil. Clean it up by hand and/or with regular expressions, add a table of contents and cover, etc. The results are much better than Calibre's, though there is often still a lot of hand editing to do. Mobipocket does make some irritating mistakes, particularly with links and footnotes.

That's really doing it by hand versus using automated processes to save time. Not calibre's fault... It can fix a lot of mistakes, but it will never be perfect. There are a lot of options to control how it attempts to derive meaning, and different PDFs will yield different results.

And you can use calibre's Edit Book to the same effect as Sigil once you have your EPUB. That saves having to install two programs, and it gets a lot more attention to bugfixes nowadays, although it doesn't yet have spellcheck.
Old 02-25-2014, 06:49 PM   #4
exaltedwombat
Guru
Thanks for the interest.

Yes, there's always going to be an element of "doing it by hand" - we've all seen the results when automation has been completely relied on :-)

I'll certainly try Mobipocket Creator. With Calibre I'm getting nearly faultless text transfer (some ligatures, particularly ff, seem to fool it). My main job is checking through for extra paragraph breaks. Do you feel Mobipocket Creator will make a better job of these, sumguy?
Old 02-25-2014, 08:33 PM   #5
BetterRed
null operator (he/him)
Posts: 21,662
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by exaltedwombat
Thanks for the interest.

Yes, there's always going to be an element of "doing it by hand" - we've all seen the results when automation has been completely relied on :-)

I'll certainly try Mobipocket Creator. With Calibre I'm getting nearly faultless text transfer (some ligatures, particularly ff, seem to fool it). My main job is checking through for extra paragraph breaks. Do you feel Mobipocket Creator will make a better job of these, sumguy?

@exaltedwombat - My default is to try Mobipocket Creator first, look at the PRC, and decide if it's worth doing this.

I have Word macros to take care of the common things: broken paragraphs, ligatures, page footers, etc. I convert the PRC to RTF in calibre, read the RTF into Word, run the macros, save as DOCX and convert that to EPUB. If it's 'near' enough then I'll open the DOCX in Word and apply some styles, get rid of all tabs and superfluous newlines, etc. If necessary I do fine tuning with the new calibre Editor or Sigil.

If it's not, I might have a go with calibre or PDF Nitro, although most often I'll decide it's not worth the effort and settle for only having the original PDF.

I don't even try converting complex PDF 'books' with embedded tables, graphs, sidebars etc.

BR
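For anyone without Word, the same kind of cleanup (ligatures and broken paragraphs) can be approximated in a script. This is only a rough Python sketch, not a port of the macros mentioned above. The ligature table and the merge rule (previous paragraph has no closing punctuation and the next one starts with a lowercase letter) are assumptions chosen for illustration, and both function names are hypothetical.

```python
# Latin ligature code points (U+FB00 to U+FB04) mapped to plain letters.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})

# Characters treated as a plausible end of paragraph (an assumed list).
CLOSERS = (".", "!", "?", ":", ";", '"', "\u201d", "'", ")")


def expand_ligatures(text):
    """Replace single-character ligatures with their component letters."""
    return text.translate(LIGATURES)


def merge_broken_paragraphs(paragraphs):
    """Rejoin paragraphs that look as if they were split mid-sentence."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty paragraphs
        if merged and not merged[-1].endswith(CLOSERS) and para[0].islower():
            # Previous paragraph stops mid-sentence and this one continues it.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```

The merge rule will miss breaks that fall before a capitalised word such as a name, so the output still needs a read-through.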
Old 02-27-2014, 09:41 PM   #6
exaltedwombat
Guru
Just had a chance to do some experimenting. Yes, Mobipocket Creator seems to make a much better guess at the paragraph breaks than Calibre does. I haven't had to work from PDF much before. The project is already looking a lot more manageable!
Old 03-02-2014, 09:08 AM   #7
sumguy
Connoisseur
Quote:
Originally Posted by eschwartz
That's really doing it by hand versus using automated processes to save time. Not calibre's fault... It can fix a lot of mistakes, but it will never be perfect. There are a lot of options to control how it attempts to derive meaning, and different PDFs will yield different results.

My experience is that Mobipocket Creator is as close to "automated" as it gets (it's the same technology that Amazon uses to convert when you e-mail a PDF to your Kindle). My needs are just to convert PDFs to EPUBs that are "good enough" to read on my reader, not to create perfectly polished publications, so of course other people's requirements may be different.

Armed with enough Word macros or Calibre regexes, you can accomplish similar results in terms of removing headers and footers, page numbers, etc., probably with finer control. But often they need to be customized for each particular book, which is also a lot of "by hand" work. Mobipocket Creator does a surprisingly good job of doing all that automatically, at least compared to any other software out there. The downside is that you don't have any control over the rules it uses, so if there are mistakes you need to use something else to buff them out. Of course that can also be done with macros or regexes, so you still save a huge amount of time by letting Creator do the first pass. I'm only talking about "standard" books here; anything with a complicated layout is going to be a big task no matter what...
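To give an idea of the kind of per-book rule meant here, this is a small Python sketch that drops bare page numbers and repeated running headers from a list of extracted text lines. It is an illustration only: the function name, the 60-character limit and the repeat threshold are assumptions, and they are exactly the sort of thing that has to be tuned for each book.

```python
import re
from collections import Counter


def strip_page_furniture(lines, min_repeats=3):
    """Drop bare page numbers and short lines that repeat throughout the text."""
    stripped = [ln.strip() for ln in lines]
    counts = Counter(s for s in stripped if s)
    cleaned = []
    for s in stripped:
        if re.fullmatch(r"\d{1,4}", s):
            continue  # a line holding only a page number
        if s and len(s) < 60 and counts[s] >= min_repeats:
            continue  # a short line seen many times: likely a running header
        cleaned.append(s)
    return cleaned
```

Note that the repeat rule also removes legitimate short lines that recur, such as a scene-break marker, so it is worth checking what was dropped.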

Calibre is awesome at many things, but PDF conversion isn't one of its strong points. What I find most annoying is the text unwrapping, and that certainly is Calibre's fault. The algorithm it uses is quite simplistic: if a line is less than xx% of the page width, it's considered a paragraph break; if it's longer, it's not. So in a typical book, you end up with hundreds of incorrect paragraph breaks: spurious breaks that shouldn't be there, and paragraphs stuck together that shouldn't be. At that point it becomes extremely difficult to fix automatically with macros or regexes, because the original information is already lost. With Creator, that's rarely a problem. I don't know how they do it, but it's miles ahead of Calibre, and even stitches things together correctly across page breaks, footnotes, and so on, which is a lifesaver. No, it's not perfect, and it does tend to make mistakes with long lists like an index or list of footnotes. But again, it's miles ahead of Calibre in terms of doing a "good enough" job in the least amount of time.
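The line-length rule as described in this post can be sketched in a few lines of Python. This is not calibre's actual code, just an illustration of the rule as stated; the function name is hypothetical and the 0.8 default is a placeholder for the unspecified percentage, with line length measured in characters rather than true page width.

```python
def unwrap_by_length(lines, page_width, threshold=0.8):
    """Join wrapped lines into paragraphs using only line length.

    A line shorter than threshold * page_width is taken to end a paragraph;
    any longer line is assumed to continue onto the next line.
    """
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines entirely
        current.append(text)
        if len(text) < threshold * page_width:
            # Short line: treat it as the last line of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Text left over when the input ends on a full-width line.
        paragraphs.append(" ".join(current))
    return paragraphs
```

Both failure modes follow directly from the rule: a short line in the middle of a paragraph produces a spurious break, and a paragraph whose last line happens to be full width gets glued to the next one.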

By the way, BetterRed, you don't necessarily need to go through the whole process of making the PRC (mobi) in Creator and then converting that to something else. As soon as you import a PDF, it makes a folder with an HTML file and the associated image files. You can just quit out of Creator then, even without saving, and grab the HTML folder. From there you can use whatever method you like, e.g. import into Word, use an HTML editor, or import directly into Sigil or the Calibre book editor.

Last edited by sumguy; 03-02-2014 at 09:20 AM.
Old 03-02-2014, 10:11 AM   #8
exaltedwombat
Guru
Some of these books, the customer's just going to have to be told 'no way - leave it as PDF'. One's just come up: PDF page graphics with no readable text content, and plenty of quotes in Greek and Hebrew! Life's too short... :-) MobiPocket just spat out a set of PNG graphics, one for each page.

I love the way MobiPocket makes a folder with HTML file and an Images folder containing any illustrations - drag the HTML onto your Sigil icon, the images come across automatically!
The discontinued development of Sigil is a great loss. Though I suppose it already covers just about everything you CAN do in EPUB 2, and we'll be making books in that format for a good time yet!