#1
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none

PDF to EPUB
Big project coming up, producing EPUB from publication-ready PDF. No chance of accessing the source files, INDD or whatever.
Current workflow:
Trim off chapter headers, page numbers etc. in Acrobat (followed by "Remove hidden information" to REALLY remove the data). Often amenable to a degree of automation.
Convert in Calibre. Generally a good job done on the actual text, but despite the best attempts of Heuristics there will still be a lot of spurious paragraph breaks to check and edit.
Collate footnotes (there are LOTS of footnotes) to a section at the end of each chapter and construct hyperlinks to and from them. Any ideas for making this easier?
Reset illustrations (not too many of these).
Include the original Index section, but with a note "please consider this a wordlist for use with your reader's Search function".
Charge according to the work done!
Any tips and tricks anyone can offer? Thanks.
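For the footnote step, one possible way to automate the linking is sketched below in Python. It rests on assumptions that are not in the post: the note references in the chapter text have already been marked by hand or by regex as `[[n]]` placeholders, the note texts have been gathered into a dict keyed by number, and each number occurs once per chapter. The function name, the placeholder syntax and the `id` scheme are all hypothetical.

```python
import re
from html import escape


def link_footnotes(body_html, notes, chapter="ch1"):
    """Replace [[n]] markers with links and append a linked notes section.

    body_html: chapter XHTML containing hypothetical [[n]] markers.
    notes:     dict mapping note number (int) to plain note text.
    chapter:   prefix that keeps ids unique across chapters.
    """
    used = []

    def make_ref(match):
        n = int(match.group(1))
        if n not in notes:
            # Leave unknown markers in place so they can be checked by hand.
            return match.group(0)
        used.append(n)
        return (f'<a id="ref-{chapter}-{n}" href="#fn-{chapter}-{n}">'
                f'<sup>{n}</sup></a>')

    linked = re.sub(r"\[\[(\d+)\]\]", make_ref, body_html)

    # Each note links back to the reference that points at it.
    items = [
        f'<p id="fn-{chapter}-{n}">'
        f'<a href="#ref-{chapter}-{n}">{n}</a> {escape(notes[n])}</p>'
        for n in used
    ]
    section = '<div class="footnotes">\n' + "\n".join(items) + "\n</div>"
    return linked + "\n" + section
```

Calling `link_footnotes(chapter_text, notes, "ch3")` returns the chapter text with each marker turned into a superscript link, followed by a notes block whose entries link back. Markers with no matching note are left untouched, which makes them easy to find with a search afterwards.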

#2
Connoisseur
Posts: 57
Karma: 1186
Join Date: Jun 2012
Device: none

Calibre's PDF conversion is awful. In particular, the "Heuristics" option for unwrapping text, which goes purely by line length, is basically unusable. Try Mobipocket Creator instead: it does a much better job, and can be used with Sigil to make EPUBs. It's really worthwhile to learn Sigil rather than struggling with Calibre to author EPUBs.
[edit: I just learned that Calibre has a new book editor module that's meant to provide a replacement for Sigil, which isn't being developed anymore. I haven't tried it yet, but it could be a good alternative.]
My workflow is to import the PDF into Mobipocket Creator, and then just quit without doing anything else. Grab the resulting HTML file & images and import them into Sigil. Clean it up by hand and/or with regular expressions, add table of contents and cover, etc. Much better results than Calibre, though there is often still a lot of hand editing to do. Mobipocket does make some irritating mistakes, particularly with links and footnotes.
Last edited by sumguy; 03-02-2014 at 08:07 AM.
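As an illustration of the regular-expression cleanup mentioned above, here is a minimal Python sketch that stitches back together paragraphs that look as if they were split mid-sentence. The rule is an assumption, not anything a particular tool does: a break is treated as spurious when the previous paragraph ends in a letter, comma or semicolon and the next one starts with a lowercase letter. It also assumes bare `<p>` tags with no attributes.

```python
import re

# </p>, optional whitespace, <p>, where the text before ends in a letter,
# comma or semicolon and the text after starts with a lowercase letter.
SPURIOUS_BREAK = re.compile(r"(?<=[A-Za-z,;])</p>\s*<p>(?=[a-z])")


def merge_spurious_breaks(html):
    """Join paragraphs that appear to have been split in mid-sentence."""
    return SPURIOUS_BREAK.sub(" ", html)
```

A rule like this will miss breaks that fall before a capitalised word (a proper noun, say), so the result still needs a read-through.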

#3
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)

Quote:
And you can use calibre's Edit Book to the same effect as Sigil once you have your EPUB. Saves having to install two programs, and it gets a lot more attention to bugfixes nowadays, although it doesn't yet have spellcheck.

#4
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none

Thanks for the interest.
Yes, there's always going to be an element of "doing it by hand" - we've all seen the results when automation has been completely relied on :-)
I'll certainly try Mobipocket Creator. With Calibre I'm getting nearly faultless text transfer (some ligatures, particularly ff, seem to fool it). My main job is checking through for extra paragraph breaks. Do you feel Mobipocket Creator will make a better job of these, sumguy?
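On the ligature point, a small Python sketch of one possible fix follows. It assumes the ligatures come through conversion as the Unicode presentation-form characters (U+FB00 to U+FB04); if the converter drops or garbles them instead, this will not help. The function name is hypothetical.

```python
# Map the common Latin ligature characters to their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with ordinary letters."""
    return text.translate(_TABLE)
```

A targeted table is used here rather than full Unicode compatibility normalisation, which would also alter other characters such as superscript digits and non-breaking spaces.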

#5
null operator (he/him)
Posts: 21,662
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none

Quote:
I have Word macros to take care of the common things: broken paragraphs, ligatures, page footers etc. I convert the PRC to RTF in calibre, read the RTF into Word, run the macros, save as DOCX and convert that to EPUB.
If it's 'near' enough then I'll open the DOCX in Word and apply some styles, get rid of all tabs and superfluous newlines etc. If necessary I do fine tuning with the new calibre Editor or Sigil.
If it's not I might have a go with calibre or PDF Nitro, although most often I'll decide it's not worth the effort and settle for only having the original PDF. I don't even try converting complex PDF 'books' with embedded tables, graphs, sidebars etc.
BR

#6
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none

Just had a chance to do some experimenting. Yes, Mobipocket Creator seems to make a much better guess at the paragraph breaks than Calibre does. I haven't had to work from PDF much before. The project is already looking a lot more manageable!

#7
Connoisseur
Posts: 57
Karma: 1186
Join Date: Jun 2012
Device: none

Quote:
Armed with enough Word macros or Calibre regexes, you can accomplish similar results in terms of removing headers & footers, page numbers, etc., probably with more fine control. But often they need to be customized for each particular book, which is also a lot of "by hand" work.
Mobipocket Creator does a surprisingly good job of doing all that automatically, at least compared to any other software out there. The downside is that you don't have any control over the rules it uses, so if there are mistakes you need to use something else to buff them out. Of course that can also be done with macros or regexes, so you still save a huge amount of time by letting Creator do the first pass. I'm only talking about "standard" books here; anything with complicated layout is going to be a big task no matter what...
Calibre is awesome at many things, but PDF conversion isn't one of its strong points. What I find most annoying is the text unwrapping, and that certainly is Calibre's fault. The algorithm it uses is quite simplistic: if a line is less than xx% of the page width, it's considered a paragraph break; if it's longer, it's not. So in a typical book, you end up with hundreds of incorrect paragraph breaks - spurious breaks that shouldn't be there, and paragraphs stuck together that shouldn't be. At that point it becomes extremely difficult to fix automatically with macros or regexes, because the original information is already lost.
With Creator, that's rarely a problem. I don't know how they do it, but it's miles ahead of Calibre, and even stitches things together correctly across page breaks, footnotes, and so on, which is a lifesaver. No, it's not perfect, and it does tend to make mistakes with long lists like an index or list of footnotes. But again - miles ahead of Calibre in terms of doing a "good enough" job in the least amount of time.
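To make the line-length rule described above concrete, here is a minimal Python sketch of that kind of heuristic. It is an illustration of the rule as stated in this post, not Calibre's actual code: width is measured in characters against the longest line, and the 0.8 threshold stands in for the unspecified "xx%".

```python
def unwrap_by_line_length(lines, threshold=0.8):
    """Join wrapped lines into paragraphs using only line length.

    A line shorter than threshold * (longest line) is taken to end a
    paragraph; any longer line is joined to the one that follows.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    full_width = max(len(ln) for ln in lines)

    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        if len(ln) < threshold * full_width:
            # Short line: assume this is the last line of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Both failure modes follow directly from the rule. A paragraph whose last line happens to run to nearly full width is glued to the next one, and a short line inside a paragraph (verse, or a line cut early before a long word) produces a break that should not be there. Once the lines are joined, nothing in the output records where the original line ends were.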
By the way, BetterRed, you don't necessarily need to go through the whole process of making the PRC (mobi) in Creator, and then converting that to something else. As soon as you import a PDF, it makes a folder with an HTML file and the associated image files. You can just quit out of Creator then, even without saving, and grab the HTML folder. From there you can use whatever method you like, e.g. import into Word, use an HTML editor, or import directly into Sigil or the Calibre book editor.
Last edited by sumguy; 03-02-2014 at 09:20 AM.

#8
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none

With some of these books, the customer's just going to have to be told 'no way - leave it as PDF'. One's just come up - PDF page graphics with no readable text content, and plenty of quotes in Greek and Hebrew! Life's too short... :-) MobiPocket just spat out a set of PNG graphics, one for each page.
I love the way MobiPocket makes a folder with an HTML file and an Images folder containing any illustrations - drag the HTML onto your Sigil icon, and the images come across automatically! The discontinued development of Sigil is a great loss. Though I suppose it already covers just about everything you CAN do in EPUB 2, and we'll be making books in that format for a good time yet!