Old 08-10-2014, 06:21 PM   #1
PHC
Member
Posts: 21
Karma: 15000
Join Date: Feb 2014
Device: iPhone, iPad, Macbook Pro, Mac Pro
Converting PDF to epub using Acrobat and Calibre CLI

I read the thread 'Converting a PDF to mobi and having it come out right?', which was about converting the very book I have on hand to mobi. BTW, the book is 'The Girl on the Dock' and is only available as a PDF. I just spent a day and a half trying various methods of converting it to epub. I think whoever decided to publish this book as a two-facing-pages PDF should be fired, if not molested with Petra's magic wand. ;-)

The solution I finally settled on is as follows, step by step. It produces a decent epub with all the illustrations, but some sentences are inexplicably split and the paragraphs have no spacing between them, no matter what I tried. I could hand-edit these but it is not worth my time. I suspect the lousy PDF is not formatted correctly. There is also no TOC, but that is because the stupidly authored PDF has none. I could add one, but for 5 short chapters it doesn't seem worthwhile. Note that I saved the cropped file as HTML because Calibre does a far better job converting HTML than it does converting PDF.

Procedure:

1. Open PDF in Acrobat X:

Tools->Pages->Header & Footer->Remove…:

Removes only the text above the upper hairline.

Tools->Pages->Crop:

Select the entire region between the hairlines at full width. Double-click the selection. Select Page Range->All and click OK. This crops all pages and can be undone. Do not check 'Remove White Margins' or the crop will include the hairlines.

Tools->Protection->Remove Hidden Information:

When Status is 'Finding Hidden Information…Done', click Remove.

File->Save As->More Options->HTML Web Page->Settings…:

Check 'Include Images'. Uncheck 'Run OCR if needed' or it produces unwanted artifacts.

See 'Cropping Pages Permanently with Acrobat Pro' for more information.

2. The HTML file needs touching up: the 'illuminated' first character of each chapter is missing. Open the file in a plain text editor and find ">one<", then type in the missing 'P' in 'Petra'. Find ">two<" and type in the missing 'T' in 'The'. Repeat the find through ">five<", typing in the missing uppercase character each time.
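If you would rather script the touch-up in step 2 than do it by hand, here is a minimal Python sketch. It assumes the first visible text after each ">one<", ">two<", ... marker is the word that lost its first letter, which may not hold for every Acrobat export. The function names and the sample markup in the usage note are made up for illustration. Only the letters for chapters one and two are given above, so the others are left for you to fill in.

```python
import re

# Matches any run of tags and whitespace, so we can skip from the
# chapter marker to the first visible text character that follows it.
_SKIP_TAGS = re.compile(r"(?:\s*<[^>]*>)*\s*")


def restore_initial(html: str, chapter: str, letter: str) -> str:
    """Insert `letter` in front of the first visible text after '>chapter<'.

    Assumption: the first text after the chapter marker is the word
    whose 'illuminated' initial went missing.
    """
    marker = f">{chapter}<"
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found")
    # Begin scanning at the '<' that ends the marker (start of the closing tag).
    scan_from = start + len(marker) - 1
    insert_at = _SKIP_TAGS.match(html, scan_from).end()
    return html[:insert_at] + letter + html[insert_at:]


def fix_chapters(html: str, initials: dict) -> str:
    """Apply restore_initial for every chapter -> missing-letter pair."""
    for chapter, letter in initials.items():
        html = restore_initial(html, chapter, letter)
    return html
```

For example, `fix_chapters(text, {"one": "P", "two": "T"})` run on the made-up fragment `<h1>one</h1><p>etra ...</p>` would turn "etra" back into "Petra". Check the result in a browser before converting, since a heading with extra text after the marker would put the letter in the wrong place.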

3. Calibre CLI:

Code:
ebook-convert "The Girl on the Dock.html" "The Girl on the Dock.epub" --no-default-epub-cover --pretty-print --preserve-cover-aspect-ratio --enable-heuristics --insert-blank-line --cover "cover_image.png" --title "The Girl on the Dock" --authors "G. Norman Lippert"
This is specific to my book but could easily be adapted to any book.
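As a rough sketch of that adaptation, the same command can be built from a few parameters in Python. The function names `build_command` and `convert` are hypothetical. The flags are exactly the ones used above, and the sketch assumes `ebook-convert` is on your PATH.

```python
import subprocess
from pathlib import Path


def build_command(html_path, title, authors, cover=None):
    """Return the ebook-convert argument list for one HTML file.

    The output name is the input name with an .epub extension.
    """
    src = Path(html_path)
    cmd = [
        "ebook-convert",
        str(src),
        str(src.with_suffix(".epub")),
        "--no-default-epub-cover",
        "--pretty-print",
        "--preserve-cover-aspect-ratio",
        "--enable-heuristics",
        "--insert-blank-line",
        "--title", title,
        "--authors", authors,
    ]
    if cover:
        # Only pass --cover when a cover image was supplied.
        cmd += ["--cover", str(cover)]
    return cmd


def convert(html_path, title, authors, cover=None):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(build_command(html_path, title, authors, cover), check=True)

# Example call for the book in this post:
# convert("The Girl on the Dock.html", "The Girl on the Dock",
#         "G. Norman Lippert", cover="cover_image.png")
```

Passing the arguments as a list avoids any shell quoting trouble with spaces in the file name or title.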


Original PDF:

[Image violates guidelines for size - MODERATOR]

Resulting epub:

[Image violates guidelines for size - MODERATOR]

Last edited by Dr. Drib; 08-17-2014 at 06:44 AM.