|  08-10-2014, 06:21 PM | #1 | 
| Member            Posts: 21 Karma: 15000 Join Date: Feb 2014 Device: iPhone, iPad, Macbook Pro, Mac Pro | 
				
				Converting PDF to epub using Acrobat and Calibre CLI
			 
			
I read the thread Converting a PDF to mobi and having it come out right?, which was about converting the very book I have on hand to mobi. BTW the book is 'The Girl on the Dock' and is only available in PDF. I just spent a day and a half trying various methods of converting this book to epub. I think whoever decided to publish this book in two-facing-pages PDF should be fired, if not molested with Petra's magic wand. ;-)

The solution I finally settled upon is as follows, step by step. It produces a decent epub with all illustrations, but some of the sentences are inexplicably split and the paragraphs have no spacing between them no matter how hard I tried. I could hand-edit these but it is not worth my time. I suspect the lousy PDF is not formatted correctly. There is also no TOC, but that is because the stupidly authored PDF has none. I could add one, but for 5 short chapters it doesn't seem worthwhile. Note that I saved the cropped file to HTML because Calibre does an infinitely better conversion of HTML than it does of PDF.

Procedure:

1. Open PDF in Acrobat X:
Tools->Pages->Header & Footer->Remove…: Removes only the text above the upper hairline.
Tools->Pages->Crop: Select entire region between hairlines at full width. Double click the selection. Select Page Range->All and click OK. Crops all pages and can be undone. Do not check 'Remove White Margins' or it will include the hairlines.
Tools->Protection->Remove Hidden Information: When Status is 'Finding Hidden Information…Done', click Remove.
File->Save As->More Options->HTML Web Page->Settings…: Check 'Include Images'. Uncheck 'Run OCR if needed' or it produces unwanted artifacts.
See Cropping Pages Permanently with Acrobat Pro for more information.

2. The HTML file needs touching up--the 'illuminated' first character of each chapter is missing. Open the file in a plain text editor and find ">one<". Then type in the missing 'P' in 'Petra'. Find ">two<" and type in the missing 'T' in 'The'. Repeat the find up to ">five<", typing in the missing uppercase character each time.

3. Calibre CLI:
Code:
    ebook-convert "The Girl on the Dock.html" "The Girl on the Dock.epub" --no-default-epub-cover --pretty-print --preserve-cover-aspect-ratio --enable-heuristics --insert-blank-line --cover "cover_image.png" --title "The Girl on the Dock" --authors "G. Norman Lippert"

Original PDF: [Image violates guidelines for size - MODERATOR]
Resulting epub: [Image violates guidelines for size - MODERATOR]

Last edited by Dr. Drib; 08-17-2014 at 06:44 AM. | 
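For anyone who wants to script step 3, here is a minimal Python sketch that assembles the same ebook-convert call as an argument list, so the spaces in the file names need no shell quoting. The function names `build_convert_command` and `convert` are made up for illustration; the flags are exactly the ones from the command in the post, and the sketch assumes `ebook-convert` is on the PATH.

```python
import subprocess


def build_convert_command(source_html, output_epub, cover, title, authors):
    """Return the ebook-convert argument list used in step 3 of the post."""
    return [
        "ebook-convert", source_html, output_epub,
        "--no-default-epub-cover",
        "--pretty-print",
        "--preserve-cover-aspect-ratio",
        "--enable-heuristics",
        "--insert-blank-line",
        "--cover", cover,
        "--title", title,
        "--authors", authors,
    ]


def convert(source_html, output_epub, cover, title, authors):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    cmd = build_convert_command(source_html, output_epub, cover, title, authors)
    subprocess.run(cmd, check=True)
```

Calling `convert("The Girl on the Dock.html", "The Girl on the Dock.epub", "cover_image.png", "The Girl on the Dock", "G. Norman Lippert")` should be equivalent to the one-line command, provided Calibre's command-line tools are installed.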
|   |   | 
|  08-10-2014, 07:53 PM | #2 | 
| Resident Curmudgeon            Posts: 80,685 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | 
			
			You left out one very important step. When doing PDF > ePub, you have to do a 100% A/B check of the conversion. That means you have to check everything, including the formatting: every letter, every space, every punctuation mark, every number, everything. Even images need to be checked. And you need to check that it's properly formatted. It's not an easy task. It's not just "run Calibre, convert, and you are done." You are FAR from done. Without the A/B checking, you aren't done and never will be.
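Part of an A/B check can be automated. The following is only a rough sketch, assuming you have already extracted plain text from both the original and the converted book by some means not shown here. It uses Python's standard difflib to list word-level differences. The function name `ab_diff` is hypothetical, and this does not replace checking formatting and images by eye.

```python
import difflib


def ab_diff(original, converted):
    """List word-level differences between two plain-text versions."""
    a = original.split()
    b = converted.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record what the original had and what the conversion has.
            before = " ".join(a[i1:i2])
            after = " ".join(b[j1:j2])
            changes.append(f"{tag}: {before!r} -> {after!r}")
    return changes
```

Because the texts are split on whitespace, differences in line wrapping are ignored, while a dropped letter such as a missing drop cap shows up as a `replace` entry.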
		 | 
|   |   | 
|  08-11-2014, 12:12 AM | #3 | 
| Member            Posts: 21 Karma: 15000 Join Date: Feb 2014 Device: iPhone, iPad, Macbook Pro, Mac Pro | 
			
			Heh-heh. Very amusing post. But as I mentioned, this is as far as I would take it, especially for THIS book. I mean, it is now perfectly legible in my iPhone/iPad reader. I'm not exactly publishing it. Besides, it was a crappy authoring job provided by the publisher, so if anything, this is already a vast improvement. It is doubtful that Acrobat would output anything other than the correct text, and Calibre the same. The main difference, as I pointed out, is the formatting of SOME sentences and 5 missing characters that I replaced. You're being far too nitpicky.
		 | 
|   |   | 
|  08-11-2014, 01:20 AM | #4 | |
| Ex-Helpdesk Junkie            Posts: 19,421 Karma: 85400180 Join Date: Nov 2012 Location: The Beaten Path, USA, Roundworld, This Side of Infinity Device: Kindle Touch fw5.3.7 (Wifi only) | Quote: 
  Besides, this isn't an OCR so the text is fine.

You may want to play with the line-unwrap factor -- this was a killer and I gave up and joined every paragraph by hand. (Because I wanted to, not because I had to. I'm a little bit of a perfectionist myself, but it's all self-inflicted.)

Also, the missing capital letters at the beginning of chapters -- IIRC they appear misplaced, later on in the middle of the text. You may or may not care to fix it. But other than that, it should be fine.

Good job on a successful PDF conversion and Welcome to MobileRead!

Last edited by eschwartz; 08-11-2014 at 01:24 AM. | |
|   |   | 
|  08-11-2014, 03:23 AM | #5 | |||
| Wizard            Posts: 2,306 Karma: 13057279 Join Date: Jul 2012 Device: Kobo Forma, Nook | 
			
			Heh, not necessarily. Even if the PDF is generated digitally right from source, things like ligatures can get mangled. (Or as the original poster mentioned, a few of the drop caps didn't convert). Also, things like accented characters might not be stored properly. (Let's say instead of an é, the PDF might store it as an 'e', and then just "draw a little shape above it at coordinates X,Y").

It all depends on how the PDF was put together. It is an extremely complex format that was designed for OUTPUT + PRINT, and as most of the users here constantly mention, it is NOT a very good input/intermediate format.

Quote: 

So let us say on the physical page, you would see something like this:

HEADER
TEXT1
IMAGE
TEXT2
FOOTNOTES
FOOTER

In the actual PDF structure (or when you look at it in Text Mode), you might see it laid out like:

FOOTNOTES
TEXT1
TEXT2
HEADER
FOOTER
IMAGE

All PDF cares about is how the final output will look, so it doesn't really matter if the footnotes + text are stored out of order. As long as they DISPLAY in order, that is their ultimate goal.

Quote: 

Search: -</p>\s+<p>
Replace: (NOTHING)

What the first Regex will do is look at any paragraph that ends in a hyphen, and then combine it with the next paragraph. So this would fix something like:

Code:
    <p>He then went into the cab-</p>
    <p>oose and sat in the seat.</p>

Replace: \1 (There is a SPACE after that "\1 ").

What this second Regex will do is look for any paragraph that ends in anything that is NOT '>', right double quote, question mark, exclamation point, or period (feel free to stick whatever other punctuation marks you want in there). It will combine that with the paragraph after it. So it would fix something like this:

Code:
    <p>He then stood up,</p>
    <p>and shuffled his way out of the</p>
    <p>train, but he forgot his luggage!</p>

Quote: 

This is a great introduction, and thanks for taking your time to write up a few steps/tutorial.

Also, if you weren't aware, there is a PDF cropping program called k2pdfopt, which can be found in a sticky in the PDF section of MobileRead: https://www.mobileread.com/forums/sho...d.php?t=144711

Last edited by Tex2002ans; 08-11-2014 at 03:46 AM. | |||
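The two paragraph-joining regexes described in this post can be tried out with a small Python sketch. The first pattern is taken straight from the post. The search half of the second one does not appear above, so `SOFT_BREAK` below is an assumed reconstruction from the prose description (a paragraph ending in anything other than '>', right double quote, question mark, exclamation point, or period), not necessarily the exact pattern the poster used. The function names are made up for illustration.

```python
import re

# From the post: a paragraph ending in a hyphen, followed by a new paragraph.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

# ASSUMPTION: reconstructed from the description, not copied from the post.
# Captures a final character that is not '>', a right double quote, '?', '!' or '.'.
SOFT_BREAK = re.compile(r"([^>”?!.])</p>\s+<p>")


def join_hyphenated(html):
    """Drop the hyphen and the paragraph break so the word is rejoined."""
    return HYPHEN_BREAK.sub("", html)


def join_unfinished(html):
    """Replace the paragraph break with a single space (the '\\1 ' replacement)."""
    return SOFT_BREAK.sub(r"\1 ", html)
```

Note that the first pattern also removes hyphens that belong in the text, for example a compound word that happened to break across paragraphs, so the result still needs proofreading.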
|   |   | 
|  08-11-2014, 06:37 AM | #6 | 
| Color me gone            Posts: 2,089 Karma: 1445295 Join Date: Apr 2008 Location: Central Oregon Coast Device: PRS-300 | 
			
			In any PDF conversion, you have to be careful that the paragraphs are all in order. They often get jumbled and very often you can not tell they are out of order unless you read very carefully. It is particularly bad if the original had columns.
		 | 
|   |   | 
|  08-12-2014, 02:30 PM | #7 | 
| Well trained by Cats            Posts: 31,241 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | 
			
			With PDF, ligatures do not convert well, and images can end up placed anywhere. Does this affect all paragraphs, or just blockquotes or pull quotes? | 
|   |   | 
|  08-12-2014, 07:42 PM | #8 | 
| Color me gone            Posts: 2,089 Karma: 1445295 Join Date: Apr 2008 Location: Central Oregon Coast Device: PRS-300 | 
			
			Around images it is a problem. If there is a small chunk of text in column 2, it often gets thrown over into column 1. Text flowing around images is a challenge. You also can seldom trust the text which is provided as part of the PDF: the text layer is used solely for searching, so it is not proofread very well. They figure if most of it is all right, then the search will be OK. Generally, the newer the better, because the text of the original is used to create it, not added after the fact. But newer the better is not good here because the newer, the copyrighter, which can result in a ban for incautious people.

Everyone is happy the OP managed to get things to work so well. I guess what we are collectively saying is, "Don't make any promises to anyone based on your one-time experience. You could find yourself in the dog house floating down ^&*(& creek without a paddle!" | 
|   |   | 
|  08-13-2014, 01:24 AM | #9 | 
| Wizard            Posts: 4,520 Karma: 121692313 Join Date: Oct 2009 Location: Heemskerk, NL Device: PRS-T1, Kobo Touch, Kobo Aura | 
			
			That is why I usually end up with OCR and subsequent processing... Seen too many strange things with the text export...
		 | 
|   |   | 
|  08-13-2014, 03:06 AM | #10 | |
| Wizard            Posts: 2,306 Karma: 13057279 Join Date: Jul 2012 Device: Kobo Forma, Nook | Quote: 
  Even using the same program, you don't know which settings people clicked. Did they generate this PDF using LibreOffice with "Tagged PDF" enabled? Did they generate it in InDesign using the proper (accessibility) settings? What dang "PDF Printer" did they run it through in Word (and what were the settings)? After they generated the original PDF, did they run it through some crappy "PDF Editing" software to add a Cover/Title Page, or do something simple like ADD METADATA? (By the gods, those "Editing" programs absolutely mangle PDFs).

Since the text is quite crisp (it is a purely digital file), the OCR should be QUITE accurate and have few errors.

Although enough poopooing on how bad PDF is as an input format! Let's remain positive!   | |
|   |   | 
|  08-13-2014, 03:25 AM | #11 | 
| frumious Bandersnatch            Posts: 7,570 Karma: 20150435 Join Date: Jan 2008 Location: Spaniard in Sweden Device: Cybook Orizon, Kobo Aura | |
|   |   | 
|  08-13-2014, 09:33 AM | #12 | |
| Resident Curmudgeon            Posts: 80,685 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | Quote: 
 | |
|   |   | 
|  | 