Old 08-10-2014, 06:21 PM   #1
PHC
Member
Posts: 21
Karma: 15000
Join Date: Feb 2014
Device: iPhone, iPad, Macbook Pro, Mac Pro
Converting PDF to epub using Acrobat and Calibre CLI

I read the thread "Converting a PDF to mobi and having it come out right?", which was about converting the very book I have on hand to mobi. BTW, the book is 'The Girl on the Dock' and is only available as a PDF. I just spent a day and a half trying various methods of converting this book to epub. I think whoever decided to publish this book as a two-facing-pages PDF should be fired, if not molested with Petra's magic wand. ;-)

The solution I finally settled on is as follows, step by step. It produces a decent epub with all illustrations, but some of the sentences are inexplicably split and the paragraphs have no spacing between them, no matter what I tried. I could hand-edit these, but it is not worth my time. I suspect the lousy PDF is not formatted correctly. There is also no TOC, but that is because the stupidly authored PDF has none. I could add one, but for 5 short chapters it doesn't seem worthwhile. Note that I saved the cropped file as HTML because Calibre does an infinitely better conversion of HTML than it does of PDF.

Procedure:

1. Open PDF in Acrobat X:

Tools->Pages->Header & Footer->Remove…:

Removes only the text above the upper hairline.

Tools->Pages->Crop:

Select entire region between hairlines at full width. Double click the selection. Select Page Range->All and click OK. Crops all pages and can be undone. Do not check 'Remove White Margins' or it will include the hairlines.

Tools->Protection->Remove Hidden Information:

When Status is 'Finding Hidden Information…Done', click Remove.

File->Save As->More Options->HTML Web Page->Settings…:

Check 'Include Images'. Uncheck 'Run OCR if needed' or it produces unwanted artifacts.

See Cropping Pages Permanently with Acrobat Pro for more information.

2. The HTML file needs touching up: the 'illuminated' first character of each chapter is missing. Open the file in a plain text editor and find ">one<". Then type in the missing 'P' in 'Petra'. Find ">two<" and type in the missing 'T' in 'The'. Repeat the find up to ">five<", typing in the missing uppercase character each time.
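
If hand-editing is not appealing, step 2 could probably be scripted. This is only a sketch: it assumes each chapter marker shows up in the HTML as `>one<`, `>two<`, and so on (as in the find strings above), and that the truncated first word starts a text node somewhere after the marker. The function name and the sample HTML are made up for illustration; Acrobat's real export may well be laid out differently.

```python
def restore_drop_cap(html, marker, truncated_word, letter):
    """Prefix `letter` to the first `truncated_word` that starts a text node after `>marker<`."""
    tag = f">{marker}<"
    start = html.find(tag)
    if start == -1:
        return html  # marker not found; leave the text untouched
    # Look for the truncated word right after a closing '>' so that
    # attribute values and tag names are not matched by accident.
    pos = html.find(">" + truncated_word, start + len(tag))
    if pos == -1:
        return html
    insert_at = pos + 1
    return html[:insert_at] + letter + html[insert_at:]


# Hypothetical sample resembling the export described above.
sample = "<h2>one</h2><p>etra walked along the dock.</p>"
fixed = restore_drop_cap(sample, "one", "etra", "P")
print(fixed)
```

Each chapter would need its own call with the right marker, truncated word, and letter, and the result should still be eyeballed afterwards.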

3. Calibre CLI:

Code:
ebook-convert "The Girl on the Dock.html" "The Girl on the Dock.epub" --no-default-epub-cover --pretty-print --preserve-cover-aspect-ratio --enable-heuristics --insert-blank-line --cover "cover_image.png" --title "The Girl on the Dock" --authors "G. Norman Lippert"
This is specific to my book but could easily be adapted to any book.
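
One way to adapt it to another book is to build the argument list in a small script. This is a sketch, not a tested workflow: the function name is hypothetical, and the options are simply the ones from the command above.

```python
import shlex


def build_convert_command(html_path, epub_path, title, authors, cover):
    """Return the ebook-convert argument list from step 3 for an arbitrary book."""
    return [
        "ebook-convert", html_path, epub_path,
        "--no-default-epub-cover",
        "--pretty-print",
        "--preserve-cover-aspect-ratio",
        "--enable-heuristics",
        "--insert-blank-line",
        "--cover", cover,
        "--title", title,
        "--authors", authors,
    ]


cmd = build_convert_command(
    "The Girl on the Dock.html",
    "The Girl on the Dock.epub",
    title="The Girl on the Dock",
    authors="G. Norman Lippert",
    cover="cover_image.png",
)
# shlex.join quotes arguments that contain spaces, so the output can be pasted into a shell.
print(shlex.join(cmd))
# To run it directly (needs Calibre's command-line tools on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces or apostrophes.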


Original PDF:

[Image violates guidelines for size - MODERATOR]

Resulting epub:

[Image violates guidelines for size - MODERATOR]

Last edited by Dr. Drib; 08-17-2014 at 06:44 AM.
Old 08-10-2014, 07:53 PM   #2
JSWolf
Resident Curmudgeon
Posts: 74,044
Karma: 129333562
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
You left out one very important step. When doing PDF > ePub, you have to do a 100% A/B check of the conversion. That means checking everything, including the formatting: every letter, every space, every punctuation mark, every number, everything. Even images need to be checked. And you need to check that it's properly formatted. It's not an easy task. It's not just a matter of running Calibre, converting, and being done. You are FAR from done. Without the A/B checking, you aren't done and never will be.
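
Part of an A/B check can be roughed out with a word-level diff. The sketch below assumes you already have plain text extracted from both the PDF and the epub (how to extract it is not shown), and the function name is hypothetical. It only flags differing words; it does not check formatting or images, so it is no substitute for reading the result.

```python
import difflib


def word_differences(source_text, converted_text):
    """List (operation, source words, converted words) wherever the two texts differ."""
    a = source_text.split()
    b = converted_text.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical snippets: the converted text lost a drop cap.
source = "Petra walked to the dock"
converted = "etra walked to the dock"
print(word_differences(source, converted))
```

Splitting on whitespace means differences in line wrapping are ignored, which is usually what you want when comparing a fixed-layout PDF with reflowable text.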
Old 08-11-2014, 12:12 AM   #3
PHC
Heh-heh. Very amusing post. But as I mentioned, it is as far as I would take it, especially for THIS book. I mean it is now perfectly legible in my iPhone/iPad reader. I'm not exactly publishing it. Besides, it was a crappy authoring job provided by the publisher, so if anything, this is already a vast improvement. It is doubtful that Acrobat would output anything other than the correct text and Calibre the same. The main difference, as I pointed out, is the formatting of SOME sentences and 5 missing characters that I replaced. You're being far too nitpicking.
Old 08-11-2014, 01:20 AM   #4
eschwartz
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Originally Posted by PHC
Heh-heh. Very amusing post. But as I mentioned, it is as far as I would take it, especially for THIS book. I mean it is now perfectly legible in my iPhone/iPad reader. I'm not exactly publishing it. Besides, it was a crappy authoring job provided by the publisher, so if anything, this is already a vast improvement. It is doubtful that Acrobat would output anything other than the correct text and Calibre the same. The main difference, as I pointed out, is the formatting of SOME sentences and 5 missing characters that I replaced. You're being far too nitpicking.
Don't mind Jon, he is a perfectionist.

Besides, this isn't an OCR so the text is fine.
You may want to play with the line-unwrap factor -- this was a killer and I gave up and joined every paragraph by hand.
(Because I wanted to, not because I had to. I'm a little bit of a perfectionist myself, but it's all self-inflicted.)
Also, the missing capital letters at the beginning of chapters -- IIRC they appear misplaced, later on in the middle of the text. You may or may not care to fix it.

But other than that, it should be fine.

Good job on a successful PDF conversion and Welcome to MobileRead!
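
For the HTML input used in this thread, the unwrap setting mentioned above would presumably be Calibre's `--html-unwrap-factor` heuristic option (PDF input has its own `--unwrap-factor`). A rough sketch that builds one command per candidate value so the results can be compared side by side; the function name, file names, and candidate values are placeholders, not recommendations:

```python
def unwrap_trials(html_path, factors):
    """Build one ebook-convert command per unwrap factor (values between 0 and 1)."""
    commands = []
    for factor in factors:
        if not 0 < factor < 1:
            raise ValueError(f"unwrap factor must be between 0 and 1, got {factor}")
        epub_path = f"unwrap-{factor:.2f}.epub"  # hypothetical output name
        commands.append([
            "ebook-convert", html_path, epub_path,
            "--enable-heuristics",
            "--html-unwrap-factor", str(factor),
        ])
    return commands


for cmd in unwrap_trials("book.html", [0.2, 0.4, 0.6]):
    print(" ".join(cmd))
```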

Last edited by eschwartz; 08-11-2014 at 01:24 AM.
Old 08-11-2014, 03:23 AM   #5
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by eschwartz
Besides, this isn't an OCR so the text is fine.
Heh, not necessarily. Even if the PDF is generated digitally right from source, things like ligatures can get mangled. (Or as the original poster mentioned, a few of the drop caps didn't convert).

Also, things like accented characters might not be stored properly. (Let's say instead of an é, the PDF might store it as an 'e', and then just "draw a little shape above it at coordinates X,Y").
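
When the extracted text does contain ligature code points or combining marks, Unicode normalization can clean up some of it. A small sketch, assuming the extractor emitted the "ﬁ" ligature as U+FB01 and the accent as a combining acute (U+0301) after the base letter; if the accent was dropped or drawn as a separate graphic, nothing here will bring it back. The function name is hypothetical.

```python
import unicodedata


def normalize_extracted_text(text):
    """Expand compatibility ligatures and recombine base letters with combining marks."""
    # NFKC maps ligature code points such as U+FB01 to plain letters
    # and composes 'e' + U+0301 (combining acute) into a single U+00E9.
    return unicodedata.normalize("NFKC", text)


raw = "The \ufb01rst cafe\u0301"  # 'fi' ligature, then 'e' followed by a combining acute
clean = normalize_extracted_text(raw)
print(clean)
```

NFKC also rewrites other compatibility characters (for example, the single-character ellipsis becomes three periods), so it is worth reviewing what changed before keeping the result.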

It all depends on how the PDF was put together. It is an extremely complex format that was designed for OUTPUT + PRINT, and as most of the users here constantly mention, it is NOT a very good input/intermediate format.

Quote:
Originally Posted by eschwartz
Also, the missing capital letters at the beginning of chapters -- IIRC they appear misplaced, later on in the middle of the text. You may or may not care to fix it.
This also depends on how the PDF is built. Many of the PDFs I have seen are not logically stored/tagged/accessible.

So let us say on the physical page, you would see something like this:

HEADER
TEXT1
IMAGE
TEXT2
FOOTNOTES
FOOTER

In the actual PDF structure (or when you look at it in Text Mode), you might see it laid out like this:

FOOTNOTES
TEXT1
TEXT2
HEADER
FOOTER
IMAGE

All PDF cares about is how the final output will look, so it doesn't really matter if the footnotes + text are stored out of order. As long as they DISPLAY in order, that is their ultimate goal.
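
To illustrate the idea, here is a toy sketch that recovers reading order by sorting blocks on their vertical position. The coordinates are invented, the `Block` type is hypothetical, and it assumes a simple single-column page; columns or text wrapped around images would need much more than this.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    top: float  # height of the block's top edge above the bottom of the page


def reading_order(blocks):
    """Sort blocks top-to-bottom; PDF y coordinates grow upward, so larger `top` comes first."""
    return sorted(blocks, key=lambda block: block.top, reverse=True)


# Storage order from the example above, with made-up coordinates.
stored = [
    Block("FOOTNOTES", 120), Block("TEXT1", 650), Block("TEXT2", 300),
    Block("HEADER", 760), Block("FOOTER", 40), Block("IMAGE", 480),
]
print([block.name for block in reading_order(stored)])
```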

Quote:
Originally Posted by PHC
[...] It produces a decent epub with all illustrations, but some of the sentences are unexplainably split and the paragraphs have no spacing between them no matter how hard I tried. I could hand edit these but it is not worth my time. I suspect the lousy PDF is not formatted correctly.
I tend to use these two Regexes, and fix these one-by-one. It doesn't take very long to go through an entire book.

Search: -</p>\s+<p>
Replace: (NOTHING)

The first regex looks for any paragraph that ends in a hyphen and joins it to the next paragraph, removing the hyphen. So this would fix something like:

Code:
<p>He then went into the cab-</p>
<p>oose and sat in the seat.</p>
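
For anyone who prefers a script to an editor's search-and-replace, the same hyphen pattern can be tried in Python. This is a sketch using the standard `re` module; the function name is hypothetical.

```python
import re

HYPHEN_SPLIT = re.compile(r"-</p>\s+<p>")


def join_hyphenated(html):
    """Remove a paragraph break that follows a trailing hyphen, along with the hyphen."""
    return HYPHEN_SPLIT.sub("", html)


sample = "<p>He then went into the cab-</p>\n<p>oose and sat in the seat.</p>"
print(join_hyphenated(sample))
```

Applied blindly, this would also glue together genuine compounds that happen to break at the hyphen, which is a good reason to go through the matches one by one.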
Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1

(There is a SPACE after that "\1 ").

The second regex looks for any paragraph that ends in anything that is NOT '>', a right double quote, a question mark, an exclamation point, or a period (feel free to stick whatever other punctuation marks you want in there). It joins that paragraph to the one after it, with a space in between.

So it would fix something like this:

Code:
<p>He then stood up,</p>
<p>and shuffled his way out of the</p>
<p>train, but he forgot his luggage!</p>
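
The mid-sentence pattern can be sketched in Python the same way, using the standard `re` module (the function name is hypothetical). Since a hyphen is not in the excluded set, this pattern would also match paragraphs ending in a hyphen and insert a space, so the hyphen search is best run first.

```python
import re

# Same character class as the search pattern above; inside [...] the
# backslashes before ?, ! and . are optional, so they are dropped here.
MID_SENTENCE_SPLIT = re.compile(r"([^>”?!.])</p>\s+<p>")


def join_mid_sentence(html):
    """Join a paragraph that ends mid-sentence to the next one, separated by a space."""
    return MID_SENTENCE_SPLIT.sub(r"\1 ", html)


sample = (
    "<p>He then stood up,</p>\n"
    "<p>and shuffled his way out of the</p>\n"
    "<p>train, but he forgot his luggage!</p>"
)
print(join_mid_sentence(sample))
```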
Quote:
Originally Posted by eschwartz
Good job on a successful PDF conversion and Welcome to MobileRead!
Same! Welcome PHC, and enjoy your stay.

This is a great introduction, and thanks for taking your time to write up a few steps/tutorial.

Also, if you weren't aware, there is a PDF cropping program called k2pdfopt, which can be found in a sticky in the PDF section of MobileRead:

https://www.mobileread.com/forums/sho...d.php?t=144711

Last edited by Tex2002ans; 08-11-2014 at 03:46 AM.
Old 08-11-2014, 06:37 AM   #6
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
In any PDF conversion, you have to be careful that the paragraphs are all in order. They often get jumbled and very often you can not tell they are out of order unless you read very carefully. It is particularly bad if the original had columns.
Old 08-12-2014, 02:30 PM   #7
theducks
Well trained by Cats
Posts: 29,818
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
With PDF, ligatures do not convert well, and images can end up placed anywhere.

Does this affect all Paragraphs or just blockquotes or pull quotes?
Old 08-12-2014, 07:42 PM   #8
mrmikel
Around images it is a problem. If there is a small chunk of text in column 2, it often gets thrown over into column 1. Text flowing around images is a challenge.

You also can seldom trust the text which is provided as part of the PDF: the text layer is used solely for searching, so it is not proofread very well. They figure if most of it is all right, then the search will be OK.

Generally, the newer the better, because the text of the original is used to create the PDF rather than added after the fact. But "the newer the better" is not much help here, because the newer the book, the more likely it is under copyright, which can result in a ban for incautious people.

Everyone is happy the OP managed to get things to work so well. I guess what we are collectively saying is, "Don't make any promises to anyone based on your one-time experience. You could find yourself in the dog house floating down ^&*(& creek without a paddle!"
Old 08-13-2014, 01:24 AM   #9
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
That is why I usually end up with OCR and subsequent processing... Seen too many strange things with the text export...
Old 08-13-2014, 03:06 AM   #10
Tex2002ans
Quote:
Originally Posted by Toxaris
That is why I usually end up with OCR and subsequent processing... Seen too many strange things with the text export...
Yep, can't trust any of these dang PDF creation programs.

Even using the same program, you don't know which settings people clicked. Did they generate this PDF using LibreOffice with "Tagged PDF" enabled? Did they generate it with InDesign using the proper (accessibility) settings? What dang "PDF Printer" did they run it through in Word (and what were the settings)?

After they generated the original PDF, did they run it through some crappy "PDF Editing" software to add a Cover/Title Page, or do something simple like ADD METADATA? (By the gods, those "Editing" programs absolutely mangle PDFs).

Since the text is quite crisp (since it is a purely digital file), the OCR should be QUITE accurate, and have few errors.

Although enough poopooing on how bad PDF is as an input format! Let's remain positive!
Old 08-13-2014, 03:25 AM   #11
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by Tex2002ans
Since the text is quite crisp (since it is a purely digital file), the OCR should be QUITE accurate, and have few errors.
Not to mention you can generate raster images at any desired resolution.
Old 08-13-2014, 09:33 AM   #12
JSWolf
Quote:
Originally Posted by PHC
Heh-heh. Very amusing post. But as I mentioned, it is as far as I would take it, especially for THIS book. I mean it is now perfectly legible in my iPhone/iPad reader. I'm not exactly publishing it. Besides, it was a crappy authoring job provided by the publisher, so if anything, this is already a vast improvement. It is doubtful that Acrobat would output anything other than the correct text and Calibre the same. The main difference, as I pointed out, is the formatting of SOME sentences and 5 missing characters that I replaced. You're being far too nitpicking.
It's not meant to be amusing. It's meant to describe what is, and what is is not easy when it comes to converting PDF.
All times are GMT -4.