Old 08-11-2014, 03:23 AM   #5
Tex2002ans
Quote:
Originally Posted by eschwartz
Besides, this isn't an OCR so the text is fine.
Heh, not necessarily. Even if the PDF is generated digitally right from source, things like ligatures can get mangled. (Or, as the original poster mentioned, a few of the drop caps didn't convert.)

Also, things like accented characters might not be stored properly. (Let's say instead of an é, the PDF might store it as an 'e', and then just "draw a little shape above it at coordinates X,Y").
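If the extracted text does come out with ligature characters or with a plain letter followed by a separate combining accent, Unicode normalization can repair a lot of it. Here is a minimal Python sketch. The function name and the sample string are made up for illustration, and it assumes the extractor actually emitted the accent as a combining character. If the PDF only drew the accent as a shape, there is nothing in the text to recover.

```python
import unicodedata

def repair_extracted_text(text: str) -> str:
    """Expand ligature code points and compose letter + combining accent."""
    # NFKC replaces compatibility ligatures such as U+FB01 with "fi"
    # and composes "e" + U+0301 (combining acute) into a single "é".
    return unicodedata.normalize("NFKC", text)

# Hypothetical extractor output: "fi"/"fl" ligatures and a decomposed é.
sample = "The \ufb01nal cafe\u0301 \ufb02oor"
print(repair_extracted_text(sample))  # The final café floor
```

Be aware that NFKC also rewrites other compatibility characters (superscript digits, the single-character ellipsis, and so on), so check the result before running it across a whole book.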

It all depends on how the PDF was put together. It is an extremely complex format that was designed for OUTPUT + PRINT, and as most of the users here constantly mention, it is NOT a very good input/intermediate format.

Quote:
Originally Posted by eschwartz
Also, the missing capital letters at the beginning of chapters -- IIRC they appear misplaced, later on in the middle of the text. You may or may not care to fix it.
This also depends on how the PDF was built. Many of the PDFs I have seen are not logically stored/tagged/accessible.

So let us say on the physical page, you would see something like this:

HEADER
TEXT1
IMAGE
TEXT2
FOOTNOTES
FOOTER

In the actual PDF structure (or when you look at it in Text Mode), you might see it laid out like this:

FOOTNOTES
TEXT1
TEXT2
HEADER
FOOTER
IMAGE

All a PDF cares about is how the final output will look, so it doesn't really matter if the footnotes + text are stored out of order. As long as everything DISPLAYS in the right place, the format has done its job.
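To make the mismatch concrete, here is a rough Python sketch of how a converter could put the blocks back into reading order by sorting on page position. The Block class and all of the coordinates are hypothetical, and I am assuming the extraction tool hands you each block's position measured from the top-left of the page. Real converters need much smarter heuristics (multi-column pages break a simple top-to-bottom sort).

```python
from dataclasses import dataclass

@dataclass
class Block:
    label: str
    top: float   # distance from the top edge of the page (made-up units)
    left: float  # distance from the left edge of the page

# Blocks in the order they happen to be stored inside the file.
stored_order = [
    Block("FOOTNOTES", 650, 72),
    Block("TEXT1", 100, 72),
    Block("TEXT2", 420, 72),
    Block("HEADER", 40, 72),
    Block("FOOTER", 740, 72),
    Block("IMAGE", 260, 72),
]

def reading_order(blocks):
    """Sort top-to-bottom, then left-to-right for blocks at the same height."""
    return sorted(blocks, key=lambda b: (b.top, b.left))

print([b.label for b in reading_order(stored_order)])
# ['HEADER', 'TEXT1', 'IMAGE', 'TEXT2', 'FOOTNOTES', 'FOOTER']
```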

Quote:
Originally Posted by PHC
[...] It produces a decent epub with all illustrations, but some of the sentences are unexplainably split and the paragraphs have no spacing between them no matter how hard I tried. I could hand edit these but it is not worth my time. I suspect the lousy PDF is not formatted correctly.
I tend to use these two Regexes, and fix these one-by-one. It doesn't take very long to go through an entire book.

Search: -</p>\s+<p>
Replace: (NOTHING)

The first Regex looks for any paragraph that ends in a hyphen and combines it with the next paragraph. So this would fix something like:

Code:
<p>He then went into the cab-</p>
<p>oose and sat in the seat.</p>
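If you would rather try the same thing outside an editor's Search/Replace box, a small Python sketch might look like the following. The function name is made up, and note that it replaces every match in one go, which is riskier than stepping through them one-by-one: a genuinely hyphenated word split across paragraphs (like "well-" / "known") would lose its hyphen.

```python
import re

# Same pattern as the Search field: a hyphen right before </p>,
# whitespace between the tags, then the next <p>.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

def join_hyphenated(html: str) -> str:
    """Remove the hyphen and the paragraph break, gluing the word together."""
    return HYPHEN_BREAK.sub("", html)

sample = (
    "<p>He then went into the cab-</p>\n"
    "<p>oose and sat in the seat.</p>"
)
print(join_hyphenated(sample))
# <p>He then went into the caboose and sat in the seat.</p>
```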
Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1

(Note that there is a SPACE after the \1 in the Replace field.)

This second Regex looks for any paragraph that ends in anything that is NOT '>', a right double quote, question mark, exclamation point, or period (feel free to stick whatever other punctuation marks you want in there). It combines that paragraph with the one after it, leaving a single space at the join.

So it would fix something like this:

Code:
<p>He then stood up,</p>
<p>and shuffled his way out of the</p>
<p>train, but he forgot his luggage!</p>
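The same caveat applies if you script this one: the Python sketch below (function name made up) merges every match at once, so a heading or a line of poetry that legitimately ends without punctuation would get pulled into the next paragraph. Going one-by-one lets you skip those.

```python
import re

# Same pattern as the Search field: capture the last character of a
# paragraph when it is NOT '>', '”', '?', '!' or '.', then match the break.
BROKEN_PARAGRAPH = re.compile(r"([^>”?!.])</p>\s+<p>")

def join_broken(html: str) -> str:
    """Keep the captured character and replace the break with one space."""
    return BROKEN_PARAGRAPH.sub(r"\1 ", html)

sample = (
    "<p>He then stood up,</p>\n"
    "<p>and shuffled his way out of the</p>\n"
    "<p>train, but he forgot his luggage!</p>"
)
print(join_broken(sample))
# <p>He then stood up, and shuffled his way out of the train, but he forgot his luggage!</p>
```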
Quote:
Originally Posted by eschwartz
Good job on a successful PDF conversion and Welcome to MobileRead!
Same! Welcome PHC, and enjoy your stay.

This is a great introduction, and thanks for taking your time to write up a few steps/tutorial.

Also, if you weren't aware, there is a PDF cropping program called k2pdfopt, which can be found in a sticky in the PDF section of MobileRead:

https://www.mobileread.com/forums/sho...d.php?t=144711

Last edited by Tex2002ans; 08-11-2014 at 03:46 AM.