PDF to Kindle: The unobtainable Holy Grail of ebooks - Page 3

emalvick · 10-18-2011, 06:56 PM

Quote:

Originally Posted by DiapDealer

Actually most people are fine with PDF's. I know I am (except on 6 inch screens). Most of the negativity comes into play when someone wants to turn a PDF into something else. A PDF just doesn't convert well... easily. I doubt it ever will.

Your last sentence is very important... The whole point of PDF was so that a document would always look the same regardless of the computer reading it or the printer printing it.

... That being said, as a former graduate student and a researcher who reads and writes technical PDF's I can't say I am happy or not about what you can and can't do with them on a Kindle. I appreciate what they do for journals, printing, etc, but I do wish I could easily read them on my Kindle.

I can imagine other tablets are a decent solution, even a DX may be a decent solution, but in the technical world color is becoming more common, and I personally hate reading off a backlit monitor.

I really just wish the Kindle had a mechanism so that PDF's could be somewhat cropped to a specific size (you can do that via the zoom already) and then the page forward buttons would quickly scroll you to the bottom of the page and then the next page... similar to the way the page-up and page-down button works on a PC with Acrobat Reader.

It isn't ideal, but I don't think PDF's should need converting to ebooks. I just wish they could be handled a little better. I hate having to scroll around with the arrow buttons and the page forward can have me skipping parts of pages that I don't want to be skipping. The color issue will have to wait until colored e-ink becomes an option if it ever does.

By the way, I do find that turning the Kindle sideways and reading a PDF (6 in screen) is a reasonable method of reading a PDF.

DiapDealer · 10-18-2011, 07:06 PM

Quote:

Originally Posted by Blossom

So it doesn't convert italics? I was going to give a shot but I do get good results with Acrobat Pro on Novel PDFs. It pulls the styles from the PDF just fine as long as the PDF is tagged.

No, I assume it just uses the OCR text layer, but I could be wrong. I use Acrobat Pro a lot too, but it's always been a bit of a toss-up between it and other programs for me. I like that Acrobat will retain a lot of the styles when exporting, but if the page numbers and such (headers and footers) are not true Adobe headers and footers (as is usually the case), I still have to rely on external programs to strip them. And even then they're not truly "removed" from the PDF, only hidden from view (and conversion programs will add them right back into the mobi or epub).

So I usually have to decide between HTML with italics, but with pesky headers and footers to track down and remove (Acrobat), or really nice, clean HTML with no pesky headers and footers, but no italics (PDFMasher). Both need regexing to repair paragraph fragments.

JSWolf · 10-18-2011, 07:29 PM

Quote:

Originally Posted by jswinden

This whole PDF discussion thing is getting pretty old. Adobe designed PDFs to be printed, not read on E Ink readers. They designed PDFs over 20 years ago for the purpose of being able to exchange secured documents digitally without worrying about unauthorized editing of those documents. For example, a lawyer could send a contract to a client via email. PDFs were never designed for our viewing pleasure!!! True, Adobe has tried to update PDF over the years, but it is still THE WORST form of document for reading on an electronic device.

Not quite. PDF was created to allow you to send a document to someone to be printed, so they don't need to have the same program/fonts that were used to create the document. It wasn't about not being able to edit. It was about being able to duplicate the document on paper, so what I send you will look the same on paper as when I print it from whatever program created it.

PDF was never designed to have the information needed to convert it to another format and it never will. Basically, if you have a PDF, the only way to convert it is to pick a program to convert it and then A/B compare every single pixel/letter/punctuation/etc. and also do any format fixing that needs to be done. Then you'll have your conversion. There is NO program that can convert a PDF of any reasonable size error free.

JSWolf · 10-18-2011, 07:42 PM

Quote:

Originally Posted by DiapDealer

No, I assume it just uses the OCR text layer, but I could be wrong. I use Acrobat Pro a lot too, but it's always been a bit of a toss-up between it and other programs for me. I like that Acrobat will retain a lot of the styles when exporting, but if the page numbers and such (headers and footers) are not true Adobe headers and footers (as is usually the case), I still have to rely on external programs to strip them. And even then they're not truly "removed" from the PDF, only hidden from view (and conversion programs will add them right back into the mobi or epub).

So I usually have to decide between HTML with italics, but with pesky headers and footers to track down and remove (Acrobat), or really nice, clean HTML with no pesky headers and footers, but no italics (PDFMasher). Both need regexing to repair paragraph fragments.

Acrobat Pro can handle the headers/footers just fine. All you need to do is crop the pages so the headers/footers don't exist and then convert. That gets rid of them very well. Better than any other method.

Abichuela · 10-18-2011, 10:58 PM

Forgive me if I missed something, but if PDF isn't the best format to convert from, what is? Is it better to convert from a Word format to .mobi or .epub?

Blossom · 10-18-2011, 11:06 PM

Quote:

Originally Posted by Abichuela

Forgive me if I missed something, but if PDF isn't the best format to convert from, what is? Is it better to convert from a Word format to .mobi or .epub?

Lit, epub, or html. Those are easy formats to work with.
I use Word html as my source, then import it into Calibre and convert to mobi and epub.

tentimes · 10-19-2011, 07:15 AM

Quote:

Originally Posted by Blossom

So it doesn't convert italics? I was going to give a shot but I do get good results with Acrobat Pro on Novel PDFs. It pulls the styles from the PDF just fine as long as the PDF is tagged.

Blossom, what is it you are doing with Acrobat Pro to convert, please? I have got a trial of it, but I'm really unsure of how it is going to help me. I have tried exporting to Word, but the results were pretty poor unfortunately (free.kindle.com converted better).

Maybe you are doing a few things together that are helping to make a good conversion? I would be most grateful for any advice.

avantman42 · 10-19-2011, 08:51 AM

Quote:

Originally Posted by tentimes

If it's a matter of a series of text boxes per page, then it's a matter of (assuming most pages don't overlap these box areas and overprint) taking the text boxes in order, getting the relative font sizes, assuming the large font sizes with the text form "Chapter XX" are start of chapter

Have you tried Calibre? The heuristic processing option does this sort of thing, but is disabled by default.
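
The heuristic tentimes describes in the quote above (a large font plus text of the form "Chapter XX" marks a chapter start) could be sketched roughly like this in Python. This is only an illustration under stated assumptions, not how Calibre's heuristic processing actually works: the `TextBox` structure, the `find_chapter_starts` name, and the 1.2 size ratio are all made up for the example, and it assumes some other tool has already extracted the text boxes and font sizes in reading order.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    # Hypothetical structure: one extracted text box and its font size.
    text: str
    font_size: float


# Text that starts with "Chapter" followed by a number or word.
CHAPTER_RE = re.compile(r"^chapter\s+\w+", re.IGNORECASE)


def find_chapter_starts(boxes: list[TextBox], ratio: float = 1.2) -> list[int]:
    """Return indexes of boxes that look like chapter headings."""
    if not boxes:
        return []
    # Treat the median font size as the body text size (an assumption).
    body_size = median(b.font_size for b in boxes)
    return [
        i
        for i, b in enumerate(boxes)
        if b.font_size >= body_size * ratio and CHAPTER_RE.match(b.text.strip())
    ]


boxes = [
    TextBox("Chapter 1", 18),
    TextBox("It was a dark night.", 11),
    TextBox("See Chapter 3 for details.", 11),
    TextBox("Chapter 2", 18),
    TextBox("More text.", 11),
    TextBox("The end.", 11),
]
print(find_chapter_starts(boxes))
```

Requiring both the larger font and the "Chapter" prefix keeps ordinary sentences that merely mention a chapter from being flagged, but books with unnumbered or oddly styled headings would still need hand checking.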

Quote:

Originally Posted by tentimes

"Chapter XX" are start of chapter if there is no internal byte code to give you end of chapter (which I bet there is)

Honestly, I'd be prepared to bet there isn't, but I'd like to be wrong.

I've found pdftohtml gives good results with some PDFs. Calibre and pdftohtml are both open source, so if you do decide to try and write something better, it might be worth having a look at how they do things.

Zeebra · 10-19-2011, 09:45 AM

Quote:

Originally Posted by DiapDealer

I've found the footnotes function to be a tad flaky with PDFMasher, but I've gotten pretty close a few times with documents that had buckets of footnotes. Close enough that I didn't mind fixing up the results by hand. And it seems to be getting better all the time. The different sorting abilities make it pretty powerful and it's by far my favorite for very simply formatted novels, but losing italics really annoys me when converting PDF's (not PDFMasher's fault, I know).

Yeah when I figured out the sorting abilities it was a big "TA-DA!" moment for me to identify the headers and footers easily. I kinda like this app, not that I had many PDFs to convert but it's pretty cool.
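
For anyone who wants to hunt headers and footers in extracted text rather than in the PDFMasher interface, one possible approach is to look for lines that keep turning up at the top or bottom of pages. The sketch below is an assumption-laden illustration, not PDFMasher's method: the function names, the 50% threshold, and the idea that each page arrives as a list of text lines are all hypothetical.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    # Replace digit runs so "Page 12" and "Page 13" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def repeated_edge_lines(pages: list[list[str]], min_share: float = 0.5) -> set[str]:
    """Return normalized first/last lines found on at least min_share of pages."""
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # A set avoids counting a one-line page twice.
        counts.update({normalize(lines[0]), normalize(lines[-1])})
    threshold = len(pages) * min_share
    return {line for line, n in counts.items() if n >= threshold}


pages = [
    ["My Novel", "It began.", "Page 1"],
    ["My Novel", "It went on.", "Page 2"],
    ["My Novel", "It ended.", "Page 3"],
]
print(repeated_edge_lines(pages))
```

Anything it reports is only a candidate; a short chapter title repeated at the top of many pages would be flagged too, so the results still want a quick look before deleting.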

Blossom · 10-19-2011, 01:04 PM

Quote:

Originally Posted by tentimes

Blossom, what is it you are doing with Acrobat Pro to convert, please? I have got a trial of it, but I'm really unsure of how it is going to help me. I have tried exporting to Word, but the results were pretty poor unfortunately (free.kindle.com converted better).

Maybe you are doing a few things together that are helping to make a good conversion? I would be most grateful for any advice.

I just convert the PDF to HTML 3.2 using Acrobat Pro 9. Then I open it up in Word 2003 and fix the broken sentences with regular expressions. Then I fix the chapter headers to match each other. I then run several regular expressions to check for things I missed, like page numbers, headers, footers, etc. It takes about 5 to 10 minutes to get a good readable copy once you have the method down.

alansplace · 10-19-2011, 01:10 PM

Quote:

Originally Posted by Blossom

I just convert the PDF to HTML 3.2 using Acrobat Pro 9. Then I open it up in Word 2003 and fix the broken sentences with regular expressions. Then I fix the chapter headers to match each other. I then run several regular expressions to check for things I missed, like page numbers, headers, footers, etc. It takes about 5 to 10 minutes to get a good readable copy once you have the method down.

if you've saved those regex[s] you should zip them up and share them in a post somewhere.

Blossom · 10-19-2011, 01:39 PM

Quote:

Originally Posted by alansplace

if you've saved those regex[s] you should zip them up and share them in a post somewhere.

They are really more Word 2003 wildcards, but basically these are my reference notes. I hope you can make heads or tails of them.

Code:

Do a S&R for Manual line breaks and replace with paragraph marks.

MS Word uses ^13 for a return, with the wildcard box checked in the Search box.

^13([a-z]) = This checks for broken sentences

([a-zA-Z])^13 = This checks for broken sentences

([a-z])^13([A-Z]) = This checks for broken sentences

Replace Box
\1 and \2 if there is more than one bracket; add appropriate spaces as needed.

[0-9]{1,}^13 = This checks for page numbers 
[0-9]{1,} = Second check for page numbers and OCR error where numbers replace letters. 

[A-Z]{3,} = Match Case checked, Replace 3, if needed for more word matches.

On chapter headers I use S&R. If they are already in bold, this makes it easier: I do a search to find bold text using the formatting button. Word has a powerful search! You can search by formatting or wildcards, special Word characters, or just the regular way. I can then do a replace only on the formatting.

I also use the Styles panel to make batch changes. A lot of back titles I buy have inconsistent formatting, and this feature comes in handy to fix that quickly. Highlighting a chapter heading, clicking Clear Formatting, and then clicking the appropriate style will really help it take on the formatting you want.

I also use macros to make it a lot faster!
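
For readers working outside Word, a few of the wildcard searches in these notes translate fairly directly to ordinary regular expressions. The Python below is a rough sketch, not Blossom's actual workflow: it assumes plain text where `\n` plays the role of Word's `^13`, the function names are invented for the example, and the page-number check is deliberately stricter than the wildcard (it only matches lines that consist entirely of digits). What the all-caps search is meant to catch is a guess on my part.

```python
import re


def join_broken_sentences(text: str) -> str:
    # Like ^13([a-z]): a line break followed by a lowercase letter is treated
    # as a sentence broken across lines, so the break becomes a space.
    return re.sub(r"\n([a-z])", r" \1", text)


def find_page_number_lines(text: str) -> list[str]:
    # Like [0-9]{1,}^13, but only lines made up solely of digits.
    return re.findall(r"^[0-9]+$", text, flags=re.MULTILINE)


def find_all_caps_runs(text: str) -> list[str]:
    # Like [A-Z]{3,} with Match Case checked: runs of three or more capitals.
    return re.findall(r"[A-Z]{3,}", text)


sample = "The quick brown fox\njumped over the lazy dog.\n42\nCHAPTER TWO\nIt was late."
print(join_broken_sentences(sample))
print(find_page_number_lines(sample))
print(find_all_caps_runs(sample))
```

As with the Word version, these are best used to find candidates and step through them, since a line that legitimately starts with a lowercase letter (a list item, say) would be joined by mistake.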

DiapDealer · 10-19-2011, 02:41 PM

For broken sentences in HTML, I use the following search regex:

Code:

([^.”":?’'!>—…)])</p>\s+<p[^>]*>

And the replace would be:

Code:

\1

(NOTE: there needs to be a "space" character following the \1 for it to work properly)

I don't trust it enough to blindly do a "Replace All" on a whole book, but I rarely have to intervene when stepping through a document one instance at a time.
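
The same search and replace can be tried outside an editor. Here is a minimal Python sketch using the pattern exactly as posted above, with the replacement written as `\1` plus the trailing space. The function names and the sample HTML are made up for illustration, and `list_candidates` is just one way to imitate stepping through matches before committing to a replace.

```python
import re

# The search pattern as posted: a paragraph whose last character is not
# sentence-ending punctuation (or a closing tag), followed by a new <p>.
BROKEN_PARA = re.compile(r"""([^.”":?’'!>—…)])</p>\s+<p[^>]*>""")


def join_broken_paragraphs(html: str) -> str:
    # Replacement is \1 followed by a single space, as the note says.
    return BROKEN_PARA.sub(r"\1 ", html)


def list_candidates(html: str, context: int = 20) -> list[str]:
    # Show some text around each match so it can be reviewed by eye first.
    return [
        html[max(0, m.start() - context): m.end() + context]
        for m in BROKEN_PARA.finditer(html)
    ]


sample = (
    "<p>The quick brown fox</p>\n"
    "<p>jumped over the dog.</p>\n"
    "<p>Next paragraph.</p>"
)
print(list_candidates(sample))
print(join_broken_paragraphs(sample))
```

Note that the pattern needs at least one whitespace character between `</p>` and the next `<p>`, so HTML with paragraphs jammed together on a single line would not match.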

Blossom · 10-19-2011, 02:45 PM

Quote:

Originally Posted by DiapDealer

For broken sentences in HTML, I use the following search regex:

Code:

([^.”":?’'!>—…)])</p>\s+<p[^>]*>

And the replace would be:

Code:

\1

(NOTE: there needs to be a "space" character following the \1 for it to work properly)

I don't trust it enough to blindly do a "Replace All" on a whole book, but I rarely have to intervene when stepping through a document one instance at a time.

I will have to try this when working with code.

What program does this work with? I've tried Notepad++ and Notepad2 and it can't find anything.

DiapDealer · 10-19-2011, 03:04 PM

Quote:

Originally Posted by Blossom

I will have to try this when working with code.

What program does this work with? I've tried Notepad++ and Notepad2 and it can't find anything.

I use it mostly with Sigil and Komodo Edit. I like Notepad++ as a code editor, but it gives me fits when trying to use more complex, multi-line, regex S&R.
