Old 10-14-2011, 09:35 AM   #1
tentimes
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2011
Device: Kindle 4
PDF to Kindle: The unobtainable Holy Grail of ebooks

Hi,

I just wanted to write about my experience (as a programmer and web developer) of spending 3 weeks trying every available method to get a readable version of a PDF book.

It seems that there is no method that can take the bookmark (index) in the pdf book and put it in an ebook format. I find this amazing. Some try, but they are hopeless and make a mess of it.

The best I can get really is Amazon's own conversion service, or similar results using Bliss to take out some pages and then Mobipocket.

Comparing the converted version to the Amazon Kindle version, it seems to me that the two are, in terms of their data, VERY close.

Surely this is a problem someone can get right? The PDF versions of these books contain formatting info and an index that should be convertible.

I've tried about 8 different commercial solutions too so far and they all suck.

I am confused about why there is no program out there that can take the textual information in a PDF book, plus the index (bookmarks), and turn it into an indexed book.

I am considering getting stuck in myself and writing one, but wondered if anyone knew what the problem is and why commercial programs don't/can't do it. Even Acrobat X Pro can only manage a Word format. Text, headings, chapter headings, end of chapter markers, one contents index... open format... why isn't it happening?

I have read many posts and "solutions" on this forum, but none of them have created a book from my pdf book that is any better than just emailing it to Amazon.

Totally confused and hoping to be illuminated. As you can probably tell, I am hugely frustrated after spending weeks on this.
Old 10-14-2011, 09:58 AM   #2
Daveoc64
Groupie
Posts: 169
Karma: 21142
Join Date: Feb 2011
Location: Bristol, UK
Device: Kindle Oasis 3 (LTE)
PDF is simply not intended for the purpose you describe.

PDF was designed to represent how a document looks on paper: each element on the page (text, images, etc.) sits at a fixed position that corresponds to the same position on a physical sheet (e.g. A4).

It is also intended to be an "archive" format, where the PDF is created and then no further editing or converting takes place.

PDF's structure as a file format reflects this, which makes it hard for third-party tools to convert a PDF to another format.

I think that the main problem with the Contents/Bookmarks in PDF is that they simply refer to a page in the PDF.

Once you convert a PDF to a .mobi or .epub file then you lose the ability to link to those pages.
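
To make the page-reference problem concrete, here is a minimal sketch of one possible workaround: carry each bookmark's page number over as an anchor in the reflowed output. Everything here is an assumption for illustration. The function name `build_html_with_toc`, the `(title, page_number)` bookmark pairs and the list of per-page text are hypothetical inputs that some earlier extraction step would have to supply, and a bookmark can only land at the top of the text that came from its page, not at a precise spot within it.

```python
from html import escape


def build_html_with_toc(pages, bookmarks):
    """Build reflowable HTML with a linked table of contents.

    pages:     list of page texts, where index 0 holds page 1 (hypothetical input)
    bookmarks: list of (title, page_number) pairs, 1-based (hypothetical input)
    """
    # Only pages that a bookmark points at need an anchor.
    targets = {page for _, page in bookmarks}

    toc = ["<ul>"]
    for title, page in bookmarks:
        toc.append(f'<li><a href="#p{page}">{escape(title)}</a></li>')
    toc.append("</ul>")

    body = []
    for number, text in enumerate(pages, start=1):
        # The anchor marks where the text of the original page begins.
        anchor = f'<a id="p{number}"></a>' if number in targets else ""
        body.append(f"{anchor}<p>{escape(text)}</p>")

    return "\n".join(toc + body)
```

Calling `build_html_with_toc(["Intro text", "Chapter one text"], [("Chapter 1", 2)])` yields a contents list whose single link points at the paragraph built from page 2. The links stay usable only if the converter keeps track of which output text came from which source page.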

Last edited by Daveoc64; 10-14-2011 at 10:02 AM.
Old 10-14-2011, 10:04 AM   #3
HomeInMyShoes
Grand Sorcerer
Posts: 19,226
Karma: 67780237
Join Date: Jul 2011
Device: none
^Agreed. PDF is not markup that is easily translatable. Your source for conversion should never be PDF, but something less finished as a product.

Welcome to the forum tentimes. I hope you find useful information here and maybe the grail you seek, but I wouldn't be surprised if the vicious bunny of PDF conversion doesn't scare you off first.
Old 10-14-2011, 11:12 AM   #4
jswinden
Nameless Being
 
This whole PDF discussion thing is getting pretty old. Adobe designed PDFs to be printed, not read on E Ink readers. They designed PDFs over 20 years ago for the purpose of being able to exchange secured documents digitally without worrying about unauthorized editing of those documents. For example, a lawyer could send a contract to a client via email. PDFs were never designed for our viewing pleasure!!! True, Adobe has tried to update PDF over the years, but it is still THE WORST form of document for reading on an electronic device.
Old 10-14-2011, 11:23 AM   #5
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by tentimes
I am confused about why there is no program out there that can take the textual information in a PDF book, plus the index (bookmarks), and turn it into an indexed book.
Quite simply because there IS no "textual information" in a PDF document. A PDF document doesn't contain paragraphs, sentences, and words. All that it contains is drawing instructions of the form "draw this shape at these coordinates".

A PDF document is essentially a series of instructions for drawing a picture on a sheet of paper. It's not a book.
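
As a rough illustration of what "drawing instructions" can look like, here is a small sketch that walks a hand-written, uncompressed page content stream and lists each string together with the position it is drawn at. The stream below is a made-up example, not taken from any real book. `Td` (move to the start of a new line) and `Tj` (show a string) are standard PDF text operators, but the parser handles only these two and assumes simple, unescaped strings. Real files usually compress their streams, often use other operators such as `TJ`, and may use font encodings in which the string bytes are not readable characters, so treat this as the simplest possible case.

```python
import re

# Hypothetical, hand-written content stream for one page (uncompressed).
# In a real file this would be bytes, usually compressed.
STREAM = """BT
/F1 12 Tf
72 700 Td
(Chapter 1) Tj
0 -14 Td
(It was a dark and) Tj
0 -14 Td
(stormy night.) Tj
ET"""

# Matches either "(string) Tj" or "tx ty Td".
TOKEN = re.compile(
    r"\((?P<text>[^()\\]*)\)\s*Tj"
    r"|(?P<x>-?\d+(?:\.\d+)?)\s+(?P<y>-?\d+(?:\.\d+)?)\s+Td"
)


def show_text_ops(stream):
    """Return (x, y, text) for each Tj, tracking line starts set by Td."""
    x = y = 0.0
    shown = []
    for match in TOKEN.finditer(stream):
        if match.group("text") is not None:
            shown.append((x, y, match.group("text")))
        else:
            # Td offsets are relative to the start of the current line.
            x += float(match.group("x"))
            y += float(match.group("y"))
    return shown


for x, y, text in show_text_ops(STREAM):
    print(x, y, text)
```

Note that the output is three separately positioned strings. Nothing in the stream says that the second and third belong to one sentence or that the first is a heading, so that structure has to be guessed from coordinates and font sizes.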
Old 10-14-2011, 12:10 PM   #6
gweminence
Fat Guy
Posts: 408
Karma: 24165
Join Date: Jun 2010
Device: Kindle Voyage
PDF as the holy grail of ereaders? No. Just no.

As stated, they're a print format.
Old 10-14-2011, 12:19 PM   #7
jswinden
Nameless Being
 
Quote:
Originally Posted by HarryT
Quite simply because there IS no "textual information" in a PDF document. A PDF document doesn't contain paragraphs, sentences, and words. All that it contains is drawing instructions of the form "draw this shape at these coordinates".

A PDF document is essentially a series of instructions for drawing a picture on a sheet of paper. It's not a book.
Good point. PDF documents are basically page layout documents. The kind of documents we used to create in PageMaker or QuarkXPress. The kind of documents that lend themselves well to brochures, posters, et cetera.

Electronic readers (E Ink, computers, tablets, phones, etc.) work best with reflowable, dynamic text.

PDF documents by nature are page layout, static output typically designed for one specific page size.
Old 10-14-2011, 12:47 PM   #8
Zeebra
Evangelist
Posts: 461
Karma: 956567
Join Date: Oct 2010
Location: Toronto, Canada
Device: Kindle Oasis 3
Quote:
Originally Posted by gweminence
PDF as the holy grail of ereaders? No. Just no.

As stated, they're a print format.
I don't think the original poster is saying PDFs are the holy grail; he's looking for a conversion application that would be the "holy grail" of conversions, one that accurately converts a PDF to a good version of an ebook. No such conversion program exists that doesn't mess things up in some way. PDFs do suck as ebooks.
Old 10-14-2011, 01:04 PM   #9
shinew
Addict
Posts: 309
Karma: 1008082
Join Date: Feb 2009
Location: NYC
Device: Kindle PW, K4 Touch, iPad2, Samsung Galaxy S II
I pretty much read PDFs only on my iPad 2. There won't be any conversion available that replicates a somewhat elaborate PDF layout of the original document as long as the Kindle still uses mobi.
If I were you and read lots of PDFs, I would just get a tablet (a Fire, perhaps).
Old 10-14-2011, 01:07 PM   #10
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
And if you DO want to convert a PDF document to text, use an OCR program such as ABBYY FineReader. It will do a much better job than direct conversion tools.
Old 10-14-2011, 01:22 PM   #11
DiapDealer
Grand Sorcerer
Posts: 27,548
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
PDF is a destination only format. You convert things to PDF... not from PDF. It's primarily a dead-end, one-way street. You can't get there from here. It will never convert well to any other ebook format without a lot of hands-on manual tweaking. You may get lucky here and there on a few documents, but mostly, nothing but misery will come of it.

If you don't have the source documents that the PDF was created from, you had better either A) read it on a device that already works well with PDF documents, or B) get ready to get your hands dirty converting and still end up unhappy with the results. It is what it is. There is no magic.

Last edited by DiapDealer; 10-14-2011 at 01:26 PM.
Old 10-14-2011, 03:09 PM   #12
Blossom
Treasure Seeker
Posts: 18,708
Karma: 26026435
Join Date: Mar 2010
Device: Kobo HD Glo, Kindles, Kindle Fires, Andriod Devices
Quote:
Originally Posted by Zeebra
I don't think the original poster is saying PDFs are the holy grail; he's looking for a conversion application that would be the "holy grail" of conversions, one that accurately converts a PDF to a good version of an ebook. No such conversion program exists that doesn't mess things up in some way. PDFs do suck as ebooks.
Agreed. The best PDF conversions I have gotten come from using Acrobat Pro to convert to HTML 3.2, but the output will have broken sentences. I fix those in Word using regular expressions. If the PDF isn't tagged, though, Acrobat Pro won't convert it. I then have to use Mobipocket Creator, which creates a less clean HTML file, but I work my way from there.
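
For anyone who would rather script the broken-sentence cleanup than do it in Word, here is a minimal sketch of the same idea in Python. The two patterns are assumptions about what "broken" typically looks like (a line break in the middle of a sentence, or a word hyphenated across a line break); they are not the expressions used above, and the function name `rejoin` is made up for this example.

```python
import re

# A word split across lines with a hyphen: "la-\nzy" -> "lazy".
HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\n(?=[a-z])")

# A line break that does not follow sentence-ending punctuation
# and is followed by a lowercase letter: treat it as a soft wrap.
SOFT_BREAK = re.compile(r"(?<![.!?:])\n(?=[a-z])")


def rejoin(text):
    """Join lines that were split mid-sentence; keep real paragraph breaks."""
    text = HYPHEN_BREAK.sub("", text)
    return SOFT_BREAK.sub(" ", text)


sample = "The quick brown\nfox jumped over the la-\nzy dog.\nNext paragraph."
print(rejoin(sample))
```

The hyphen rule will also glue together words that really are hyphenated (for example "well-" at a line end followed by "known"), so the result still needs a proofreading pass.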
Old 10-14-2011, 06:18 PM   #13
Snorkledorf
Blue. Not sad...just blue
Posts: 218
Karma: 1267018
Join Date: Oct 2009
Location: Japan
Device: Ridibooks Paper Pro
My recent tool of choice has been PDFMasher http://www.hardcoded.net/pdfmasher/ which converts PDF to epub and mobi, via Markdown and thus HTML formats. I've been finding the relative simplicity of editing the Markdown file (using BBEdit) to be convenient enough that I've actually converted a half-dozen or so books into mobi, instead of procrastinating on them like I've been doing for ages.

While PDFMasher doesn't seem to retain the bookmarks like the OP wanted, it does have the ability to remove extraneous PDF stuff like headers & footers that would interfere with reflowed text. E.g. you can sort all the elements it finds on all the pages at once, by how high/low they are on the page. The highest/lowest elements are likely to be page numbers and you can select them all and say "Ignore" and they're gone from the output text.
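
The position-based filtering idea can be sketched in a few lines. This is not PDFMasher's code or data model. The `Element` class, the `drop_margins` function and the 8% margin are all assumptions made up for illustration, and the sketch drops elements automatically where the tool described above lets you pick them by hand.

```python
from dataclasses import dataclass


@dataclass
class Element:
    page: int
    y: float   # distance from the bottom of the page, in points
    text: str


def drop_margins(elements, page_height, margin=0.08):
    """Keep only elements outside the top and bottom margin bands."""
    low = page_height * margin
    high = page_height * (1 - margin)
    return [e for e in elements if low <= e.y <= high]


elements = [
    Element(1, 770, "My Book Title"),   # running header
    Element(1, 400, "Body text"),
    Element(1, 30, "17"),               # page number
]
print([e.text for e in drop_margins(elements, page_height=792)])
```

A fixed band like this will also throw away body text on pages where the layout runs close to the edge, which is one reason a manual review step is useful.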

Still takes a lot of massaging to get clean output, but it's progress...
Old 10-15-2011, 06:53 AM   #14
tentimes
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2011
Device: Kindle 4
Is it a certainty that PDF books do not contain a load of text boxes with the actual text still decipherable? As in, I would doubt that it is a bitmap of the font etc. Apologies if I am wrong, but I am going through it now with a hex editor trying to make sense of it.

I thought that, rather than going the whole hog with OCR, we could extract the series of draw-text commands (the text being in boxes) and, with an intelligent interpretation of the paragraphs, reconstruct the book from those.

I know I am new to this, but with the size of the files relative to pages in the book I would be surprised if this wasn't the case.

If it's a matter of a series of text boxes per page, then (assuming the boxes on most pages don't overlap and overprint) it's a matter of taking the text boxes in order, getting the relative font sizes, assuming that large-font text of the form "Chapter XX" marks the start of a chapter if there is no internal byte code to give you the end of a chapter (which I bet there is), and doing a *slightly* (and it really is slightly, in terms of logic) better job of interpreting the structure.
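
The heuristic described above can be sketched roughly as follows. It assumes the hard part is already done: `boxes` is a hypothetical list of `(font_size, text)` pairs in reading order, produced by some extraction step that is not shown. The function name `split_chapters` and the "most common size is body text" rule are also assumptions for illustration.

```python
import re
from collections import Counter

# Text of the form "Chapter 12" or "Chapter IV" at the start of a box.
CHAPTER = re.compile(r"^Chapter\s+(\d+|[IVXLC]+)\b")


def split_chapters(boxes):
    """Group text boxes into chapters using font size plus a title pattern.

    boxes: list of (font_size, text) pairs in reading order (hypothetical input).
    """
    # Assume the most frequently used font size is the body text.
    body_size = Counter(size for size, _ in boxes).most_common(1)[0][0]

    chapters = []
    for size, text in boxes:
        if size > body_size and CHAPTER.match(text.strip()):
            chapters.append({"title": text.strip(), "paragraphs": []})
        elif chapters:
            chapters[-1]["paragraphs"].append(text)
        # Boxes before the first chapter heading (front matter) are skipped.
    return chapters


boxes = [
    (18, "Chapter 1"),
    (10, "First para."),
    (10, "Second para."),
    (18, "Chapter 2"),
    (10, "Third para."),
    (10, "See Chapter 1 for details."),
]
for chapter in split_chapters(boxes):
    print(chapter["title"], len(chapter["paragraphs"]))
```

Requiring both the larger font and the title pattern keeps body sentences that merely mention "Chapter 1" from starting a new chapter, but books that title their chapters differently would need a different rule.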

Can anyone point me in the direction of a good dissection of PDF as a format, please? I think I am going to have a go at this. If I do, then I undertake to make it open source. If I paid for a book once, I'm not paying for it again.
Old 10-15-2011, 06:57 AM   #15
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by tentimes
Is it a certainty that PDF books do not contain a load of text boxes with the actual text still decipherable? As in, I would doubt that it is a bitmap of the font etc. Apologies if I am wrong, but I am going through it now with a hex editor trying to make sense of it.
You are wrong. Most PDF documents do NOT contain text. Some have a text "layer" in them, which DOES contain searchable text - this is generally added by OCR.