05-21-2012, 12:36 PM | #1 |
Zealot
Posts: 125
Karma: 1370
Join Date: Mar 2012
Device: none
|
pdf files to epub OR ibooks
I have a set of PDFs that I need to export to EPUB. The request was actually for this to be done in iBooks Author, but I will get to that after I explain. I'm no EPUB guru; I only know a little about this after doing a bunch of tests.
The client has the files as a PDF, and after looking at it, the PDF consists mostly of text. After splitting the PDF into a few separate PDFs, I began to test:
First in Pages: looking at the .epub file on an iPad, both views are too small; you cannot read the text within the PDF.
Then in iBA (iBooks Author): looking at the .ibooks file on an iPad, same thing here.
With that known, and before I inform the client, I need to know the following. Is there a way to extract elements of a PDF file, convert them into text, and preserve the styling? I don't think so, which means plan B: have the client send the files in the original app they were created in, Microsoft Publisher. That is bad, as I heard that those files can only be opened in THAT app. Do you know of a way to convert that type of file into a Word doc?
Last, let's say the client gets me the files in either Word or another app like InDesign. Am I correct in assuming that the paragraph styles, images, fonts and any other elements must be saved and included when copying the files to either iBA or Pages? If not, do I have to do this from scratch for each page? Anyone? RD |
05-21-2012, 01:44 PM | #2 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
PDF is the worst source document format to convert. Can you get hold of the source documents in their original format? That would be much better. There should be a way to export even Microsoft Publisher files into some other format.
|
05-21-2012, 02:40 PM | #3 |
Zealot
Posts: 125
Karma: 1370
Join Date: Mar 2012
Device: none
|
Well, I figured it's one of two ways. If the client can get me those files in Word or InDesign, then no matter what I have to restyle the files for EPUB. I just finished copying and pasting text from one of the PDF pages I converted to MS Word. The client must understand that this will add a lot of time to the project, as I have to take the background image, make it a template in iBA or Pages (whichever the client chooses), then style each paragraph to get close to the PDF style.
If you know of an easier way, let me know. RD |
05-21-2012, 05:53 PM | #4 | |
Zealot
Posts: 119
Karma: 64428
Join Date: Aug 2011
Device: none
|
Quote:
As stated before, PDF is the WORST format to try to convert from. However, one of the amazing capabilities of the Calibre program is its ability to convert many PDFs. Usually, I have no particular need for the library management and syncing features of Calibre, so I use just the batch convert program, called ebook-convert. It's not very well known, but it is part of the Calibre package.
Use this to convert from PDF to HTML. This is not a useless step, because EPUB is essentially just HTML+CSS. You can check your progress from time to time by looking at the HTML in your browser. Once you have this polished to your satisfaction, convert the HTML to EPUB by opening the HTML directly in Sigil. (Don't panic! This can take a while.) Remember to "save as" an EPUB file. Polish the EPUB a little more with Sigil. Proofread carefully, comparing to the original PDF, and you're done!
The biggest problem with the PDF format is that it typically discards the entire document structure. Even paragraph boundaries are lost. Everything is just glyphs placed at x-y locations on some virtual paper. Even so, the Calibre people have done some amazing things to recognize most paragraphs, although not always perfectly. Graphic elements are saved with links to them in the text. Likewise, styling touches such as italics and bold are handled nicely. Even headings and indents are usually recognized.
However, if there is more than one column of text or a multi-column table, you've got some work to do, because these are converted in the order found in the PDF, that is, linearized. For instance, a two-column table might have all of column 1 converted first into appropriate HTML, followed by all of column 2. There's no way to tell from the PDF that it's even displaying a table. Newspaper columns sometimes have the opposite result: the lines of the columns are interleaved.
One final observation: proofreading takes much more time than any of the converting and editing steps mentioned above.
At least you know you have something to work with that's close to what you want. |
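The ebook-convert step described above can be scripted when there are several PDFs to process. A minimal sketch in Python, assuming calibre's command-line tools are on your PATH; the helper names are made up for illustration, and the ".htmlz" (zipped HTML) output extension is an assumption, since the post only says "HTML". ebook-convert takes an input file and an output file and picks the output format from the output file's extension.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pdf_path: str, out_ext: str = ".htmlz") -> list[str]:
    """Return the ebook-convert argument list for one PDF.

    The output format is chosen by the output file's extension.
    The ".htmlz" default is an assumption, not something from the thread.
    """
    src = Path(pdf_path)
    dst = src.with_suffix(out_ext)  # same name, different extension
    return ["ebook-convert", str(src), str(dst)]


def convert(pdf_path: str) -> bool:
    """Run the conversion if ebook-convert is installed; report success."""
    if shutil.which("ebook-convert") is None:
        return False  # calibre's command-line tools were not found on PATH
    result = subprocess.run(build_convert_command(pdf_path))
    return result.returncode == 0
```

Calling `convert("chapter1.pdf")` would then leave a `chapter1.htmlz` next to the PDF, which can be unzipped and checked in a browser before moving on to Sigil.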
|
05-22-2012, 01:45 AM | #5 |
Zealot
Posts: 121
Karma: 5070
Join Date: Dec 2010
Device: none
|
PDF is a format made not for exchanging data but for displaying it to humans in exactly the way the publisher wants it displayed. At best there is no real formatting at all: headers are just big and bold text, and paragraphs are not bound together but only have margins to other objects.
Calibre produces garbage, as does every other conversion tool. In that case it is better to strip all the CSS rules and rebuild the book from scratch. Even after that, you have split paragraphs that belong together, soft hyphens that were not removed, misinterpreted italics, and so on and so on. The best way of converting is to use an OCR program like ABBYY FineReader or OmniPage, export the text as .doc, open it with Atlantis Word Processor, check all misinterpreted sections, and then convert it to EPUB (Atlantis does have a suitable EPUB export). Even after that you need to check the result with Sigil or another source-based tool. There is NO easy process from PDF to EPUB, regardless of what the tool programmer promises. The very best way is to have another format. |
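The leftover-hyphen problem mentioned above is one of the few that can be partly automated. A rough sketch in Python, assuming the extracted text is available as a plain string; the function name is hypothetical. It deletes soft hyphens (U+00AD) and rejoins words that were split by a hyphen at a line end. The second step can wrongly glue together a real compound that happened to break at a line end, so the result still needs proofreading.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible "optional break" character left by many extractors


def clean_hyphens(text: str) -> str:
    """Drop soft hyphens and rejoin words hyphenated across a line break."""
    text = text.replace(SOFT_HYPHEN, "")
    # "para-\ngraph" -> "paragraph"; hyphens inside a line are left alone
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```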
05-22-2012, 07:05 AM | #6 |
Zealot
Posts: 125
Karma: 1370
Join Date: Mar 2012
Device: none
|
Great info to know, as I'm a web developer who just got a client with a lot of EPUB / iBooks projects.
thx RD |
05-22-2012, 07:43 AM | #7 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
You might want to get clear in your own mind the capabilities of epub vs ibooks.
To my mind, epub is more for e-ink readers whereas ibooks are for tablet computers. Both have their advantages and limitations, but they are not the same, though derived from the basic epub source. It's important to make sure the client does not imagine full-motion video or animation in e-ink books, for example. PDF is a terrible source; even plain text is better because you don't have to rip anything out. As for the text size problem, you may be able to look at the epub in Sigil and see what is making the text small. That problem should be relatively easy to see for a web developer. However, rooting out hundreds or thousands of extra entries will test your patience and sanity. Hopefully regular expressions are part of your toolkit. They can reduce insanity to mere extreme aggravation! |
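As a starting point for the text-size hunt, a short Python sketch that tallies every font-size value found in a chunk of CSS or HTML, so the suspicious ones stand out. The function name and the sample values are made up for illustration, and the regular expression is a simplification, not a CSS parser.

```python
import re
from collections import Counter

# Matches "font-size: <value>" in stylesheets and inline style attributes.
FONT_SIZE = re.compile(r"font-size\s*:\s*([^;}\"']+)", re.IGNORECASE)


def font_size_counts(markup: str) -> Counter:
    """Count how often each font-size value appears in the markup."""
    return Counter(value.strip().lower() for value in FONT_SIZE.findall(markup))


sample = 'p { font-size: 6pt; } <span style="font-size:6pt">x</span> h1{font-size:1.5em}'
counts = font_size_counts(sample)
```

A value that shows up hundreds of times with a tiny absolute size is a likely culprit, and a good first target for a search and replace.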
05-22-2012, 08:25 AM | #8 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
iBooks is not for tablet computers, it is for iPads. Not all tablets are Apple! On other tablets ePUB will work fine. iBooks is limited to Apple products.
iBooks is a derivative of the ePUB standard. |
05-22-2012, 09:06 AM | #9 | |
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
|
Quote:
|
|
05-22-2012, 01:39 PM | #10 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
|
05-27-2012, 10:44 PM | #11 |
Resident Curmudgeon
Posts: 73,668
Karma: 127838212
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
iBooks is also for iPhones, sort of. The stupid margins make it an abomination on an iPhone.
|
05-27-2012, 11:53 PM | #12 | |
Bookmaker & Cat Slave
Posts: 11,448
Karma: 157030631
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
But every other "convert from PDF" program out there produces exactly what Huebi said it does. I opened two files this past week that gave me "bad juju" (in the original sense, not the new urban sense), and exported them out in html, only to be staring at every single word having not one, but two, spans: a span encompassing the first letter of the word (inexplicably--no formatting that would explain that) and another around the remainder of the word. This is invariably the result of attempting to export PDF to either Word or html; I've yet to see any program function differently.
You'll have to be prepared for a boatload of proofing in html or in Word, just for the text issues, and THEN in html, for the garbage-code issues. If you're doing this for yourself, that's fine--but for a client? You need to clean out that code, particularly if you're attempting to format for iBooks, which is notoriously finicky; those rampant wild spans will wreak havoc with any text formatting you intend to add/keep. Be prepared for a LOT of regex. Using iBooks Author won't change any of that. Just my $.02, Hitch |
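The kind of regex cleanup this calls for might look like the sketch below, in Python. It assumes the junk spans carry no attributes at all (plain `<span>...</span>`); the post does not say whether they do, so treat that as an assumption and inspect the real markup first. Spans with a class or style are left untouched, and the function name is hypothetical.

```python
import re

# An attribute-less span whose content holds no further tags.
BARE_SPAN = re.compile(r"<span>([^<]*)</span>")


def unwrap_bare_spans(html: str) -> str:
    """Remove attribute-less spans, keeping the text they wrap."""
    previous = None
    while previous != html:  # repeat so nested bare spans unwrap too
        previous = html
        html = BARE_SPAN.sub(r"\1", html)
    return html
```

Run it on a copy and diff against the original; a regex like this knows nothing about HTML structure, so anything unusual in the export should be checked by hand.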
|
05-28-2012, 12:31 PM | #13 |
Resident Curmudgeon
Posts: 73,668
Karma: 127838212
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
And one other thing you need to be prepared for is that you have to do a 100% A/B comparison to make sure your conversion has not caused any errors in the text. And from a PDF conversion there will be errors in the text that you will have to fix.
|