MobileRead Forums > E-Book Software > Calibre > Conversion
Old 07-18-2012, 07:46 AM   #31
fullybook
Member
Posts: 24
Karma: 505048
Join Date: Jul 2012
Device: Samsung Galaxy Mega 5.8, Kobo Mini

Quote:
Originally Posted by arturox View Post
I have constructed a Word file; it contains text and graphics, and everything is in its place.
I save it out from Word as HTML, then put that through Calibre (latest version, 0.8.59), open it in the Calibre EPUB viewer, and it's a mess.

The mess-up only happens after the Calibre conversion.
Any thoughts please?
I encountered the same problem. I've also tried what you did with the file but to no avail. It was a mess.

What I did was save it as RTF, then convert that to epub. The result was closer to the original formatting that I wanted. If the epub is still a mess, try converting the RTF to lit first, then to epub. Or RTF > mobi > epub.

It usually works.
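
For anyone who wants to script the RTF > mobi > epub chain rather than click through it each time, here is a minimal Python sketch. It assumes calibre's ebook-convert command-line tool is installed and on the PATH. The file name book.rtf and the helper names build_chain and run_chain are made up for illustration.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_chain(source: str, formats: list[str]) -> list[list[str]]:
    """Return one ebook-convert command per hop, e.g. rtf -> mobi -> epub."""
    commands = []
    current = Path(source)
    for fmt in formats:
        # Each hop writes a file with the same stem and the next extension.
        target = current.with_suffix("." + fmt)
        commands.append(["ebook-convert", str(current), str(target)])
        current = target
    return commands


def run_chain(source: str, formats: list[str]) -> None:
    """Run each hop in order, stopping at the first failure."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    for command in build_chain(source, formats):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # "book.rtf" is a placeholder file name; this only prints the commands.
    for command in build_chain("book.rtf", ["mobi", "epub"]):
        print(" ".join(command))
```

Calling run_chain("book.rtf", ["lit", "epub"]) would try the lit route instead. Whether an intermediate format actually improves the result depends on the document, so treat the chain as something to experiment with.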
Old 07-18-2012, 07:49 AM   #32
rocketdocs
rocketdocs developer
Posts: 5
Karma: 10
Join Date: Jul 2012
Location: Ottawa, Canada
Device: iPad
I understand your skepticism and completely agree with you that there is no other tool on the market that can help you convert a PDF to HTML in a consistent and accurate form.

First, let me make myself perfectly clear: our software doesn't fully automate this process as many others try to do, because that's just a futile attempt. We've gone down that road and there are too many variables in play. Also, absolutely positioned HTML elements are just a ridiculous notion.

To your point(s), you can't just take a PDF and run it through some magic tool and expect it to spit out perfect HTML every time, but you can use typography techniques and other algorithms to make sense of all the underlying PDF code and create those "semantic units" as you call them. We've got this working to about 80% accuracy already and it will only get better over time. The other 20% is using our web-based editor to tell our software what those semantics are (i.e. p, h1, ul, footnote, etc.).

Here's an example of a PDF we converted: http://bit.ly/Nzxrmq. It's 126 pages and it might have taken us a day to convert to the WCAG 2.0 compliant HTML and EPUB you see on that page.
Old 07-18-2012, 08:02 AM   #33
kiwidude
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
So you end up with an approximation of the structure of the original document. Now whether your reliability is 80% or 90% or 60%, the simple fact remains that paragraph structure has been lost from a source document that did originally have it. So you are now required to do a line-by-line A/B comparison of the Word document and the resulting HTML to ensure such structure is rectified. Which is no different to any other PDF conversion tool out there. Sure, your algorithms might be better than others, but unless the tool retains paragraph structure 100% of the time there is always that element of additional review/editing of every page that is now required.

I can appreciate that there are users out there for whom "close enough is good enough", but I am not among them, I'm afraid.
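
To make the "approximation" point concrete, here is a minimal Python sketch of the kind of guess a converter has to make when text comes out of a PDF as hard-wrapped lines. This is not the algorithm of any tool mentioned in this thread. The function name rejoin_paragraphs and the 0.9 threshold are assumptions chosen purely for illustration.

```python
from __future__ import annotations


def rejoin_paragraphs(lines: list[str], short_ratio: float = 0.9) -> list[str]:
    """Guess paragraph breaks in hard-wrapped text.

    Heuristic (an assumption, not a standard): a line ends a paragraph if
    it ends with sentence punctuation AND is noticeably shorter than the
    widest line. Blank lines always end a paragraph.
    """
    stripped = [line.strip() for line in lines]
    widest = max((len(line) for line in stripped), default=0)
    paragraphs: list[str] = []
    current: list[str] = []
    for line in stripped:
        if not line:
            # A blank line is an unambiguous break.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        ends_sentence = line.endswith((".", "!", "?"))
        is_short = len(line) < short_ratio * widest
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    sample = [
        "It was a dark and stormy night; the rain fell in",
        "torrents, except at intervals.",
        "The second paragraph starts here and runs on for",
        "a while before it ends.",
    ]
    for paragraph in rejoin_paragraphs(sample):
        print(paragraph)
```

A paragraph whose last line happens to fill the full width is merged into the next one by this heuristic, which is exactly the sort of error that forces the manual review being argued about here.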
Old 07-18-2012, 08:13 AM   #34
rocketdocs
rocketdocs developer
Posts: 5
Karma: 10
Join Date: Jul 2012
Location: Ottawa, Canada
Device: iPad
We have accurately converted thousands of PDF files for the Government of Canada. If you knew anything about the rigorous internal testing a document goes through before it gets posted to their sites, you would see that it's no longer an approximation but 100% reliable once it comes out of our software.

Do you have to do an A/B test against the original document? Sure, but it's not line by line, it's page by page, to make sure all the elements are where they should be. That's a lot better than having to code your own HTML & CSS, isn't it?
Old 07-18-2012, 08:31 AM   #35
AlexBell
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
Quote:
Originally Posted by arturox View Post
I guess it would be worth looking at this problem from a different angle and asking the experienced users here...

If you were starting from scratch with some text and a couple of graphics, how (and in what) would you construct or reconstruct a document that was eventually going to be converted to an Epub file?

Arturo X
For what it is worth I design ebooks for Circaidy Gregory Press.

- I get the print version as a Word document,
- turn it into basic HTML documents (usually one per chapter) with Atlantis,
- turn the documents into valid XHTML 1.1 documents by hand using an HTML editor (I have a standard CSS file which I modify as required),
- put the documents into an ePub template and edit the content.opf and toc.ncx files as required,
- validate the resulting ePub file through FlightCrew and ePubCheck.
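
As a rough illustration of what an ePub template has to get right before a real validator is run, here is a minimal Python sketch that checks the container layout only: the mimetype entry must come first and be stored uncompressed, and META-INF/container.xml must point at an OPF file that exists in the archive. It is no substitute for FlightCrew or ePubCheck. The function names and the OEBPS/content.opf path in the sample are assumptions for illustration, not the layout of any particular template.

```python
from __future__ import annotations

import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"

# Hypothetical container.xml pointing at a hypothetical OPF location.
CONTAINER_XML = (
    '<?xml version="1.0"?>'
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles>'
    "</container>"
)


def check_epub_layout(data: bytes) -> list[str]:
    """Return a list of container-level problems (empty if none found)."""
    problems: list[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = archive.infolist()
        if not infos or infos[0].filename != "mimetype":
            problems.append("mimetype must be the first entry in the archive")
        elif infos[0].compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype must be stored uncompressed")
        elif archive.read("mimetype") != b"application/epub+zip":
            problems.append("mimetype must contain application/epub+zip")

        names = set(archive.namelist())
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")
            return problems

        root = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = root.find(f"{CONTAINER_NS}rootfiles/{CONTAINER_NS}rootfile")
        opf_path = rootfile.get("full-path") if rootfile is not None else None
        if opf_path is None:
            problems.append("container.xml does not name an OPF rootfile")
        elif opf_path not in names:
            problems.append(f"OPF file {opf_path} is not in the archive")
    return problems


def build_sample_epub() -> bytes:
    """Build a tiny in-memory archive with a correct container layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", "<package/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_epub_layout(build_sample_epub()))
```

The sample archive is not a valid book (its OPF is an empty placeholder); it only exercises the container checks. Everything inside content.opf and toc.ncx still needs a proper validator.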
Old 07-18-2012, 08:40 AM   #36
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by rocketdocs View Post
What we found works best is to first convert it to PDF, then convert the PDF to HTML and then to EPUB. Converting to PDF first gets rid of all that junk Word outputs and gives you a better baseline to work with.
Stating this without qualification in the calibre forum makes it sound as if you are advising this workflow using calibre, which is really bad advice.

Quote:
Originally Posted by AlexBell View Post
For what it is worth I design ebooks for Circaidy Gregory Press.

- I get the print version as a Word document,
- turn it into basic HTML documents (usually one per chapter) with Atlantis,
- turn the documents into valid XHTML 1.1 documents by hand using an HTML editor (I have a standard CSS file which I modify as required),
- put the documents into an ePub template and edit the content.opf and toc.ncx files as required,
- validate the resulting ePub file through FlightCrew and ePubCheck.
Sound advice.
Old 07-18-2012, 08:46 AM   #37
kiwidude
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by rocketdocs View Post
Do you have to do an A/B test against the original document? Sure, but it's not line by line, it's page by page, to make sure all the elements are where they should be. That's a lot better than having to code your own HTML & CSS, isn't it?
Well, there we shall agree to disagree. Personally (and I have done it countless times) I find going through 400+ page novels and correcting paragraph structure from a PDF conversion a mind-numbingly dull task that takes far too many hours. All the while trying not to pay too close attention to the text so as to not "spoil" my later enjoyment when I do read the book.

There are other ways of working with Word documents which avoid such a chore that can be done in a fraction of the time. Which is why PDF conversions are best avoided in my opinion. That you guys are having success with it for government documents - well done and great for you. But I *still* don't see it as a recommended general approach.

Anyways I've done my dash on this, appreciate the debate and best of luck to you.
Old 07-18-2012, 10:29 AM   #38
JSWolf
Resident Curmudgeon
Posts: 73,942
Karma: 128903250
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by rocketdocs View Post
@kiwidude - I've read a few of the stickies and posts about the "limitations" of the PDF format, but they are only limitations as it pertains to calibre and not PDF itself.

Don't get me wrong, PDF is a tricky beast to tame, that's for sure, but like I said, we've spent 4 years converting thousands of PDFs to strict HTML standards for the Government of Canada so I'm definitely qualified to make that statement.

I saw posts on here that say column layouts can't be converted or forget tables, they're impossible to extract. We've been doing these for years now with our software.
Not even Adobe Acrobat Pro can convert PDF without errors. So it is not just a Calibre issue. It's the fact that PDF was never designed to be converted to anything else. It was designed so that you could create something in a given program, print it as PDF, and send it to someone else who doesn't have the same program but can still print it out as you intended.