07-18-2012, 07:46 AM | #31 | |
Member
Posts: 24
Karma: 505048
Join Date: Jul 2012
Device: Samsung Galaxy Mega 5.8, Kobo Mini
|
Quote:
What I did was save it as RTF, then convert it to epub. The result was closer to the original format that I wanted. If the epub is a mess, try converting it to lit, then to epub. Or RTF > mobi > epub. It usually works. |
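For anyone who wants to script a chain like RTF > mobi > epub, here is a minimal sketch. It assumes calibre's `ebook-convert` command-line tool is installed and on the PATH; the post does not say which converter was used, and the function names and `book.rtf` are made up for illustration.

```python
import subprocess
from pathlib import Path


def chain_commands(source, formats):
    """Build one ebook-convert command per hop, e.g. rtf -> mobi -> epub."""
    commands = []
    current = Path(source)
    for fmt in formats:
        # Each hop writes next to the input file, changing only the extension.
        target = current.with_suffix("." + fmt)
        commands.append(["ebook-convert", str(current), str(target)])
        current = target
    return commands


def run_chain(source, formats):
    """Run every hop in order; raises CalledProcessError on the first failure."""
    for cmd in chain_commands(source, formats):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_chain() to actually convert.
    for cmd in chain_commands("book.rtf", ["mobi", "epub"]):
        print(" ".join(cmd))
```

Whether an intermediate format helps depends on the document, so it is worth keeping every intermediate file and comparing the results.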
|
07-18-2012, 07:49 AM | #32 |
rocketdocs developer
Posts: 5
Karma: 10
Join Date: Jul 2012
Location: Ottawa, Canada
Device: iPad
|
I understand your skepticism and completely agree with you that there is no other tool on the market that can help you convert a PDF to HTML in a consistent and accurate form.
First, let me be perfectly clear: our software doesn't fully automate this process, as many others try to do, because that's just a futile attempt. We've gone down that road and there are too many variables in play. Also, absolutely positioned HTML elements are just a ridiculous notion. To your point(s): you can't just take a PDF, run it through some magic tool, and expect it to spit out perfect HTML every time, but you can use typography techniques and other algorithms to make sense of all the underlying PDF code and create those "semantic units", as you call them. We've got this working to about 80% accuracy already and it will only get better over time. The other 20% is using our web-based editor to tell our software what those semantics are (i.e. p, h1, ul, footnote, etc.). Here's an example of a PDF we converted: http://bit.ly/Nzxrmq. It's 126 pages and it might have taken us a day to convert to the WCAG 2.0 compliant HTML and EPUB you see on that page. |
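To give a rough idea of what a typography-based heuristic can look like, here is a toy sketch. It is not the rocketdocs algorithm, which is not described in this thread; the function name, the input shape (text plus font size) and the thresholds are all assumptions made for illustration.

```python
from collections import Counter


def classify_runs(runs):
    """Guess a tag for each (text, font_size) run taken from a PDF page.

    Hypothetical heuristic: the most common font size is body text,
    larger sizes are headings, smaller sizes are footnotes.
    """
    if not runs:
        return []
    body_size = Counter(size for _, size in runs).most_common(1)[0][0]
    tagged = []
    for text, size in runs:
        if size >= body_size * 1.5:
            tag = "h1"
        elif size > body_size:
            tag = "h2"
        elif size < body_size:
            tag = "footnote"
        else:
            tag = "p"
        tagged.append((tag, text))
    return tagged


if __name__ == "__main__":
    page = [("Title", 24), ("Body one", 12), ("Body two", 12), ("1. note", 9)]
    for tag, text in classify_runs(page):
        print(tag, text)
```

A guess like this is exactly the part that still needs a human to confirm or correct it, which is the role of the editor step mentioned above.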
07-18-2012, 08:02 AM | #33 |
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
So you end up with an approximation of the structure of the original document. Now whether your reliability is 80% or 90% or 60% - the simple fact remains that paragraph structure has been lost, from a source document that did originally have it. So you are now required to do a line by line A/B comparison of the Word document and the resulting html to ensure such structure is rectified. Which is no different to any other PDF conversion tool out there. Sure your algorithms might be better than others, but unless it is 100% retaining paragraph structure there is always that element of additional review/editing of every page that is now required.
I can appreciate that there are users out there for whom "close enough is good enough" but I am not among them I'm afraid. |
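Part of that A/B check can be mechanised. The sketch below assumes you have already extracted the paragraphs of the source document and of the converted HTML as two lists of strings (that extraction step is not shown, and the function names are made up). It uses Python's standard `difflib` to flag the places where the paragraph structure differs.

```python
import difflib


def normalize(paragraph):
    """Collapse runs of whitespace so line wrapping does not count as a change."""
    return " ".join(paragraph.split())


def paragraph_diff(source_paragraphs, converted_paragraphs):
    """Return (operation, source_slice, converted_slice) for every mismatch."""
    a = [normalize(p) for p in source_paragraphs]
    b = [normalize(p) for p in converted_paragraphs]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    problems = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            problems.append((op, a[i1:i2], b[j1:j2]))
    return problems


if __name__ == "__main__":
    source = ["First para.", "Second para spans two lines."]
    converted = ["First para.", "Second para spans", "two lines."]
    for problem in paragraph_diff(source, converted):
        print(problem)
```

This only narrows down where to look; it does not remove the need to review the flagged passages by hand.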
07-18-2012, 08:13 AM | #34 |
rocketdocs developer
Posts: 5
Karma: 10
Join Date: Jul 2012
Location: Ottawa, Canada
Device: iPad
|
We have accurately converted thousands of PDF files for the Government of Canada. If you knew anything about the rigorous internal testing a document goes through before it gets posted to their sites, you would see that it's no longer an approximation but 100% reliability once it comes out of our software.
Do you have to do an A/B test against the original document? Sure, but it's not line by line, it's page by page, to make sure all the elements are where they should be. That's a lot better than having to code your own HTML & CSS, isn't it? |
07-18-2012, 08:31 AM | #35 | |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
Quote:
- I get the print version as a Word document,
- turn it into basic HTML documents (usually one per chapter) with Atlantis
- turn the documents into valid XHTML 1.1 documents by hand using an HTML editor
- I have a standard CSS file which I modify as required.
- put the documents into an ePub template and edit the content.opf and toc.ncx files as required
- validate the resulting ePub file through FlightCrew and ePubCheck. |
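The packaging step before validation can be sketched as follows. This assumes the ePub template is an unpacked folder that already holds META-INF/container.xml, content.opf, toc.ncx and the XHTML files; the function name and folder layout are illustrative, not the quoted poster's actual setup. The one firm rule it encodes is that the `mimetype` entry must be the first file in the zip and must be stored uncompressed.

```python
import zipfile
from pathlib import Path


def package_epub(template_dir, out_path):
    """Zip an unpacked ePub folder into an .epub file."""
    root = Path(template_dir)
    with zipfile.ZipFile(out_path, "w") as zf:
        # The mimetype entry goes first and is stored, not deflated.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.name != "mimetype":
                zf.write(path, path.relative_to(root).as_posix(),
                         compress_type=zipfile.ZIP_DEFLATED)
    return out_path
```

The resulting file still needs to go through FlightCrew and ePubCheck as described in the list; this sketch only builds the container.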
|
07-18-2012, 08:40 AM | #36 | ||
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
Quote:
|
||
07-18-2012, 08:46 AM | #37 | |
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
There are other ways of working with Word documents that avoid such a chore and can be done in a fraction of the time, which is why PDF conversions are best avoided in my opinion. That you guys are having success with it for government documents - well done and great for you. But I *still* don't see it as a recommended general approach. Anyways, I've done my dash on this. I appreciate the debate, and best of luck to you. |
|
07-18-2012, 10:29 AM | #38 | |
Resident Curmudgeon
Posts: 73,942
Karma: 128903250
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
|