MobileRead Forums - View Single Post - Are things going ahead on the PDF-to-html front?

Lbooker · 03-16-2012, 07:52 AM

Quote:

Originally Posted by JSWolf

There is no way to convert a complex or text based PDF of some decent length without errors.

Reducing the error percentage of these conversions is a rational challenge for the human mind.

Quote:

Originally Posted by roffLOL

I've been busy with studies and work for several months. Lastly I worked on a conversion for a horrible PDF file.

removed link to copyrighted material

I'm pretty sure that when I get a nice result from it the code should be able to handle just about any case with grace. But I'm not there yet. Will try to pull myself together soon.

Great news! Do you know if the other members of the calibre team are still actively pursuing the same effort? And do you plan to use some of the GPL code out there on the web, like this one:
http://pdftohtml.sourceforge.net/
Check his demo! His code manages to convert a complex document nicely.
I just found out pdftohtml is now part of poppler-utils. I played with it and turned a PDF into hundreds of HTML files, but calibre will not accept them as one book. I also turned this PDF into an XML file, but calibre does not accept XML as input.
Well, with the -s and -i options, I managed to create a single HTML file that calibre converted into EPUB, but the outcome is no better than what calibre would have produced directly from the PDF file.
So the problem lies in the conversion from html to epub.
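
For anyone who wants to repeat the experiment, here is a minimal sketch in Python of the two steps described above: pdftohtml with -s and -i, then calibre's ebook-convert command-line tool on the resulting HTML. The function names, the file names, and the guess about what pdftohtml calls its single-file output are my own assumptions, so check them against your installed versions.

```python
import subprocess
from pathlib import Path


def pdftohtml_cmd(pdf_path, out_base):
    """Build the pdftohtml call used in the post.

    -s asks for a single HTML document covering all pages,
    -i tells pdftohtml to ignore images.
    """
    return ["pdftohtml", "-s", "-i", str(pdf_path), str(out_base)]


def ebook_convert_cmd(html_path, epub_path):
    """Build the calibre call: ebook-convert <input file> <output file>."""
    return ["ebook-convert", str(html_path), str(epub_path)]


def find_single_html(out_base):
    """Locate the HTML file pdftohtml wrote.

    Assumption: the single-file output is named either
    '<out_base>-html.html' or '<out_base>.html'. This may differ
    between poppler versions, so both are tried.
    """
    candidates = [Path(f"{out_base}-html.html"), Path(f"{out_base}.html")]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no HTML output found for base name {out_base!r}")


def pdf_to_epub(pdf_path, out_base, epub_path):
    """Run both steps in order; raises if either tool exits with an error."""
    subprocess.run(pdftohtml_cmd(pdf_path, out_base), check=True)
    html_path = find_single_html(out_base)
    subprocess.run(ebook_convert_cmd(html_path, epub_path), check=True)
    return Path(epub_path)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pdf_to_epub("book.pdf", "book", "book.epub")
```

Keeping the intermediate HTML file on disk makes it possible to open it in a browser and compare it with the final EPUB, which is one way to see at which step the layout goes wrong.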