Old 03-26-2012, 01:40 PM   #1
prcek
PDF -> HTML produces bogus PNGs

I'm trying to convert a bunch of PDFs to MOBI, and for some reason the conversion to HTML produces a ton of JPGs but also PNGs, and almost all of those PNGs are bogus (all gray or similar junk). I've tried calibre and other tools, and they all do the same thing. I tried building pdftohtml (the one in poppler) on my Linux box (I gave up on Windows) and found the following code in utils\HtmlOutputDev.cc near line 1200:


void HtmlOutputDev::drawImage(GfxState *state, Object *ref, Stream *str,
                              int width, int height, GfxImageColorMap *colorMap,
                              GBool interpolate, int *maskColors, GBool inlineImg) {

  ...
  ...

  if (dumpJPEG && str->getKind() == strDCT) {
    ...
    ...
  }
  else {
#ifdef ENABLE_LIBPNG
    // Dump the image as a PNG file. Much of the PNG code
    // comes from an example by Guillaume Cottenceau.
    ...
    ...
#else
    OutputDev::drawImage(state, ref, str, width, height, colorMap, interpolate,
                         maskColors, inlineImg);
#endif
  }
}

If I simply disable the ENABLE_LIBPNG section above, all of the images come out as JPGs (and they all look OK), but I've no idea whether this is a reasonable workaround and, if so, how to make a new calibre with this hacked pdftohtml.

Specifically, I use calibre under Win7, and after hours of trying to build poppler on Windows (based on various sets of instructions I found on the web) I've concluded it's currently beyond my ability / patience. It's also not entirely clear to me how to use a custom build of poppler with calibre. That is, let's say some kind soul tells me how to get it built on Win7 (VS2008 or VS2010 would be best, but I also have cygwin, mingw, git and who knows what else installed): do I simply copy pdftohtml.exe to the calibre directory, or is there more to it than that?

Any pointers would be greatly appreciated - thanks!

PeterK
Old 03-26-2012, 01:55 PM   #2
kovidgoyal
creator of calibre
Look at windows_notes.rst in the calibre source to learn how to build poppler and everything else calibre depends on. Then simply replace pdftohtml.exe in the calibre install directory with your version.
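A minimal sketch of the swap step, for anyone who wants to script it: back up the stock binary once, then copy the custom build over it. The function name replace_with_backup and the ".bak" suffix are made up for illustration, and both paths are passed in by the caller because the install location varies from machine to machine. It assumes a C++17 compiler for std::filesystem.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: replace `installed` with `custom`, keeping the
// original next to it as "<name>.bak". Returns true on success.
// Nothing here knows where calibre lives; the caller supplies both paths.
bool replace_with_backup(const fs::path& custom, const fs::path& installed) {
    std::error_code ec;

    // Both sides must already exist as regular files.
    if (!fs::is_regular_file(custom, ec) || !fs::is_regular_file(installed, ec)) {
        return false;
    }

    fs::path backup = installed;
    backup += ".bak";

    // Only back up once, so a second run cannot overwrite the stock
    // binary's backup with an earlier custom build.
    if (!fs::exists(backup, ec)) {
        fs::copy_file(installed, backup, ec);
        if (ec) return false;
    }

    // Overwrite the installed binary with the custom build.
    fs::copy_file(custom, installed, fs::copy_options::overwrite_existing, ec);
    return !ec;
}
```

It would be called with the freshly built pdftohtml.exe as the first argument and the copy inside the calibre install directory as the second. Restoring the stock version is just copying the .bak file back.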
Old 03-27-2012, 01:17 AM   #3
prcek
Thanks for the pointer; I am now able to build pdftohtml.exe on Windows, but to my chagrin it looks like Calibre is using pdftoxml.exe instead, and I don't see any way of building that using the VS2008 solution produced by CMake. Any suggestions?

Thanks again!

PeterK
Old 03-27-2012, 01:25 AM   #4
prcek
Oops, I invoked MobiPocketCreator by mistake; that's why I saw pdftoxml. That was pretty dumb of me, sorry! It would be nice to be able to build pdftoxml as well, but maybe it's not even open source. Anyway, I've invoked the right program and now it's crashing with my custom pdftohtml, so I have something to debug.

Thanks
PeterK
Old 03-29-2012, 07:34 PM   #5
prcek
I finally figured out more about what's going on. It appears that some of these books have quite a few pages where DCT and deflate images sit next to (or maybe even overlay) each other to provide shading around the DCT/JPEG pictures, and when these deflate images are saved as PNG files the resulting xml/html is severely messed up. I've tried every PDF->html/xml tool I could find, and they all seem to have more or less the same problem: the result might be mangled slightly differently, but it’s basically unusable in all cases.

My “fix” in pdftohtml worked only because it completely ignored all of these little deflate / PNG images (the call to OutputDev::drawImage for them turns out to be a no-op).

The obvious question now is: what is the easiest way to fix this? I know nothing about the layout code, so I’ve no clue how easy or hard it might be to glue these images together correctly (i.e. the way they’re supposed to be shown, and do show in Acrobat), or even to somehow ignore them automatically.
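One rough idea for the "ignore them automatically" option, purely as a sketch: treat an image as shading junk when all of its samples are nearly the same value, and skip it. This is an assumption about what the bogus images look like (flat gray), not anything pdftohtml does; the name is_nearly_uniform and the tolerance parameter are hypothetical, and real shading gradients would slip past such a check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical filter: returns true when every 8-bit sample lies within
// `tolerance` of every other sample, i.e. the image is one flat shade.
// An empty buffer is treated as flat.
bool is_nearly_uniform(const std::vector<std::uint8_t>& samples, int tolerance) {
    assert(tolerance >= 0);
    if (samples.empty()) return true;

    // One pass finds the darkest and brightest sample.
    auto mm = std::minmax_element(samples.begin(), samples.end());
    int spread = static_cast<int>(*mm.second) - static_cast<int>(*mm.first);
    return spread <= tolerance;
}
```

A caller would decode each non-JPEG image to raw samples first and drop the ones for which this returns true. Picking the tolerance would need experimenting on real pages.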

Any ideas?

BTW, if it would help to have a sample page (that exhibits the problem) to look at, just let me know how / where to post it – it’s trivial to extract just one sample page into a tiny PDF.
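To make the no-op point above concrete, here is a small self-contained model of the dispatch in drawImage. None of these types are poppler's: StreamKind, OutputDevSketch and HtmlOutputDevSketch are stand-ins invented for illustration, the pngEnabled flag plays the role of ENABLE_LIBPNG, and the dumpJPEG condition from the real code is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the stream kinds; only the two relevant here are modelled.
enum class StreamKind { DCT, Flate };

struct OutputDevSketch {
    virtual ~OutputDevSketch() = default;
    // Base-class handler does nothing, like the fallback described above.
    virtual void drawImage(StreamKind) {}
};

struct HtmlOutputDevSketch : OutputDevSketch {
    bool pngEnabled;                   // models #ifdef ENABLE_LIBPNG
    std::vector<std::string> written;  // extensions of files "written"

    explicit HtmlOutputDevSketch(bool png) : pngEnabled(png) {}

    void drawImage(StreamKind kind) override {
        if (kind == StreamKind::DCT) {
            written.push_back("jpg");  // JPEG data is dumped as a JPG
        } else if (pngEnabled) {
            written.push_back("png");  // everything else becomes a PNG
        } else {
            // Falls through to the base class, which is a no-op,
            // so the image is silently dropped.
            OutputDevSketch::drawImage(kind);
        }
    }
};
```

With the PNG branch disabled, a page holding two DCT images and one deflate image yields two JPGs and nothing else, which matches what the hacked build produced.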

Thanks
PeterK