DJVU to ePub best results?


Begemot
03-09-2010, 05:29 AM
What methods have you used to convert DJVU to ePub?

The current method I am using is as follows:
Open DJVU with DjVuLibre DjView 4.4
Export as PDF
Then add the PDF to the Calibre library and convert it to ePub (this basically runs the pdf2html utility)
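
The steps above can also be run from a script instead of through the GUIs. This is only a minimal sketch, assuming DjVuLibre's `ddjvu` tool (built with PDF output support) and Calibre's `ebook-convert` tool are installed and on the PATH. The function names and `book.djvu` are made up for illustration. It follows the same DjVu -> PDF -> ePub route, so it will most likely show the same size and quality problems.

```python
import subprocess
from pathlib import Path


def build_pipeline(djvu_path):
    """Return the two commands: DjVu -> PDF, then PDF -> ePub.

    Assumes DjVuLibre's `ddjvu` and Calibre's `ebook-convert`
    are available on the PATH.
    """
    src = Path(djvu_path)
    pdf = src.with_suffix(".pdf")
    epub = src.with_suffix(".epub")
    return [
        ["ddjvu", "-format=pdf", str(src), str(pdf)],
        ["ebook-convert", str(pdf), str(epub)],
    ]


def convert(djvu_path):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_pipeline(djvu_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them.
    for cmd in build_pipeline("book.djvu"):
        print(" ".join(cmd))
```

Calling `convert("book.djvu")` would write `book.pdf` and `book.epub` next to the source file.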

The problem is the conversion to PDF step.
A 4MB DJVU explodes into a 35MB PDF!
The PDF to ePub step then brings it down to 20-25MB, but the results are less than stellar.
A slightly smaller problem is that running pdf2html on a 35MB PDF takes about half an hour, but I could live with that if the quality were good.

All of this was done on Ubuntu 9.10, but I would be interested in hearing about DJVU to ePub solutions on Windows or Mac as well.

charleski
03-09-2010, 10:04 AM
DjVu doesn't contain text, it works on images, and that's your problem. It uses an image-compression technology that's highly optimised for text and allows far smaller file sizes than other formats that target more general image types.

You could export the images and OCR them (lots of work to catch the errors), or you could try slapping the images all together as-is (which is what you describe above, possibly with some down-ressing to make it look even worse). Go back to the author and get a file that contains text, because DjVu is useless for your purpose.

Websterny
02-07-2011, 04:44 PM
I know nothing about this topic, but I have the same problem, and I note that DJVU files do seem to contain text - that is, the documents are searchable. They can be converted to non-searchable PDF with the print command (assuming you have a PDF print driver installed). But this significantly detracts from the utility of the files. And their size is a multiple of the original DJVU. There has got to be a better way.

BobC
02-07-2011, 05:16 PM
charleski wrote:
> DjVu doesn't contain text, it works on images, and that's your problem. It uses an image-compression technology that's highly optimised for text and allows far smaller file sizes than other formats that target more general image types.

DjVu files can contain a hidden text layer (which is used by the search feature). This layer can be extracted and used as the basis for any other conversion.

For example most of the DJVU files on The Internet Archive (TIA) contain such a layer and I have used them as a basis for FB2 books.

Of course, the files the OP is working on may not have such a layer, as the original text may not have been OCR'd and associated with the image layer.

BobC

pholy
02-07-2011, 09:59 PM
BobC - Can you tell us how to extract that hidden text layer? I haven't run across any DJVU books that I recall, but it would be good to know how to convert them when possible.

BobC
02-09-2011, 05:55 PM
pholy wrote:
> BobC - Can you tell us how to extract that hidden text layer?

Either highlight the text in the image view and press Ctrl+C to copy it to the clipboard, or use the "Export Text" feature in WinDjView or a similar viewer.

BobC
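
For many files, exporting through a viewer gets tedious. A scripted alternative is sketched below, assuming DjVuLibre's `djvutxt` command-line tool is installed and on the PATH. The function names are hypothetical, and the `--page=` spelling of the page option should be checked against the man page for your installed version. If the file has no text layer, there is nothing for the tool to extract.

```python
import subprocess


def djvutxt_command(djvu_path, txt_path, page=None):
    """Build a djvutxt call that writes the hidden text layer to txt_path.

    `page` is an optional page spec such as "1-10". The --page= spelling
    is an assumption; verify it against your djvutxt man page.
    """
    cmd = ["djvutxt"]
    if page is not None:
        cmd.append(f"--page={page}")
    cmd += [str(djvu_path), str(txt_path)]
    return cmd


def export_text(djvu_path, txt_path, page=None):
    """Run djvutxt; raises CalledProcessError if the tool reports failure."""
    subprocess.run(djvutxt_command(djvu_path, txt_path, page), check=True)
```

For example, `export_text("book.djvu", "book.txt")` would dump the whole text layer, and passing `page="1-10"` would limit it to the first ten pages.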

Begemot
02-11-2011, 03:53 AM
OP here. I resorted to using Export Text in WinDjView.

This gets you a text dump with no formatting whatsoever. For my Libre it works well enough, but in general, this procedure is suboptimal.

Most DJVU files do seem to have a text layer (unless there is some on-the-fly OCR happening when you select an area on the page, which seems unlikely).

Thus, there must be a way (at least theoretically, until someone writes a converter) to preserve the formatting in the text layer.

bugmen0t
02-11-2011, 06:29 AM
I just print from any DjVu reader to a PDF printer, like PrimoPDF or similar. The only thing is that a 240-page book, originally 4.5MB, was transformed into 40MB. Maybe trimming the quality of the PDF down...

BobC
02-12-2011, 11:04 AM
Begemot wrote:
> OP here. I resorted to using Export Text in WinDjView.

> This gets you a text dump with no formatting whatsoever. For my Libre it works well enough, but in general, this procedure is suboptimal.

> Most DJVU files do seem to have a text layer (unless there is some on-the-fly OCR happening when you select an area on the page, which seems unlikely).

> Thus, there must be a way (at least theoretically, until someone writes a converter) to preserve the formatting in the text layer.

I can assure you that the text layer is just that - text; its purpose is simply to provide the search capability. There is no formatting, and in many books there are OCR misreads.

If you want to understand DJVUs then you need to get the spec and study it. I've done quite a bit of work adding TOCs to existing DJVUs and have converted a couple of books to FB2 - this involves manually proofreading and correcting the dumped text, then formatting it to match the original (italics, bold, etc.).

Don't expect too much out of what is a by-product of the search function.

BobC
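
Proofreading and restoring italics or bold still has to be done by hand, but the first mechanical pass over a text dump can be scripted. The sketch below is only an illustration. It assumes paragraphs in the dump are separated by blank lines, that a line ending in a hyphen is a word split across two lines, and that any form feed characters mark page breaks. None of these assumptions is guaranteed for a given file, and `tidy_text_dump` is a made-up name.

```python
import re


def tidy_text_dump(raw):
    """Turn a raw text dump into a list of paragraph strings.

    Assumptions: paragraphs are separated by blank lines, and a line
    ending in a hyphen continues a word broken across two lines.
    """
    # Normalize line endings; treat form feeds (assumed page breaks) as blank lines.
    text = raw.replace("\r\n", "\n").replace("\f", "\n\n")
    # Rejoin words hyphenated across a line break: "conver-\nsion" -> "conversion".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse the remaining line breaks and runs of spaces.
        flat = " ".join(block.split())
        if flat:
            paragraphs.append(flat)
    return paragraphs
```

Note that the hyphen rule will also join genuine compounds that happen to break at a line end, and OCR misreads pass through untouched, so the result still needs a manual read.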

DaleDe
02-12-2011, 03:10 PM
There is a description of DJVU in the wiki.