DJVU: Extract number of pages


JensW
07-06-2013, 04:29 AM
Hello everyone,
For the k2pdfopt GUI I'm developing, I need to find out the number of pages in a DjVu file. Currently I do this by counting the occurrences of the "DJVU" string in the file, which works well. The problem is that I need to load the whole file and iterate through it, which takes a lot of both memory and time for large files.

I know that there is the DIRM directory at the beginning of the file, which "should" contain the number of files or pages in the document according to the format specifications.
The DIRM string should be followed by one byte of flags and then an INT16 containing the number of pages. However, I cannot see that in the file. The only occurrence that makes a little sense is bytes 6/7 (or 7/8 depending on big/little endian), which contain an INT16 that is at least close to the page count (but not exactly equal).

Can someone tell me where my error lies?

As an example, here are the first 24 bytes, starting with the DIRM string, for a file with 57 pages:

D I R M
44 49 52 4D 00 00 02 0F 81 00 3C 00 00 02 D2 00 00 1B 7A 00 00 47 30 00


As you can see, bytes 2/3 after the DIRM contain the number 512, and 7/8 contain the number 60, which is close but not quite there.

I really don't understand this.

You can find the format specs here: https://github.com/barak/djvulibre/tree/master/doc

Thank you in advance, any help is greatly appreciated!

- Jens


PS: I found an example, but I'm really terrible at C++. Could someone tell me what exactly this does? How do the << and + operators work on the char datatype in C++?


unsigned int
ByteStream::read16()
{
  unsigned char c[2];
  if (readall((void*)c, sizeof(c)) != sizeof(c))
    G_THROW( ByteStream::EndOfFile );
  return (c[0]<<8)+c[1];
}

rkomar
07-06-2013, 01:21 PM
"<<8" shifts left by 8 bits, which is the same as multiplying by 256. So, the function returns 256*c[0] + c[1], i.e. the program is interpreting those two bytes as being uint16_t in big endian order.

In C++, operands of type unsigned char are promoted to int (at least 16 bits, 32 bits on most platforms) before the shift and the addition take place. So the temporary c[0]<<8 and c[1] values are not truncated back to 8 bits (the size of unsigned char), and the sum is then returned as an unsigned int.
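
To make the arithmetic concrete, here is a minimal sketch that applies the same big-endian decoding to the bytes posted above. It rests on one assumption that is not stated in the thread: DjVu files use IFF-style chunks, so the four bytes right after the "DIRM" ID would be a big-endian chunk length, and the flags byte and the INT16 would only come after that. DirmHeader, parse_dirm_header and kExample are hypothetical names made up for this illustration; they are not part of DjVuLibre.

```cpp
#include <cstdint>

// Combine two bytes in big-endian order, like ByteStream::read16() above.
// The cast makes the widening explicit; without it the unsigned char
// operands would still be promoted to int before << is applied.
inline unsigned int read16_be(const unsigned char* p)
{
    return (static_cast<unsigned int>(p[0]) << 8) | p[1];
}

// Same idea for four bytes.
inline std::uint32_t read32_be(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// Hypothetical helper struct, for illustration only.
struct DirmHeader {
    std::uint32_t chunk_length; // assumed IFF chunk length after the ID
    unsigned char flags;        // the one byte of flags
    unsigned int  count;        // the INT16 that follows the flags
};

// Assumes p points at the 'D' of "DIRM" and that at least
// 11 bytes are readable from there.
inline DirmHeader parse_dirm_header(const unsigned char* p)
{
    DirmHeader h;
    h.chunk_length = read32_be(p + 4); // bytes 4..7
    h.flags        = p[8];             // byte 8
    h.count        = read16_be(p + 9); // bytes 9..10
    return h;
}

// First 11 bytes of the dump posted above.
static const unsigned char kExample[] = {
    0x44, 0x49, 0x52, 0x4D, // "DIRM"
    0x00, 0x00, 0x02, 0x0F, // 0x020F = 527
    0x81,                   // flags
    0x00, 0x3C              // 0x003C = 60
};
```

Under that assumption, parse_dirm_header(kExample) gives a chunk length of 527, flags 0x81 and a count of 60. As far as I read the format specs, that INT16 counts the component files in the document, which can include shared data and thumbnails as well as pages; that would explain 60 versus 57, but please check it against the spec linked above before relying on it.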