DJVU: Extract number of pages

JensW · 07-06-2013, 04:29 AM

Hello everyone,
for the GUI I am developing for k2pdfopt, I need to find out the number of pages in a DjVu file. Currently I do this by counting the occurrences of the "DJVU" string in the file, which works well. The problem is that I need to load the whole file and iterate through it, which takes a lot of both memory and time for large files.

I know that there is the DIRM directory at the beginning of the file which "should" contain the number of files or pages in the document according to the format specifications.
The DIRM string should be followed by one byte of flags and then an INT16 containing the number of pages. However, I cannot see that in the file. The only occurrence that makes some sense is bytes 6/7 (or 7/8, depending on big/little endian), which contain an INT16 at least in the region of the page count (but not exactly).

Can someone tell me where my error lies?

As an example, here are the first 24 bytes, starting with the DIRM string, of a file with 57 pages:

Code:

D  I  R  M
44 49 52 4D 00 00 02 0F 81 00 3C 00 00 02 D2 00 00 1B 7A 00 00 47 30 00

As you can see, bytes 2/3 after the DIRM contain the number 512 and bytes 7/8 contain the number 60, which is close but not quite there.

I really don't understand this.

You can find the format specs here: https://github.com/barak/djvulibre/tree/master/doc

Thank you in advance, any help is greatly appreciated!

- Jens

PS: I found an example, but I'm really terrible at C++. Could someone tell me what exactly this does? How do the << and + operators work on the char data type in C++?

Code:

unsigned int 
ByteStream::read16()
{
  unsigned char c[2];
  if (readall((void*)c, sizeof(c)) != sizeof(c))
    G_THROW( ByteStream::EndOfFile );
  return (c[0]<<8)+c[1];
}

rkomar · 07-06-2013, 01:21 PM

"<<8" shifts left by 8 bits, which is the same as multiplying by 256. So, the function returns 256*c[0] + c[1], i.e. the program interprets those two bytes as a uint16_t in big-endian order.
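
The same computation can be sketched on a plain byte buffer, without the ByteStream class. `read_be16` is a hypothetical helper name used only for this illustration.

```cpp
#include <cstdint>

// Combine two bytes into a 16-bit value, most significant byte first
// (big endian). Hypothetical helper; not part of DjVuLibre.
inline std::uint16_t read_be16(const unsigned char* p)
{
    // p[0] is promoted to int before the shift, so no bits are lost.
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

For the byte pair 00 3C this yields 60, whereas the reversed pair 3C 00 yields 15360, which is why the byte order matters when reading the dump.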

In C++, unsigned char operands are promoted to int before the shift and the addition (integer promotion), and int is 32 bits wide on typical platforms. The temporary c[0]<<8 and c[1] values are therefore not truncated back to 8 bits (the size of unsigned char) before the sum is returned as an unsigned int.
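
Applying the big-endian reading to the dump from the first post: the sketch below assumes that DIRM follows the usual IFF chunk layout, in which the 4-byte chunk ID is followed by a 4-byte big-endian length before the chunk data begins. Under that assumption, 00 00 02 0F would be the chunk length, 81 the flags byte, and 00 3C the INT16. `DirmHeader` and `parse_dirm_header` are hypothetical names, not DjVuLibre API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical struct for illustration only.
struct DirmHeader {
    std::uint32_t chunk_length;  // assumed 4-byte big-endian chunk length
    std::uint8_t  flags;         // first byte of the chunk data
    std::uint16_t file_count;    // big-endian INT16 following the flags
};

// Parse the 11 bytes starting at the "DIRM" chunk ID.
// Returns false if the buffer is too short or the ID does not match.
bool parse_dirm_header(const unsigned char* p, std::size_t n, DirmHeader& out)
{
    if (n < 11 || std::memcmp(p, "DIRM", 4) != 0)
        return false;
    out.chunk_length = (std::uint32_t(p[4]) << 24) | (std::uint32_t(p[5]) << 16)
                     | (std::uint32_t(p[6]) << 8)  |  std::uint32_t(p[7]);
    out.flags      = p[8];
    out.file_count = static_cast<std::uint16_t>((p[9] << 8) | p[10]);
    return true;
}
```

For the posted bytes this gives a length of 527, flags of 0x81, and a count of 60. A count of 60 rather than 57 would be consistent with the directory counting files rather than only pages (the first post describes the field as "files or pages"), but the dump alone does not confirm that.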

