IIRC, it's actually 1024 bytes of the uncompressed HTML file that make up a page. The idea is to be able to compute the number of pages solely from the zip headers, without having to look inside (i.e. uncompress) the HTML files, and consequently to be able to navigate to a given page by opening only the corresponding HTML file. Zip headers record the uncompressed size, so there is no reason to use the compressed size; in fact, once you go "inside" an HTML file, you're dealing with uncompressed bytes anyway. The approximation works very well; after all, two paper editions of the same text can have very different page counts, so there is no single truth if you have to pick one (and of course EPUB also supports explicit page numbers, to match a given paper edition). Even bytes vs. characters is not much of a problem in practice (although you can fool it by using NCRs in a UTF-32-encoded file to get 32 bytes per displayed character: "&#x20AC;" is 8 characters per NCR at 4 bytes per character!)
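To make that concrete, here is a minimal sketch of the idea (my own illustration, not any particular reader's implementation), assuming roughly 1024 uncompressed bytes of HTML per page and using Python's standard zipfile module; the function name estimate_pages and the per-page constant are just placeholders:

    import zipfile

    def estimate_pages(epub_path: str, bytes_per_page: int = 1024) -> int:
        """Estimate a page count from the zip central directory alone."""
        total = 0
        with zipfile.ZipFile(epub_path) as zf:
            for info in zf.infolist():
                # file_size is the *uncompressed* size recorded in the zip
                # header, so no HTML entry ever has to be inflated.
                if info.filename.lower().endswith((".html", ".xhtml", ".htm")):
                    total += info.file_size
        # Round up so even a tiny chapter counts as at least one page.
        return max(1, -(-total // bytes_per_page))

The point is that everything needed for the estimate already sits in the central directory, which is why the calculation is cheap enough to do on device when a book is opened.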
Kindle took a different and more complicated approach. In particular, it accounts for images better. Also, remember that the notion of a page comes into play when paying authors, so it needs to be more accurate; and since Amazon sits between practically every author and every reader, it can afford to spend some time determining page numbers.