How does the calibre viewer calculate page number and total pages?

auspex · 09-15-2014, 09:48 PM

I'm working on a port of davidfor's Kobo Utilities to Sony, and trying to find a reasonable way to find my position in a book that I'm currently reading. Sony doesn't make it easy. If you downloaded a book from Sony's store, or since they sold out, from Kobo, they maintain a table that gives amongst other things "percent read", but if your book is sideloaded, Sony appears to calculate the number of pages and your current page number on the fly, and it's never saved in its database (and of course they don't tell US how they do it).

So, I'm trying to figure out how the calibre viewer calculates these numbers, and can't find the code anywhere.

kovidgoyal · 09-16-2014, 12:00 AM

iterator/book.py

davidfor · 09-16-2014, 12:35 AM

For epubs on the Sony devices, the number of pages will be calculated by the Adobe RMSDK. And the current page will be based on that. The description of the method is in the Wiki, but have a look at the Count Pages plugin for an implementation.

To calculate a percent read, what Kovid pointed to should work. You will also need the current position from the database on the Sony. From memory, this is stored in an Adobe-specific way. I assume it comes from the RMSDK, as the Kobos use it for epubs as well. The calibre viewer uses a different position method (the same as for epub3?). I don't know if there is already a way to translate between them, but it shouldn't be too hard*.

From memory, iterator/book.py has to unpack the book to work. That means calculating the percent read could take some time. For one book, it shouldn't be too bad, but if you are doing it for all the books on the device, it might take a while. I suppose that should only happen once, when the store positions function is first run.

* Imagine me laughing maniacally while I typed that.

auspex · 09-16-2014, 09:10 AM

As far as I can tell, the only thing that Sony stores for position of sideloaded books is the bookmark. Which is like an EPUB3 CFI, but not identical (nor is calibre's), but is easily (so much for maniacal laughter!) translated to the calibre format (they're closer to each other than to EPUB3). FWIW, EPUB3 counts nodes (text nodes + tags) while calibre/Sony seem to count only tags, with the significant difference that Sony's CFIs don't count the <HEAD> tag.

So, in any case, it's going to have to open the book to calculate the position.

Thanks for the answers. Now, off to try some more stuff!

davidfor · 09-16-2014, 09:55 AM

Hmm, you're right, it is easy. It's been a while since I compared the two methods. And not counting the head tag has always bugged me when I've looked at this.

With that, it would be easy to put the reading position or bookmarks into the epub for the viewer.

auspex · 09-16-2014, 11:31 AM

That's what I was thinking.

auspex · 09-16-2014, 10:06 PM

iterator/book.py calculates the total number of pages. I'm still not seeing anything that translates a bookmark into a current page number.

kovidgoyal · 09-16-2014, 11:21 PM

There is nothing that translates a bookmark into a page number. A page number is simply defined as

(number of pages of current html file * frac of file scrolled)/(total number of pages of all current html files)
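A minimal sketch of that arithmetic in Python, for anyone following along. The function name, the argument names, and the list of per-file page counts are hypothetical; they are not calibre's API. The sketch also adds the pages of the files that come before the current one in the spine, which is an assumption on top of the expression quoted above, made so that the result is a position in the whole book.

```python
def position_in_book(page_counts, spine_index, frac_scrolled):
    """Return (page_number, fraction_read) for a position in a book.

    page_counts   -- pages per HTML file, in spine order (hypothetical input)
    spine_index   -- index of the HTML file currently open
    frac_scrolled -- fraction of that file scrolled so far, between 0.0 and 1.0
    """
    if not 0 <= spine_index < len(page_counts):
        raise IndexError("spine_index is outside the spine")
    if not 0.0 <= frac_scrolled <= 1.0:
        raise ValueError("frac_scrolled must be between 0.0 and 1.0")
    total_pages = sum(page_counts)
    if total_pages == 0:
        raise ValueError("book has no pages")

    # Assumption: every file before the current one counts as fully read.
    pages_before = sum(page_counts[:spine_index])
    # Pages of the current file times the fraction scrolled, as in the post.
    page_number = pages_before + page_counts[spine_index] * frac_scrolled
    return page_number, page_number / total_pages


page, fraction = position_in_book([10, 20, 30], spine_index=1, frac_scrolled=0.5)
print(f"page {page:.1f} of 60, {fraction:.1%} read")
```

The page counts themselves would still have to come from somewhere, such as the iterator mentioned earlier in the thread.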

If you are asking how the viewer scrolls to a bookmark, look at cfi.coffee

And note that EPUB 3 CFI does not count text nodes. It makes no sense to count text nodes, since:

1) Text nodes can be normalized by the renderer
2) Offsets as numbers of characters in the terminal tag are recorded in the CFI in any case, making counting text nodes totally useless.

What the EPUB CFI spec does is assign odd numbered indices to represent the text between tags regardless of how many actual text nodes there are. So tags are always even numbered.
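The even-numbered tag indices can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: `cfi_steps` is a hypothetical helper, it handles element steps only (no character offsets, no ID assertions, no spine part of a full CFI), and it is not how calibre, Sony, or any reader actually builds its bookmarks.

```python
import xml.etree.ElementTree as ET


def cfi_steps(root, target):
    """Return the element steps from root down to target, e.g. '/4/2/4'.

    The Nth child element of a parent gets the even index 2*N; the odd
    indices in between are left for the text around the tags.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        position = list(parent).index(node)  # 0-based among element children
        steps.append(2 * (position + 1))     # tags are always even numbered
        node = parent
    return "".join(f"/{step}" for step in reversed(steps))


doc = ET.fromstring(
    "<html><head><title>t</title></head>"
    '<body><h1 id="heading_id_2"><a id="page10"/><img alt="" src="x.jpg"/></h1>'
    "<p>text</p></body></html>"
)
img = doc.find(".//img")
print(cfi_steps(doc, img))
```

Here body is the second child of html, h1 the first child of body, and img the second child of h1, so the steps are built from 4, 2 and 4 in that order.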

auspex · 09-23-2014, 03:36 PM

Quote:

Originally Posted by kovidgoyal

There is nothing that translates a bookmark into a page number. A page number is simply defined as

(number of pages of current html file * frac of file scrolled)/(total number of pages of all current html files)

If you are asking how the viewer scrolls to a bookmark, look at cfi.coffee

Well, I wasn't asking how it scrolls; I was asking how you calculate the current page number, but I was afraid that was the answer. Which means I have to do it myself, as there's nothing I can call in calibre to do it. Still, it's not as awkward as I was thinking, as I can see the spine and the page counts in the iterator (though it's confusing that it's called an iterator when it doesn't meet the Python definition of an iterator...)

I guess it's fortunate that I lucked into a poorly formatted page on my first test. The calibre viewer and the Sony bookmark had similar pointers into this structure (.../2[heading_id_2]/4@4.9:0 and .../2/4:1, respectively)

Code:

<h1 class="part" id="heading_id_2">
  <a id="page10"/>
  <img alt="" src="../Images/Wint_9781594745775_epub_001_r1.jpg"/>
</h1>

Of course it makes no sense to have a self-closing anchor tag, and both the BeautifulSoup and BeautifulStoneSoup parsers parse it as:

Code:

<h1 class="part" id="heading_id_2">
  <a id="page10">
    <img alt="" src="../Images/Wint_9781594745775_epub_001_r1.jpg"/>
  </a>
</h1>

It's not going to make any real difference what result I get for this page, but it demonstrates a very real problem, and I'm not sure how to work around it.

Quote:

And note that EPUB 3 CFI does not count text nodes. It makes no sense to count text nodes, since:

1) Text nodes can be normalized by the renderer
2) Offsets as numbers of characters in the terminal tag are recorded in the CFI in any case, making counting text nodes totally useless.

What the EPUB CFI spec does is assign odd numbered indices to represent the text between tags regardless of how many actual text nodes there are. So tags are always even numbered.

Right, I was (probably mis-)quoting from someone else's simplified explanation of the EPUB CFI but I really did know there could be more than one text node.

kovidgoyal · 09-24-2014, 08:10 AM

You should not use BeautifulSoup to parse. The parsing strategy to follow would be:

1) Try to parse as XML, implementing various simple corrections so that only slightly invalid documents still parse.
2) If (1) fails, parse as HTML 5
3) If (2) fails, parse as HTML 4 and/or use BeautifulSoup

See parse_utils.py in the calibre source code.

Of course, the correct solution is to use the exact parsing algorithm used by the software that generated the CFI. Since that is not practical, IMO the above cascade will likely give you the best results, with perhaps a few modifications to handle common cases.
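The cascade above can be sketched in Python roughly as follows. This is not the code in parse_utils.py: `parse_with_fallbacks` and `parse_xml` are hypothetical names, only the strict XML step is implemented (with the standard library and without the "simple corrections" mentioned in step 1), and the HTML 5 and HTML 4 steps are left as callables to be supplied by whatever parser is available.

```python
import xml.etree.ElementTree as ET


def parse_with_fallbacks(markup, parsers):
    """Try each (name, parse) pair in order.

    Returns (name, result) from the first parser that does not raise.
    """
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(markup)
        except Exception as exc:  # a failed parser just means: try the next one
            errors.append(f"{name}: {exc}")
    raise ValueError("all parsers failed: " + "; ".join(errors))


def parse_xml(markup):
    """Step 1: strict XML parse (no corrections applied in this sketch)."""
    return ET.fromstring(markup)


snippet = (
    '<h1 class="part" id="heading_id_2">'
    '<a id="page10"/>'
    '<img alt="" src="../Images/Wint_9781594745775_epub_001_r1.jpg"/>'
    "</h1>"
)
# Further (name, parse) pairs for HTML 5 and HTML 4 would be appended here.
name, root = parse_with_fallbacks(snippet, [("xml", parse_xml)])
child_tags = [child.tag for child in root]
print(name, child_tags)
```

Parsed as XML, the self-closing anchor stays empty and the img remains its sibling rather than becoming its child, so the element counts along the path are not changed by the parser.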
