Old 08-31-2023, 02:54 AM   #1660
kiwidude
Calibre Plugins Developer
Posts: 4,733
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Certainly calibre has changed to a different algorithm at some point in the last 12 years. Kovid was kind enough to point to the code that generates the count for the ebook viewer. The problem I have at the moment is that it requires parsing the book in a different way than the plugin currently does.

I will see if Kovid has any suggestions that may help with implementing his way of processing the ebook, to see if I could reuse his actual viewer code. Frankly, without his help this will go in the too hard basket at this point in time, as my care factor is pretty low when it comes to "accuracy" of page counts. It is impossible to be definitive about a page count because of all the variables - any algorithm is an estimate only and should be used purely to compare the *relative* size of book A vs book B, with both using the same algorithm. You are never going to get "the one true value" as no such thing exists. Go look up paperback vs hardback edition page counts in the printed world, which vary based on font sizes, page sizes, layout choices etc - then add all the similar variables in the electronic world of device/screen sizes, and you will hopefully realise just how ambiguous the term "page" is outside of a printed edition. Right now this plugin may tell me 410 whereas the calibre viewer says 388, yet goodreads says the printed paperback edition is 240! It is simply not worth getting too worked up about.

I do agree it is currently misleading to imply this plugin produces the same count as the ebook viewer does in calibre today. If I can't make it produce exactly the same then I will rename things a bit to prevent misconceptions on that. If someone else wants to wrap their head around the relevant calibre code in render_book.py and make a contribution I am happy to take a look at merging it.

FYI, how the ebook viewer used to estimate (which is what this plugin currently does for this option) is just a very crude total number of characters in the html pages of the book, divided by 1000. The current ebook viewer code, by contrast, parses the body sections of pages, excludes various tags, strips out all whitespace and divides the remaining character count by 1000. It also treats every img tag it finds as a page. So unsurprisingly you get quite different counts - with the "old" page count approach generally producing a higher count.
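
To make the difference concrete, here is a rough Python sketch of the two estimates as described above. It is not the calibre code or the plugin code: the function names, the EXCLUDED_TAGS set and the use of the standard library html.parser are all my own assumptions for illustration, and the real viewer has its own exclusion list and rules.

```python
from html.parser import HTMLParser

CHARS_PER_PAGE = 1000
# Assumption: placeholder list only; the real viewer code has its own exclusions.
EXCLUDED_TAGS = {"script", "style"}


def old_estimate(html_files):
    """Crude approach: every character of the raw html counts, tags included."""
    return sum(len(html) for html in html_files) / CHARS_PER_PAGE


class BodyCounter(HTMLParser):
    """Counts non-whitespace body text and img tags, skipping excluded tags."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chars = 0
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body and tag in EXCLUDED_TAGS:
            self.skip_depth += 1
        elif self.in_body and tag == "img" and self.skip_depth == 0:
            self.images += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body and tag in EXCLUDED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            # Strip all whitespace before counting characters.
            self.chars += len("".join(data.split()))


def new_estimate(html_files):
    """Body text without whitespace / 1000, plus one page per img tag."""
    chars = images = 0
    for html in html_files:
        counter = BodyCounter()
        counter.feed(html)
        counter.close()
        chars += counter.chars
        images += counter.images
    return chars / CHARS_PER_PAGE + images
```

Because the old approach counts markup, head content and whitespace as well, it will normally come out higher for the same book unless the book is very image heavy.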

This plugin does already have a number of algorithms based around processing the body section of the html and stripping out all tags. However, it does so using raw text and regex, whereas Kovid's code parses the XHTML and applies some extra rules. For instance, my plugin code would strip all img tags completely rather than treating each as a page. I can't make my text based parsing match the counts of XHTML based parsing, which is why this becomes a lot more work...
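
As a rough illustration of why the text based approach drifts away from a parsed one, here is a minimal regex sketch in Python. The patterns and the function name regex_body_text are hypothetical and simplified, not the plugin's actual code.

```python
import re

# Assumption: simplified patterns for illustration only.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def regex_body_text(html):
    """Grab the body section as raw text and strip anything that looks like a tag."""
    match = BODY_RE.search(html)
    body = match.group(1) if match else html
    return TAG_RE.sub("", body)


# The img tag vanishes entirely, so it can never be counted as a page.
print(regex_body_text("<html><body><p>Hi</p><img src='a.png'/></body></html>"))

# Only the tags themselves are removed; text inside a tag that a parser
# could skip as a whole element is still left behind.
print(regex_body_text("<body><style>p{}</style>Hi</body>"))
```

With a real parser you know which element each piece of text belongs to, so you can skip whole elements or count them; once the html has been flattened to text that information is gone.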