Quote:
Originally Posted by ldolse
The way I do it in heuristics is just delete all the html tags with a regex (potentially error prone but faster than proper text conversion). Then I just count everything that's left.
That sounds like a sensible way to go about it. I'm not looking for an *exact* word count, just a rough idea of how 'long' the book is compared to other books (or short stories) in the collection. File size is misleading in this respect, since it includes the graphics, which, while nice to have, don't affect the 'reading length' of the book.
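For anyone following along, here is a minimal Python sketch of the strip-the-tags-and-count idea as I understand it. It is not ldolse's actual heuristics code; the function name `rough_word_count` and both regexes are my own assumptions, just to show the shape of the approach.

```python
import html
import re

# Assumed patterns, for illustration only (not taken from the heuristics code).
TAG_RE = re.compile(r"<[^>]+>")             # anything that looks like an HTML tag
WORD_RE = re.compile(r"\w+(?:['’-]\w+)*")   # words, keeping don't / well-known whole


def rough_word_count(markup):
    """Return an approximate word count for a chunk of HTML."""
    # Replace tags with a space so words on either side of a tag don't merge.
    text = TAG_RE.sub(" ", markup)
    # Turn entities such as &amp; or &nbsp; back into plain characters.
    text = html.unescape(text)
    return len(WORD_RE.findall(text))


if __name__ == "__main__":
    print(rough_word_count("<p>Hello <b>big</b> world</p>"))
```

As ldolse says, this is error prone: text inside `<script>` or `<style>` elements would be counted too, and a stray `<` in the text can swallow words. For a proportional "how long is this book" figure that is probably good enough.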
If we had access to a stored word count for each book, it could easily be converted to an approximate page count by dividing by 200 (or 500, or whatever the typical number of words on a page is; I don't know).
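The conversion itself is trivial. A small sketch, assuming the word count is already known; `estimated_pages` is a made-up name, and the default of 200 words per page is just the guess from above, not an established figure:

```python
import math


def estimated_pages(word_count, words_per_page=200):
    """Convert a word count to an approximate page count, rounding up."""
    if words_per_page <= 0:
        raise ValueError("words_per_page must be positive")
    # Round up so that even a very short story counts as at least one page.
    return math.ceil(word_count / words_per_page)


if __name__ == "__main__":
    print(estimated_pages(80000))        # with the assumed 200 words per page
    print(estimated_pages(80000, 500))   # with a denser 500 words per page
```

Since the point is only to compare books against each other, the exact words-per-page value matters less than using the same one for the whole collection.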
One quick question though: would the calculated word counts be persistent, or would they have to be re-calculated (at the user's request of course) each time the program starts up?
Thanks!
cc