Old 11-24-2010, 07:35 AM   #6
Man Eating Duck
Quote:
Originally Posted by ldolse
How do you want to do the word count? Funnily enough I'm adding that this week for some other reasons, but I wasn't planning to do anything that was exposed to an end user.
Out of curiosity: What will you use it for?
I do this now by converting all books to text, running them through wc, and inserting the values into a custom column with SQL. It works well enough to give me an estimate of the book length; any fully automated approach will be slightly inflated anyway due to extra content. Word count is an informative number to me, and it's just about the only metric that makes sense and is somewhat consistent for ebooks. Character count is another, but it's not immediately meaningful to readers.
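For anyone who wants to try the same workflow, here is a minimal sketch in Python. It assumes the book has already been converted to plain text, uses whitespace splitting as a rough stand-in for `wc -w`, and writes to a made-up `word_counts` table in an in-memory SQLite database. The table and the helper names are hypothetical, not Calibre's actual schema.

```python
import sqlite3


def wc_words(text):
    """Count whitespace-separated tokens, roughly what `wc -w` reports."""
    return len(text.split())


def store_word_count(conn, book_id, count):
    """Insert or update the count for one book in the hypothetical table."""
    conn.execute(
        "INSERT OR REPLACE INTO word_counts (book, value) VALUES (?, ?)",
        (book_id, count),
    )


# In-memory database with an illustrative table (NOT Calibre's real layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_counts (book INTEGER PRIMARY KEY, value INTEGER)"
)

sample = "It was the best of times,\nit was the worst of times."
store_word_count(conn, 1, wc_words(sample))
conn.commit()
```

In a real library you would loop over the converted text files and map each one to its book id; how that mapping is obtained is left out here.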
Quote:
My implementation is relatively simplistic for HTML: I'm just deleting everything in the <head> section and then removing all the other tags with a regex. It's probably not always perfect, but it's fast. Once that's done I'm using this code to do the actual count:
http://ginstrom.com/scribbles/2007/1...s-with-python/
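A rough sketch of the approach described here, as I understand it: drop the <head> section, strip the remaining tags with a regex, then count. The function name and the regexes are my own guesses, and a plain `str.split()` stands in for the linked counting code, which is not reproduced here.

```python
import re
from html import unescape

# Matches the whole <head>...</head> section, across lines, any case.
HEAD_RE = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)
# Matches any remaining tag. Simplistic: a '>' inside an attribute breaks it.
TAG_RE = re.compile(r"<[^>]+>")


def count_words_regex(html_text):
    """Count words in HTML by regex-stripping the head and all tags."""
    body = HEAD_RE.sub(" ", html_text)
    # Replace each tag with a space so words in adjacent elements don't merge.
    text = TAG_RE.sub(" ", body)
    # Turn entities such as &amp; into the characters they stand for.
    text = unescape(text)
    return len(text.split())


page = (
    "<html><head><title>Ignored title</title></head><body>"
    "<h1>Chapter One</h1><p>Call me <i>Ishmael</i> today.</p>"
    "</body></html>"
)
print(count_words_regex(page))
```

Replacing tags with a space has a side effect: punctuation that directly follows an inline closing tag becomes its own token, which inflates the count slightly.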

A potentially more accurate option is this extra code, which uses a proper parser to extract all translatable words (which was the original goal of this author):
http://ginstrom.com/scribbles/2008/0...e-with-python/
I think that the first approach gives a more intuitive result for books, as the second seems to include alt tag text among other things. The second is also more complex.
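To illustrate the difference between the two approaches, here is a small parser-based sketch using Python's standard `html.parser`. It is not the linked code; the class name, function name, and the `include_alt` flag are my own, added only to show how counting alt text changes the result.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, optionally including img alt attributes."""

    SKIP = {"head", "script", "style"}

    def __init__(self, include_alt=False):
        super().__init__(convert_charrefs=True)
        self.include_alt = include_alt
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.include_alt and self.skip_depth == 0:
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def count_words_parsed(html_text, include_alt=False):
    """Count words in the text a parser extracts from the HTML."""
    parser = TextExtractor(include_alt)
    parser.feed(html_text)
    parser.close()
    return len(" ".join(parser.chunks).split())


page = (
    "<html><head><title>Ignored</title></head><body>"
    '<p>A map <img src="m.png" alt="Map of the island"> follows.</p>'
    "</body></html>"
)
print(count_words_parsed(page), count_words_parsed(page, include_alt=True))
```

With the alt text included, the same page reports more words than a reader would actually see in the running text, which is why the simpler count feels more intuitive for books.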

Quote:
Anyway I could put the word count into the debug log so you could see it in the job details.
Sure, couldn't hurt.

Word count is not my motivation for starting this thread; it's just something I might do to become familiar with Python and Calibre. I like tinkering with code, but I'm certainly not capable of writing production-quality code right now, and don't know if I ever will be.