01-07-2016, 12:09 AM   #3
davidfor
The ICU word iterator seems to be used by the editor. Was it included in calibre when kiwidude created the plugin?

And for those who are interested, here are the word counts using the two methods for the first eleven books listed in one of my libraries.
Code:
_get_epub_standard_word_count - get_wordcount_obj: 82056
_get_epub_standard_word_count - split_into_words: 81026
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.012711969

_get_epub_standard_word_count - get_wordcount_obj: 137576
_get_epub_standard_word_count - split_into_words: 136651
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00676906865

_get_epub_standard_word_count - get_wordcount_obj: 75437
_get_epub_standard_word_count - split_into_words: 74991
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00594738035

_get_epub_standard_word_count - get_wordcount_obj: 55969
_get_epub_standard_word_count - split_into_words: 55810
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0028489518

_get_epub_standard_word_count - get_wordcount_obj: 123067
_get_epub_standard_word_count - split_into_words: 120726
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01939101768

_get_epub_standard_word_count - get_wordcount_obj: 36686
_get_epub_standard_word_count - split_into_words: 36032
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01815053286

_get_epub_standard_word_count - get_wordcount_obj: 5995
_get_epub_standard_word_count - split_into_words: 5853
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0242610627

_get_epub_standard_word_count - get_wordcount_obj: 100406
_get_epub_standard_word_count - split_into_words: 99683
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00725299198

_get_epub_standard_word_count - get_wordcount_obj: 21751
_get_epub_standard_word_count - split_into_words: 21620
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00605920444

_get_epub_standard_word_count - get_wordcount_obj: 18539
_get_epub_standard_word_count - split_into_words: 18458
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0043883411

_get_epub_standard_word_count - get_wordcount_obj: 57546
_get_epub_standard_word_count - split_into_words: 56533
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01791873773
The "get_wordcount_obj" number comes from the method Count Pages currently uses, and "split_into_words" uses the ICU word iterator. I divide the former by the latter as a quick way to see how much they differ.
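For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the ratios from the counts listed above. The counts are copied from the listing; the `compare` helper and the variable names are made up for illustration and are not part of Count Pages or calibre.

```python
# (get_wordcount_obj, split_into_words) pairs copied from the listing above.
counts = [
    (82056, 81026),
    (137576, 136651),
    (75437, 74991),
    (55969, 55810),
    (123067, 120726),
    (36686, 36032),
    (5995, 5853),
    (100406, 99683),
    (21751, 21620),
    (18539, 18458),
    (57546, 56533),
]


def compare(old_count, icu_count):
    """Hypothetical helper: return (ratio, percent difference) of old over ICU."""
    ratio = old_count / icu_count
    return ratio, (ratio - 1.0) * 100.0


ratios = [compare(old, icu)[0] for old, icu in counts]

for (old, icu), ratio in zip(counts, ratios):
    print(f"{old:>7} {icu:>7} {ratio:.6f} {(ratio - 1.0) * 100.0:.2f}%")

# Spread between the closest and the furthest pair.
print(f"min {min(ratios):.4f}, max {max(ratios):.4f}")
```

In every pair the Count Pages number is the larger of the two, so all of the ratios come out slightly above 1.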

Personally, I don't care. The current count is close enough for most purposes: across these eleven books the two methods differ by roughly 0.3% to 2.4%. I'd be tempted to round the numbers to the nearest 1000 for most books. One thing I do see is that the ICU word iterator can take a language as a parameter, which probably makes it better for non-English users. Unfortunately, adding that would need a lot more changes.
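As a rough idea of what rounding to the nearest 1000 would look like, here is a small sketch. The `round_to_nearest` function is a hypothetical helper, not something Count Pages provides, and it assumes non-negative counts with halves rounded up.

```python
def round_to_nearest(count, step=1000):
    """Round a non-negative count to the nearest multiple of step (halves round up)."""
    return ((count + step // 2) // step) * step


# A few (get_wordcount_obj, split_into_words) pairs from the listing above.
samples = [(82056, 81026), (5995, 5853), (123067, 120726)]

for old_count, icu_count in samples:
    print(old_count, "->", round_to_nearest(old_count),
          "|", icu_count, "->", round_to_nearest(icu_count))
```

Rounding hides the difference for some books but not all of them: a pair that straddles a rounding boundary still ends up 1000 or more apart.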