#1 |
Resident Curmudgeon
Posts: 80,503
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Bug in get_wordcount_obj(book_text)
I have reported a bug in the Count Pages plugin where it can count multiple words as one word.
For example:
except…if
except—if
except–if
Each of these is counted as one word instead of two. It turns out the bug is in Calibre.
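A minimal sketch of how this kind of miscount can arise. It is not calibre's `get_wordcount_obj` and not the ICU word iterator; both functions below are hypothetical standard-library stand-ins. The first treats any run of non-whitespace characters as one word, and the second treats punctuation such as an ellipsis or dash as a separator.

```python
import re

def count_whitespace_tokens(text):
    """Naive count: every run of non-whitespace characters is one word."""
    return len(text.split())

def count_word_runs(text):
    """Count runs of letters/digits, allowing an internal apostrophe.

    Punctuation such as an ellipsis or a dash ends the current word.
    """
    return len(re.findall(r"[^\W_]+(?:['’][^\W_]+)*", text))

# The three examples from the post above.
samples = ["except…if", "except—if", "except–if"]

for sample in samples:
    print(sample, count_whitespace_tokens(sample), count_word_runs(sample))
```

For each of the three samples the naive count is 1 and the punctuation-aware count is 2.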
#2 |
creator of calibre
Posts: 45,579
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's not a bug in calibre. The get_wordcount function is not designed for accurate word counts; it is used only in heuristics that try to auto-detect chapter boundaries based on approximate word counts. I have no idea why the Count Pages plugin uses that function. It should instead use the ICU word iterator functions. For examples of their use, see break_iterator.py.
|
#3 |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
The ICU word iterator seems to be used by the editor. Was it included in calibre when kiwidude created the plugin?
And for those who are interested, here are the word counts using the two methods for the first eleven books listed in one of my libraries.
Code:
_get_epub_standard_word_count - get_wordcount_obj: 82056
_get_epub_standard_word_count - split_into_words: 81026
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.012711969
_get_epub_standard_word_count - get_wordcount_obj: 137576
_get_epub_standard_word_count - split_into_words: 136651
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00676906865
_get_epub_standard_word_count - get_wordcount_obj: 75437
_get_epub_standard_word_count - split_into_words: 74991
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00594738035
_get_epub_standard_word_count - get_wordcount_obj: 55969
_get_epub_standard_word_count - split_into_words: 55810
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0028489518
_get_epub_standard_word_count - get_wordcount_obj: 123067
_get_epub_standard_word_count - split_into_words: 120726
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01939101768
_get_epub_standard_word_count - get_wordcount_obj: 36686
_get_epub_standard_word_count - split_into_words: 36032
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01815053286
_get_epub_standard_word_count - get_wordcount_obj: 5995
_get_epub_standard_word_count - split_into_words: 5853
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0242610627
_get_epub_standard_word_count - get_wordcount_obj: 100406
_get_epub_standard_word_count - split_into_words: 99683
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00725299198
_get_epub_standard_word_count - get_wordcount_obj: 21751
_get_epub_standard_word_count - split_into_words: 21620
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00605920444
_get_epub_standard_word_count - get_wordcount_obj: 18539
_get_epub_standard_word_count - split_into_words: 18458
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0043883411
_get_epub_standard_word_count - get_wordcount_obj: 57546
_get_epub_standard_word_count - split_into_words: 56533
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01791873773

Personally, I don't care. The current count is close enough for most purposes. I'd be tempted to round the numbers to the nearest 1000 for most books.
One thing I do see is that the ICU word iterator can take a language as a parameter. That probably makes it better for non-English users. Unfortunately, adding that would need a lot more changes.
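The ratio and the "round to the nearest 1000" idea in this post can be sketched with plain arithmetic. This is only an illustration: the helper names are hypothetical, and the counts are three of the pairs copied from the listing in this post, not new measurements.

```python
def ratio(count_a, count_b):
    """Return count_a / count_b as a float."""
    return count_a / count_b

def round_to_nearest_thousand(n):
    """Round a non-negative integer to the nearest 1000 (halves round up)."""
    return (n + 500) // 1000 * 1000

# (get_wordcount_obj, split_into_words) pairs taken from the listing.
pairs = [(82056, 81026), (123067, 120726), (5995, 5853)]

for old_count, new_count in pairs:
    print(
        old_count,
        new_count,
        round(ratio(old_count, new_count), 4),
        round_to_nearest_thousand(old_count),
        round_to_nearest_thousand(new_count),
    )
```

Note that rounding does not always hide the difference between the two methods: counts that sit on opposite sides of a 500 boundary still round to different thousands.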
#4 |
creator of calibre
|
ICU has existed in calibre for a very long time. Whether that specific module (which, as you say, is used by the editor) existed back then, I am not sure, but probably not.
|
#5 |
Grand Sorcerer
|
@Kovid: Thanks for the word_count method. That makes everything that little bit simpler.
|
#6 |
creator of calibre
|
You're welcome