Old 01-06-2016, 05:25 PM   #1
JSWolf
Resident Curmudgeon
Posts: 80,513
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Bug in get_wordcount_obj(book_text)

I have reported a bug in the Count Pages plugin where it can count multiple words as one word.

except…if
except—if
except–if

Those are counted as one word each instead of two words.
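A minimal sketch of how this miscount can arise, assuming the count comes from splitting on whitespace only (the thread does not confirm how get_wordcount_obj tokenizes). The names count_whitespace and count_regex are hypothetical helpers for illustration, not part of calibre or the plugin.

```python
import re

# The three strings from the report: ellipsis, em dash, en dash between words.
SAMPLES = ["except\u2026if", "except\u2014if", "except\u2013if"]

def count_whitespace(text):
    """Count words by splitting on whitespace only (the assumed naive method)."""
    return len(text.split())

# Runs of letters/digits, allowing an internal apostrophe as in "don't".
WORD_RE = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")

def count_regex(text):
    """Count words by matching word-like runs, so punctuation separates words."""
    return len(WORD_RE.findall(text))

for sample in SAMPLES:
    print(repr(sample), count_whitespace(sample), count_regex(sample))
```

None of the three joining characters is whitespace, so the whitespace split sees one token per string, while the pattern-based count sees two.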

It turns out the bug is in Calibre.

Quote:
Originally Posted by PeterT
You probably have to open this as a calibre bug; the plugin uses
Code:
from calibre.utils.wordcount import get_wordcount_obj

    book_text = _read_epub_contents(iterator, strip_html=True)
    wordcount = get_wordcount_obj(book_text)
to get the word count
Old 01-06-2016, 10:54 PM   #2
kovidgoyal
creator of calibre
Posts: 45,579
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It's not a bug in calibre. The get_wordcount function is not designed for accurate word counts; it is used only in heuristics that try to auto-detect chapter boundaries based on approximate word counts. I have no idea why the Count Pages plugin uses that function. It should use the ICU word iterator functions instead. For examples of their use, see break_iterator.py.
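A rough sketch of what switching to the word iterator might look like. The import path calibre.spell.break_iterator and the signature split_into_words(text, lang) are assumptions here (the thread only names break_iterator.py and split_into_words), and the import only works inside calibre's own Python environment. The regex fallback is a stand-in so the sketch runs anywhere; it is not ICU and will not match ICU's results in every case.

```python
import re

try:
    # Assumed location of the ICU-backed helper; only available inside calibre.
    from calibre.spell.break_iterator import split_into_words
except ImportError:
    def split_into_words(text, lang='en'):
        # Stand-in for illustration only: runs of word characters, lang ignored.
        return re.findall(r"\w+", text)

def word_count(text, lang='en'):
    """Hypothetical helper: count words using whichever splitter was loaded."""
    return len(split_into_words(text, lang))

print(word_count("except\u2014if"))
```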
Old 01-07-2016, 12:09 AM   #3
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
The ICU word iterator seems to be used by the editor. Was it included in calibre when kiwidude created the plugin?

And for those who are interested, here are the word counts using the two methods for the first eleven books listed in one of my libraries.
Code:
_get_epub_standard_word_count - get_wordcount_obj: 82056
_get_epub_standard_word_count - split_into_words: 81026
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.012711969

_get_epub_standard_word_count - get_wordcount_obj: 137576
_get_epub_standard_word_count - split_into_words: 136651
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00676906865

_get_epub_standard_word_count - get_wordcount_obj: 75437
_get_epub_standard_word_count - split_into_words: 74991
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00594738035

_get_epub_standard_word_count - get_wordcount_obj: 55969
_get_epub_standard_word_count - split_into_words: 55810
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0028489518

_get_epub_standard_word_count - get_wordcount_obj: 123067
_get_epub_standard_word_count - split_into_words: 120726
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01939101768

_get_epub_standard_word_count - get_wordcount_obj: 36686
_get_epub_standard_word_count - split_into_words: 36032
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01815053286

_get_epub_standard_word_count - get_wordcount_obj: 5995
_get_epub_standard_word_count - split_into_words: 5853
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0242610627

_get_epub_standard_word_count - get_wordcount_obj: 100406
_get_epub_standard_word_count - split_into_words: 99683
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00725299198

_get_epub_standard_word_count - get_wordcount_obj: 21751
_get_epub_standard_word_count - split_into_words: 21620
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.00605920444

_get_epub_standard_word_count - get_wordcount_obj: 18539
_get_epub_standard_word_count - split_into_words: 18458
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.0043883411

_get_epub_standard_word_count - get_wordcount_obj: 57546
_get_epub_standard_word_count - split_into_words: 56533
_get_epub_standard_word_count - get_wordcount_obj/split_into_words: 1.01791873773
The "get_wordcount_obj" number comes from the method Count Pages uses, and the "split_into_words" number comes from the ICU word iterator. I divide the former by the latter as a quick way to see how much they differ.

Personally, I don't care. The current count is close enough for most purposes. I'd be tempted to round the numbers to the nearest 1000 for most books. One thing I do see is that the ICU word iterator can take a language as a parameter, which probably makes it better for non-English users. Unfortunately, adding that would need a lot more changes.
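For anyone who wants to reproduce the comparison, here is a small sketch of the arithmetic using three of the count pairs from the listing above. The function names are hypothetical; only the numbers come from the thread.

```python
# (get_wordcount_obj count, split_into_words count) taken from the listing above.
PAIRS = [(82056, 81026), (137576, 136651), (5995, 5853)]

def compare(old_count, icu_count):
    """Return the ratio of the two counts and the difference as a percentage."""
    ratio = old_count / icu_count
    percent = (old_count - icu_count) / icu_count * 100
    return ratio, percent

def round_to_thousand(count):
    """Round a word count to the nearest 1000."""
    return int(round(count, -3))

for old_count, icu_count in PAIRS:
    ratio, percent = compare(old_count, icu_count)
    print(f"{ratio:.6f}  {percent:.2f}%  "
          f"{round_to_thousand(old_count)} vs {round_to_thousand(icu_count)}")
```

For the two larger books the rounded figures differ by at most 1000, and for the shortest one they coincide, which is the point of the "close enough" remark.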
Old 01-07-2016, 01:46 AM   #4
kovidgoyal
creator of calibre
ICU has existed in calibre for a very long time. Whether that specific module (which, as you say, is used by the editor) existed back then, I am not sure, but probably not.
Old 01-07-2016, 09:30 PM   #5
davidfor
Grand Sorcerer
@Kovid: Thanks for the word_count method. That makes everything that little bit simpler.
Old 01-07-2016, 09:56 PM   #6
kovidgoyal
creator of calibre
You're welcome
All times are GMT -4.