Old 01-04-2017, 07:07 PM   #961
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by theducks View Post
Use Current Preferred output format: (<show the value as a hint only>)

Standard Calibre usage locks in the then current settings
eg Convert


BTW my Library used ADE page counts historically (Astak PEz).

I now use my K4NT
CP won't count using ADE when only an AZW3 exists
I really don't want to recalc all those (besides, changes the Last Modified flag)

A count method should be just an algorithm to use (assuming the file can be processed). If I want to count a Kindle book using ADE rules, why not?
The problem is the algorithm. The ADE page count uses the compressed size of the internal text files. It doesn't matter what is in the files: it just divides the compressed size of each file by 1024 to get that file's page count and then adds them up. There are all sorts of things wrong with that as an algorithm, but that's the way it works. A simple one is that if you use a different compression level, you get a different page count.

I suppose the same could be done for AZW3, but I have no idea of the internal structure of the format, so I don't know if it is constructed in a suitable way. Plus, whatever compression is used will affect it.

I'll put it on the list to look at, but I can't promise anything.
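To make the description above concrete, here is a minimal Python sketch of a size-based page count. It is not the plugin's code. The helper names, the choice of file extensions that count as "text files", and the lack of per-file rounding are all assumptions made for illustration. It builds a toy archive twice to show how the compression setting alone changes the result.

```python
import io
import random
import zipfile

# Assumption: which archive members count as "internal text files".
TEXT_EXTENSIONS = (".xhtml", ".html", ".htm")


def ade_style_page_count(epub_file):
    """Sum compressed_size / 1024 over the text files in an EPUB (ZIP) container."""
    with zipfile.ZipFile(epub_file) as zf:
        return sum(
            info.compress_size / 1024
            for info in zf.infolist()
            if info.filename.lower().endswith(TEXT_EXTENSIONS)
        )


def build_sample_archive(compression, compresslevel=None):
    """Build a toy in-memory ZIP holding one chapter, using the given compression."""
    rng = random.Random(0)  # fixed seed so both archives hold identical text
    words = ["page", "count", "epub", "kindle", "reader", "library", "format"]
    body = " ".join(rng.choice(words) for _ in range(20000))
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression,
                         compresslevel=compresslevel) as zf:
        zf.writestr("chapter1.xhtml",
                    "<html><body><p>%s</p></body></html>" % body)
    buf.seek(0)
    return buf


# Same text, different compression, different "page count".
stored_pages = ade_style_page_count(build_sample_archive(zipfile.ZIP_STORED))
deflated_pages = ade_style_page_count(
    build_sample_archive(zipfile.ZIP_DEFLATED, compresslevel=9))
print("stored:", stored_pages, "deflated:", deflated_pages)
```

The content never changes between the two runs, only the way it is packed, which is exactly the weakness described above.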
Old 01-04-2017, 07:08 PM   #962
JSWolf
Resident Curmudgeon
Posts: 80,083
Karma: 147983159
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by davidfor View Post
That doesn't make sense. There is no ICU method for counting ADE pages and the only overlapping code between the word count and the ADE page count is opening the epub for counting.

But, assuming you meant there was a big difference between the non-ICU word count with the beta and the word count with the released version, I need examples. And a definition of "really incorrect". When I compared the two, the difference is the number of files in the book. That is expected with the fix I put into the code for the word count.

If the difference is bigger than that, then I need actual examples I can look at including the counts you are seeing with the two versions.
Under Word count options there is a check for Use ICU algorithm for counting words.

ICU on
Code:
Page count: 255.0
Word count using icu_wordcount - trying to count_words
Word count - used count_words: 94052
Word count: 94052
ICU off
Code:
Page count: 255.0
Word count using older method - trying to count_words
Word count: 96049
I think that's a pretty big difference. I've scrambled the ePub so I can attach it.

Last edited by JSWolf; 01-04-2017 at 07:40 PM.
Old 01-04-2017, 07:39 PM   #963
BetterRed
null operator (he/him)
Posts: 21,836
Karma: 30277270
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@JSWolf - maybe hyphenations are counted as one word by ICU and multiple words by non-ICU.

Select Tools->Reports->Words in the calibre book editor, filter with '-', save to csv, and get your spreadsheet to accumulate the Times Used column; nota bene: if my suggestion is true then forget-me-not would be counted as three words by non-ICU

BR
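The spreadsheet step suggested above can also be sketched in a few lines of Python. This is only an illustration: the sample data is made up, and the "Word" column name is an assumption (only "Times Used" is named in the post). It estimates how many extra words a counter that splits on hyphens would report compared with one that keeps hyphenated words whole.

```python
import csv
import io

# Hypothetical export: only the "Times Used" column name comes from the post.
SAMPLE_CSV = """Word,Times Used
forget-me-not,3
well-known,10
house,42
"""


def extra_words_from_hyphens(csv_text, word_column="Word",
                             count_column="Times Used"):
    """Estimate the extra words reported when hyphenated words are split up."""
    extra = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        word = row[word_column]
        if "-" not in word:
            continue
        parts = [p for p in word.split("-") if p]
        # A word with N parts adds N - 1 extra words each time it is used.
        extra += max(len(parts) - 1, 0) * int(row[count_column])
    return extra


extra = extra_words_from_hyphens(SAMPLE_CSV)
print(extra)
```

If the number this produces for a real book is close to the gap between the two plugin counts, hyphenation explains most of the difference.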
Old 01-04-2017, 07:40 PM   #964
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by JSWolf View Post
Under Word count options there is a check for Use ICU algorithm for counting words.

ICU on
Code:
Page count: 255.0
Word count using icu_wordcount - trying to count_words
Word count - used count_words: 94052
Word count: 94052
ICU off
Code:
Page count: 255.0
Word count using older method - trying to count_words
Word count: 96049
I think that's a pretty big difference. I can use the scramble plugin to scramble it and then attach it if you'd like.
Jon: I never would have worked out from your original post that you were talking about the difference between the ICU based word count and the original algorithm.

In any case, go back to the discussion in this thread at this time last year. You were the one that started that "discussion" by pointing out a possible bug. And that was the point of adding the ICU method, as it seemed to handle some things in a better way and was language aware. Back then, I did post explanations of some of the differences if you want to look.

Also, both methods rely on code in calibre. If that is updated, then it might change the count the plugin produces.

Personally, I expect both numbers to be wrong. I tend to think the ICU method is the more accurate, but that is based on me counting very small samples. I take all the statistics produced by the plugin as approximations. And during last year's discussion I was very tempted to introduce a "nearest 1000" option. Of course, that would raise the argument of rounding vs truncating.
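The rounding-versus-truncating question for a "nearest 1000" option can be illustrated with a short sketch. No such option exists in the plugin according to this thread. The function names are hypothetical, and 94600 is a made-up count chosen because the two approaches disagree on it.

```python
def round_to_nearest(count, step=1000):
    """Round a non-negative count to the nearest multiple of step (halves go up)."""
    return ((count + step // 2) // step) * step


def truncate_to(count, step=1000):
    """Drop everything below the given step."""
    return (count // step) * step


# 94052 and 96049 are the counts quoted earlier; 94600 is hypothetical.
for count in (94052, 96049, 94600):
    print(count, round_to_nearest(count), truncate_to(count))
```

For the two counts quoted earlier, both approaches give the same answer; they only diverge when the remainder is 500 or more.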
Old 01-04-2017, 07:43 PM   #965
JSWolf
Resident Curmudgeon
Posts: 80,083
Karma: 147983159
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
The ICU count is more accurate than the non-ICU method. The ICU gets closer to the correct word count. The difference from the word count produced by Word 2016 when the ePub is converted to RTF is 36. Not a lot and not enough to be bothered with. I've scrambled and posted the ePub in the message with the counts if you are interested in seeing it.
Old 01-04-2017, 07:49 PM   #966
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by BetterRed View Post
@JSWolf - maybe hyphenations are counted as one word by ICU and multiple words by non-ICU.

Select Tools->Reports->Words in the calibre book editor, filter with '-', save to csv, and get your spreadsheet to accumulate the Times Used column; nota bene: if my suggestion is true then forget-me-not would be counted as three words by non-ICU
Yes, that's one of the differences. They also handle the other characters that could be used as hyphens differently.

And for reference, the editor is using the ICU algorithm. There is a difference in the counts between the plugin and the editor. I haven't gotten around to looking at what the difference is yet.
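A small sketch shows how the treatment of hyphens alone changes a word count. Neither function below is the ICU algorithm or the plugin's older method. They are two simplified regular-expression tokenizers, and the set of hyphen characters is an assumption chosen for illustration.

```python
import re

# Assumption: hyphen-minus, U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN
# are the characters allowed to join the parts of a word.
KEEP_HYPHENS = re.compile(r"\w+(?:[-\u2010\u2011]\w+)*")
SPLIT_HYPHENS = re.compile(r"\w+")


def count_keeping_hyphens(text):
    """Count 'forget-me-not' as one word."""
    return len(KEEP_HYPHENS.findall(text))


def count_splitting_hyphens(text):
    """Count 'forget-me-not' as three words."""
    return len(SPLIT_HYPHENS.findall(text))


sample = "A well-known forget-me-not grew here."
print(count_keeping_hyphens(sample), count_splitting_hyphens(sample))
```

On a full-length novel, a few hundred hyphenated words are enough to move the total by the kind of margin reported earlier in the thread.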
Old 01-04-2017, 07:53 PM   #967
JSWolf
Resident Curmudgeon
Posts: 80,083
Karma: 147983159
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
I have it... Convert to text, count the words, delete the text version and done.
Old 01-04-2017, 10:12 PM   #968
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by JSWolf View Post
I have it... Convert to text, count the words, delete the text version and done.
Go ahead. Then show me the differences so I can decide. And don't forget to tell me what you consider a word. The delimiters chosen are part of the differences in the two algorithms we are talking about. As I said, I mentioned some of this last year during the discussion.

As I said, I take either count as an approximation. Until someone demonstrates that one or the other is wrong, and how, I am going to accept that they work.
Old 01-04-2017, 10:32 PM   #969
BetterRed
null operator (he/him)
Posts: 21,836
Karma: 30277270
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by davidfor View Post
Yes, that's one of the differences. They also handle the other characters that could be used as hyphens differently.

And for reference, the editor is using the ICU algorithm. There is a difference in the counts between the plugin and the editor. I haven't gotten around to looking at what the difference is yet.
The editor includes the content.opf in its counts and in its spell checking. Try typing 'rubbbbish' into a tag, and 'aaardvaaark' into a comment, then Polish the book with update metadata set and open it in the editor and have a look.

I often remove description, subjects etc from the content.opf (with Sigil) before I use calibre's spell checker, why calibre's spell-checker and not Sigil's - because it's multi-lingual.

I would prefer that the content.opf file not be included in the PI's calculations. I'd also quite like some front and back matter to be excluded, but that's a much bigger ask.

BR
Old 01-04-2017, 10:55 PM   #970
theducks
Well trained by Cats
Posts: 31,147
Karma: 60406498
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by davidfor View Post

I'll put it on the list to look at, but I can't promise anything.
(If it had been simple, Kiwidude would have probably done it )
Old 01-04-2017, 10:58 PM   #971
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by BetterRed View Post
The editor includes the content.opf in its counts and in its spell checking. Try typing 'rubbbbish' into a tag, and 'aaardvaaark' into a comment, then Polish the book with update metadata set and open it in the editor and have a look.
Yes, you're right. I happen to have a book with a single word in the text. The plugin counts one (both methods!) but the editor shows 141. It is also counting the words in the NCX and title tags in each of the files. I suppose it makes sense for the editor as it is running the spelling check on all of these. The plugin doesn't touch these, only the words within the body tags.
Quote:
I often remove description, subjects etc from the content.opf (with Sigil) before I use calibre's spell checker, why calibre's spell-checker and not Sigil's - because it's multi-lingual.
Yes, I like the multilingual spelling. I will sometimes add the lang attribute to a tag to reduce the spelling errors shown when there is non-English dialog.
Quote:
I would prefer that the content.opf file not be included in the PI's calculations. I'd also quite like some front and back matter to be excluded, but that's a much bigger ask.
No, the plugin is looking at the text of the book, not the OPF or any of the metadata objects. Excluding the front and back matter would be good, but trying to decide what is back and front matter is not something I am very interested in doing.
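The gap between a body-only count and a whole-file count can be shown with a toy example. This is a sketch using Python's standard `html.parser`, not the plugin's or the editor's code. The `WordCounter` class and the whitespace-based word split are assumptions made for illustration.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Count words in the whole file and, separately, only inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.body_words = 0
        self.all_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        n = len(data.split())  # naive whitespace split, for illustration only
        self.all_words += n
        if self.in_body:
            self.body_words += n


SAMPLE = ("<html><head><title>A One Word Book</title></head>"
          "<body><p>Hello</p></body></html>")
counter = WordCounter()
counter.feed(SAMPLE)
counter.close()
print(counter.body_words, counter.all_words)
```

Adding the OPF and NCX text on top of the title tags would widen the gap further, which is consistent with the one-word book showing a much larger count in the editor.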
Old 01-04-2017, 11:50 PM   #972
BetterRed
null operator (he/him)
Posts: 21,836
Karma: 30277270
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by davidfor View Post
Excluding the front and back matter would be good, but trying to decide what is back and front matter is not something I am very interested in doing.
As I said "a much bigger ask"

IIRC someone wrote in the earlier discussion, "The only time word count accuracy matters is if someone is paying for or being paid for the words."

BR
Old 01-05-2017, 01:23 AM   #973
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
YAB - Use Preferred Input Format

OK, here's another beta with the option that BR asked for. The text for the label and tooltip is somewhere between what I had and BR had.

Other than fine-tuning the labels and tooltips or if someone finds a bug, I'm planning for this to be the last beta.
Attached Files
File Type: zip Count Pages-beta.zip (254.1 KB, 183 views)
Old 01-05-2017, 02:29 AM   #974
BetterRed
null operator (he/him)
Posts: 21,836
Karma: 30277270
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@davidfor - works for me, and as I hoped (knew) it would be, it's a heck of a lot faster now it's not redoing the conversion.

Thanks again.

BR

I don't think I'll ever understand why it didn't work this way from the get-go.
Old 01-05-2017, 05:35 AM   #975
JSWolf
Resident Curmudgeon
Posts: 80,083
Karma: 147983159
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by davidfor View Post
Go ahead. Then show me the differences so I can decide. And don't forget to tell me what you consider a word. The delimiters chosen are part of the differences in the two algorithms we are talking about. As I said, I mentioned some of this last year during the discussion.

As I said, I take either count as an approximation. Until someone demonstrates that one or the other is wrong, and how, I am going to accept that they work.
I used Calibre to convert to text. I loaded the text file into Notepad++ and used the Word Count function of the TextFX plugin and I get a count of 94750.

I will leave it up to you to say if this difference is enough to warrant any more changes.
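To put the three figures reported in this thread side by side, a quick calculation helps. The sketch below treats the Notepad++ figure as an arbitrary baseline. That choice is an assumption made for comparison only, not a claim that it is the correct count.

```python
# Counts reported in this thread for the same book.
counts = {
    "plugin, ICU on": 94052,
    "plugin, ICU off": 96049,
    "Notepad++ TextFX on calibre TXT output": 94750,
}


def relative_difference(value, baseline):
    """Difference of value from baseline, as a percentage of baseline."""
    return (value - baseline) / baseline * 100.0


baseline = counts["Notepad++ TextFX on calibre TXT output"]
for label, value in counts.items():
    diff = value - baseline
    pct = relative_difference(value, baseline)
    print(f"{label}: {value} ({diff:+d}, {pct:+.2f}%)")
```

Against that baseline, the ICU count is within one percent and the older method is a little over one percent away, so all three figures sit in the same narrow band.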