Quote:
Originally Posted by BetterRed
A discussion about word counting without any discussion about when hyphenation should be used is a somewhat barren discussion. The latter is to some degree a matter of style, it's also a matter where there are no hard and fast rules. Google 'when to use hyphens' and read what the writerly commentariat and grammarians have to say on hyphenation.
|
Sorry, I don't think the discussion is about when to use hyphenation. It's about what is considered a word. Or, for the sake of using an algorithm that doesn't need a dictionary or full grammar, what is a word delimiter.
Quote:
The existing algorithm is NOT a bug - putting a Whitworth nut on a Metric bolt is a bug. But that JSWolf regards all opinions, other than his own, as bugs, is a proven fact
|
JSWolf reported three cases. The ellipsis example is a bug (always two words) and the em-dash is probably a bug (two words in most circumstances). The en-dash is harder. I think it would be two words in more cases than one, but I'm not sure. And Katsunami has pointed to another example: the soft hyphen should not be a word delimiter. That, to me, is a worse problem.
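To make the cases above concrete, here is a minimal sketch of a delimiter-based counter. The name `count_words` and the delimiter list are assumptions for illustration only; this is not the code in calibre or Count Pages. The en-dash is deliberately left out of the delimiters because that case is undecided.

```python
import re

# Assumption: whitespace, the ellipsis character (U+2026), the em-dash (U+2014)
# and a run of three or more full stops all separate words.
WORD_BREAK = re.compile("(?:\\s|\u2026|\u2014|\\.{3,})+")

SOFT_HYPHEN = "\u00ad"


def count_words(text):
    """Count words by splitting on the assumed delimiters."""
    # Remove soft hyphens first so they can never split a word.
    text = text.replace(SOFT_HYPHEN, "")
    # Splitting can leave empty strings at the edges; ignore them.
    return sum(1 for token in WORD_BREAK.split(text) if token)
```

With this list, "wait…what" and "one—two" each count as two words, while a word containing a soft hyphen still counts as one.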
Quote:
It's an issue of which algorithm to use. The one that has stood those who use it in good stead for nigh on 5 years. Or one that has only become available in recent times. There would be no discussion, from me at least, if the proposal was to add an option to use the existing or the ICU algorithms when computing word counts. IMO, adding an option would be in the 'spirit' of the original developer, who usually (always ?) protected 'legacy' features. If at all possible, existing 'installs' would set the option to use the 'legacy' algorithm, new installs would default to the 'ICU' algorithm.
|
The problem with the option is that there is no real long-term benefit in having it. If we were choosing between "ICU" vs "Simple spaces delimiter" vs "Some wacky word count method I found on the net", then yes, an option would make sense. To me, few people are interested in the exact number of words, and most of them will be horrified that the count is wrong and want it fixed. For most of the rest, an approximation is good enough, and that is why my initial reaction was "who cares". When I looked again, the language/locale issues were what made me decide to look at the changes.
Saying that it has served the users for five years is a problem. The code for both methods is in calibre. Are you sure that the existing method hasn't changed in five years? Are you sure it won't change in the future? I'm a little surprised that Kovid didn't remove the old method when he implemented the ICU method. Sure, he would have left the interface, but that would have just pointed to the new code.
And for changes to the algorithm, if it had been implemented completely inside Count Pages and the issue was that, for example, the ellipsis character was not in the list of word delimiter characters, I would have had no hesitation in adding it. Would you expect an option to keep the old behaviour in that case?
Quote:
Support forums are riddled with complaints about Apple, MS, Google etc blithely clobbering/discontinuing existing features. Less so with IBM, if you're minded you can definitely run IEBGENER and probably DISSOS or PROFS on your shiny new z/OS system.
|
"PROFS", I haven't heard mention of that for a LONG time.
Quote:
Facetiously, one might suggest an option to include the components of hyphenated words in the word count if they are present in designated dictionaries. Thus, the compound word 'so-called' would likely be counted as two words, whereas 'topsy-turvy' would likely be counted as one. But realistically one wouldn't — would one?
|
Sorry, I would count a hyphenated compound word as one word. Anything with hyphens connecting the parts is to me one word. That isn't the problem; the problem is working out what a hyphen is, and which other characters should not be treated as word delimiters.
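As a rough illustration of "working out what a hyphen is", the sketch below treats a small, assumed set of Unicode code points as joiners inside a word. The set `JOINERS` and the function `count_hyphen_aware` are hypothetical; neither the existing algorithm nor ICU is claimed to work this way.

```python
import re

# Assumption: these code points join parts into a single word.
# U+002D hyphen-minus, U+2010 hyphen, U+2011 non-breaking hyphen, U+00AD soft hyphen
JOINERS = "\u002d\u2010\u2011\u00ad"

# A word is a run of word characters, optionally continued by
# a single joiner followed by more word characters.
WORD = re.compile(r"\w+(?:[" + re.escape(JOINERS) + r"]\w+)*")


def count_hyphen_aware(text):
    """Count words, keeping hyphen-joined compounds as one word."""
    return len(WORD.findall(text))
```

Here 'topsy-turvy' and 'so-called' each count as one word, while an em-dash or a hyphen surrounded by spaces still separates words. Apostrophes are not handled, so contractions would need their own rule.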
Quote:
===============
An unrelated feature I'd like to see in Count Pages, is an option to use the format file with the latest file system modification date as the basis for counting. In my workflow that would avoid in-flight conversions to EPUB - because in 99% of cases, I Convert from non-EPUB to EPUB immediately prior to running Count Pages.
NB: EPUB is not even close to being near the top of my preferred input format list, although it is my designated output format. I rarely need to convert from EPUB, when I do it's unlikely I would then run Count Pages. I would typically attach the output format file to an email, send it, and then remove the format from the library.
|
I hadn't looked at that part of the code before, but when choosing which format to use, your preferred input format is used. So, if the EPUB is always being counted, it must be above the other formats in your calibre preferences. Looking for the most recently changed format is possible, but it's far enough outside the existing code that I can't say that I'm interested in adding it.
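The two selection strategies can be sketched as follows. This is an illustration under assumptions, not the plugin's code: `formats` is a hypothetical dict mapping a format name to its file path, and `preferred_order` stands in for the preferred input format list from the calibre preferences.

```python
import os


def pick_by_preference(formats, preferred_order):
    """Return the first format in preferred_order that the book has."""
    for fmt in preferred_order:
        if fmt in formats:
            return fmt
    return None


def pick_most_recent(formats):
    """Return the format whose file has the newest modification time."""
    if not formats:
        return None
    return max(formats, key=lambda fmt: os.path.getmtime(formats[fmt]))
```

The first function mirrors the behaviour described above, where the position in the preference list decides. The second is the requested alternative, which ignores the list and relies only on the file system modification time.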