#16
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
For the record, this is what Google's AI recommends:
Quote:
I assume a regex could count Chinese characters and the user could scale that result.
#17
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
FWIW, here is the GitHub repo:
https://github.com/fxsjy/jieba
But as far as I can tell, it has not been touched in over 6 years and there are hundreds of bug reports.
Given that it is not maintained, I think I would try regex instead: set the Unicode Property flag, check the Text checkbox, then try one of these in Find:
\p{Script=Han}
or
\p{Han}
Then make sure you set the target to All HTML, then hit the count button and see what it reports.
I would report the exact character count, then multiply it by 0.65 to get an estimated word count.
Last edited by KevinH; Yesterday at 05:50 PM.
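The count-then-scale idea above can also be sketched outside Sigil. This is only a rough illustration, not what Sigil does internally: Python's built-in `re` module has no `\p{Han}` class, so the hypothetical helper `is_han` approximates it by checking the Unicode character name, which covers CJK unified ideographs but not every code point in the Han script. The 0.65 factor is simply the heuristic quoted in the post.

```python
import unicodedata

# Scaling factor taken from the post above; a rough heuristic, not a measured value.
WORDS_PER_CHAR = 0.65


def is_han(ch: str) -> bool:
    # Approximation of the regex class \p{Han} (assumption): matches only
    # characters whose Unicode name starts with "CJK UNIFIED IDEOGRAPH".
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def count_han(text: str) -> int:
    # Exact count of (approximately) Han characters in the text.
    return sum(1 for ch in text if is_han(ch))


def estimate_words(text: str, factor: float = WORDS_PER_CHAR) -> int:
    # Estimated word count: character count scaled by the heuristic factor.
    return round(count_han(text) * factor)


if __name__ == "__main__":
    sample = "黑马是一匹黑色的马。"
    print("Han characters:", count_han(sample))
    print("Estimated words:", estimate_words(sample))
```

Reporting both numbers, as suggested, keeps the exact character count available even if the scaling factor turns out to be a poor fit for a particular text.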
#18
Bibliophagist
Posts: 52,624
Karma: 180945222
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
The repository one acquaintance of mine likes is bukun's fork:
https://github.com/bukun/jieba-py
However, he has the advantage of being able to read the documentation.
#19
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
Quote:
So a Sigil plugin to do this is then theoretically possible.
#20
Groupie
Posts: 165
Karma: 2000
Join Date: Nov 2025
Device: none
Yes, it is really hard, even impossible, to count "words" in a given Chinese text, so usually we just count the number of individual characters.
For example, MS Word reports "149149 characters" for a text, while Sigil counts 2982 for the same text.

As for how spelling and grammar work in Chinese, it's complicated. In short, every character carries a certain meaning, so ancient Chinese tends not to have words with multiple characters. When we needed new meanings, we just invented new characters, which soon proved to be dumb. So middle and especially modern Chinese have compound words. For example, a "black horse" in modern Chinese would be "黑马", where 黑 is black and 马 is horse. But you can also say "骊", which is ancient Chinese for black horse: a new character was invented just to describe a black horse. But 骊 can also be used in "骊山", which is a mountain, or in "探骊得珠", where it means to get the pearl from under the chin of a black dragon, that is, to go through a risky adventure and get huge rewards.

So it's impossible to count "words" in Chinese, as it's impossible to identify how the author uses certain characters. Using Calibre does not work well either: it does count compound words, but it makes a lot of mistakes, like half of them are wrong. So I just need to count characters, not words. As for the title of this thread, sorry, I chose the wrong word.
Last edited by icearch; Yesterday at 10:06 PM.
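A gap as large as the one between those two figures is what one would expect if a counter splits text on whitespace, since Chinese is written without spaces between words. The sketch below only illustrates that effect using the example words from this post; whether Sigil's word count actually works this way is an assumption, and both function names are hypothetical.

```python
def whitespace_word_count(text: str) -> int:
    # Counts runs of non-whitespace text, the way a naive word counter might.
    return len(text.split())


def non_space_char_count(text: str) -> int:
    # Counts every character that is not whitespace.
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    english = "a black horse runs fast"
    chinese = "黑马\n骊山 探骊得珠"  # example words from the post, separated only for layout

    for label, text in (("English", english), ("Chinese", chinese)):
        print(label,
              "whitespace 'words':", whitespace_word_count(text),
              "characters:", non_space_char_count(text))
```

For the English sample the whitespace count matches the real word count, while for the Chinese sample it only counts the pieces between spaces and line breaks, which says little about either words or characters.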
#21
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
Did you try Sigil's Find and Replace in regex mode with the Text box checked and Unicode property enabled? That should properly count characters.
Try one or the other as the Find:
\p{Script=Han}
or
\p{Han}
Make sure All HTML files are selected as the target set and hit the # button to get a count.
Last edited by KevinH; Yesterday at 10:21 PM.
#22
Groupie
Posts: 165
Karma: 2000
Join Date: Nov 2025
Device: none
That works well enough for my purpose, thank you.
I use \p{Han}|\p{P} so that I can count punctuation as well, which is traditionally included in character counts in Chinese.
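For anyone who wants the same count outside Sigil, here is a minimal sketch of the `\p{Han}|\p{P}` idea in plain Python. It is an approximation under stated assumptions: the hypothetical `is_han` helper only recognizes CJK unified ideographs by Unicode name, and `is_punct` treats any character whose general category starts with "P" as punctuation, which includes ASCII punctuation as well as Chinese marks.

```python
import unicodedata


def is_han(ch: str) -> bool:
    # Approximation of \p{Han} (assumption): CJK unified ideographs only.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def is_punct(ch: str) -> bool:
    # Unicode general categories Pc, Pd, Ps, Pe, Pi, Pf, Po all start with "P".
    return unicodedata.category(ch).startswith("P")


def count_han_and_punct(text: str) -> int:
    # Counts characters that are either Han ideographs or punctuation.
    return sum(1 for ch in text if is_han(ch) or is_punct(ch))


if __name__ == "__main__":
    sample = "黑马,骊山。Hello!"
    print("Han + punctuation:", count_han_and_punct(sample))
```

Latin letters, digits, and whitespace are ignored, so mixed-language text only contributes its punctuation to the total.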