Word count feature - Page 2

KevinH · Yesterday, 05:29 PM

For the record, this is what Google's AI recommends:

Quote:

Use Python or Coding Libraries (For Developers)

If you are processing large datasets, you must use a segmentation library. Basic tools will just count each character.

Jieba (结巴): The most popular Python library for Chinese text. It breaks sentences into meaningful words (词) using advanced algorithms.

Code:

import jieba
text = "我爱北京天安门"
words = jieba.lcut(text)
print(len(words)) # Will count correctly as words, not 7 individual characters

The Character vs. Word Dilemma

Professional translators and editors rely on character counts rather than word counts. In Chinese, single characters can be words, but many words are two or three characters long. Generally, a 1,000-character Chinese text translates to about 650 to 750 English words.

I assume a regex could count Chinese characters and the user could scale that result.

KevinH · Yesterday, 05:35 PM

FWIW, here is the github repo:

https://github.com/fxsjy/jieba

But as far as I can tell, it has not been touched in over 6 years and there are hundreds of bug reports.

Given how it is not maintained, I think I would try regex and set the Unicode Property flag and check the Text checkbox, then try one of these in Find:

\p{Script=Han}

or

\p{Han}

Then make sure you set the target to All HTML, then hit the count button and see what it reports.
I would report the exact character count then multiply it by 0.65 to get an estimated word count.
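
For anyone who wants to try the same idea outside Sigil, here is a minimal Python sketch. It is only an approximation: the standard-library `re` module does not understand `\p{Han}`, so the explicit code point ranges below are my stand-in and they miss some Han-script code points (radicals, 々, 〇 and the newest extensions). The function names are made up for illustration, and the 0.65 ratio is the rough scaling factor suggested above, not a measured value.

```python
import re

# Stand-in for \p{Han}: explicit ranges, since stdlib `re` has no \p{...} support.
HAN_PATTERN = re.compile(
    "["
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0002fa1f"  # Extension B through the compatibility supplement
    "]"
)


def count_han(text: str) -> int:
    """Count characters that fall inside the ranges above."""
    return len(HAN_PATTERN.findall(text))


def estimate_words(char_count: int, ratio: float = 0.65) -> int:
    """Scale a character count into a rough English-equivalent word count."""
    return round(char_count * ratio)


sample = "我爱北京天安门"
print(count_han(sample))     # 7 characters
print(estimate_words(1000))  # 650 with the default ratio
```

Passing `ratio=0.75` gives the upper end of the 650 to 750 range quoted earlier in the thread.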

DNSB · Yesterday, 05:58 PM

The repository one acquaintance of mine likes is bukun's fork:

https://github.com/bukun/jieba-py

However he has the advantage of being able to read the documentation.

KevinH · Yesterday, 06:12 PM

Quote:

Originally Posted by DNSB

The repository one acquaintance of mine likes is bukun's fork:

https://github.com/bukun/jieba-py

However he has the advantage of being able to read the documentation.

Well at least that fork is being actively maintained!

So a Sigil plugin to do this is then theoretically possible.
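
A rough sketch of what the core of such a plugin might look like, counting characters rather than segmenting words. It assumes the usual plugin entry point `run(bk)` and the container calls `bk.text_iter()` and `bk.readfile()`, so check those against the plugin framework guide before relying on them. `TextExtractor` and `count_han_in_xhtml` are hypothetical helper names, the character ranges are a simplified stand-in for `\p{Han}`, and a real plugin would also need its `plugin.xml`, which is not shown.

```python
import re
from html.parser import HTMLParser

# Simplified stand-in for \p{Han} (BMP ranges only).
HAN_PATTERN = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_han_in_xhtml(markup: str) -> int:
    """Count Han characters in the text content of one XHTML file."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return len(HAN_PATTERN.findall("".join(parser.chunks)))


def run(bk):
    """Assumed Sigil entry point: report a per-file and a total count."""
    total = 0
    for manifest_id, href in bk.text_iter():
        data = bk.readfile(manifest_id)
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        count = count_han_in_xhtml(data)
        print(f"{href}: {count}")
        total += count
    print(f"Total Han characters: {total}")
    return 0
```

Note that this counts everything outside script and style, including the `<title>` text, so the total may differ slightly from what Find reports. Swapping the counting step for a segmentation library would be where word counts come in.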

icearch · Yesterday, 10:01 PM

Yes, it is really hard, even impossible, to count "words" in a given Chinese text, so usually we just count the number of individual characters.

[Attached image: 01.png]

Like this in MS Word: it says "149149 characters". For the same text, Sigil counts 2982.

As to how spelling and grammar work in Chinese, it's complicated.

In short, every character carries a certain meaning, so ancient Chinese tends not to have words with multiple characters.

When we needed new meanings, we just invented new characters, which soon proved to be dumb.

So middle and especially modern Chinese have compound words. For example, a "black horse" in modern Chinese would be "黑马", where 黑 is black and 马 is horse. But you can also say "骊", which is ancient Chinese for black horse; a new character was invented just to describe a black horse.

But 骊 can also be used in "骊山", which is a mountain, or in "探骊得珠", where it means to get the pearl from under the chin of the black dragon, that is, to go through a risky adventure and get huge rewards.

So it's impossible to count "words" in Chinese, as it's impossible to identify how the author uses certain characters.

Using Calibre does not work well either. It does count compound words, but it makes a lot of mistakes; roughly half of them are wrong.

So I just need to count characters, not words.

As to the title of this thread, sorry, I chose the wrong word.

KevinH · Yesterday, 10:18 PM

Did you try Sigil's find and replace in regex mode with the Text box checked and Unicode property enabled? That should properly count characters.

Try one or the other as the Find:

\p{Script=Han}

or

\p{Han}

Make sure All HTML files are selected as the target set and hit the # button to get a count.

icearch · Yesterday, 11:00 PM

That works well enough for my purpose, thank you.

I use \p{Han}|\p{P} so that I can also count punctuation, which is traditionally included in character counts in Chinese.
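
The same Han-plus-punctuation count can be sketched in Python. This is an approximation: stdlib `re` has no `\p{...}` support, so Han is matched with explicit BMP ranges (my simplification), and `\p{P}` is emulated by checking that the Unicode general category starts with "P". The function names are hypothetical.

```python
import re
import unicodedata

# Simplified stand-in for \p{Han} (BMP ranges only).
HAN_PATTERN = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def is_punctuation(ch: str) -> bool:
    """Emulate \\p{P}: categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with 'P'."""
    return unicodedata.category(ch).startswith("P")


def count_han_and_punct(text: str) -> int:
    """Count characters that are either Han (per the ranges above) or punctuation."""
    return sum(1 for ch in text if HAN_PATTERN.match(ch) or is_punctuation(ch))


sample = "我爱北京，天安门。"
print(count_han_and_punct(sample))  # 7 Han characters + 2 punctuation marks = 9
```

Like `\p{P}`, this also counts Western punctuation such as commas and quotation marks, not just full-width Chinese marks.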
