Old Yesterday, 05:29 PM   #16
KevinH
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
For the record, this is what Google's AI recommends:

Quote:
Use Python or Coding Libraries (For Developers)

If you are processing large datasets, you must use a segmentation library. Basic tools will just count each character.

Jieba (结巴): The most popular Python library for Chinese text. It breaks sentences into meaningful words (词) using advanced algorithms.

Code:
import jieba
text = "我爱北京天安门"
words = jieba.lcut(text)
print(len(words)) # Will count correctly as words, not 7 individual characters
The Character vs. Word Dilemma

Professional translators and editors rely on character counts rather than word counts. In Chinese, single characters can be words, but many words are two or three characters long. Generally, a 1,000-character Chinese text translates to about 650 to 750 English words.

I assume a regex could count Chinese characters and the user could scale that result.
Old Yesterday, 05:35 PM   #17
KevinH
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
FWIW, here is the github repo:

https://github.com/fxsjy/jieba

But as far as I can tell, it has not been touched in over 6 years and there are hundreds of bug reports.

Given that it is not maintained, I would instead try Sigil's Find in regex mode, with the Unicode Property flag set and the Text checkbox checked, then try one of these in Find:

\p{Script=Han}

or

\p{Han}

Then make sure you set the target to All HTML files, then hit the count button and see what it reports.
I would report the exact character count then multiply it by 0.65 to get an estimated word count.
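
For anyone who wants to do the same count-and-scale outside Sigil, here is a rough Python sketch. It is an approximation, not what Sigil does internally: the standard `re` module has no `\p{Han}`, so a few major CJK ideograph ranges stand in for it (an assumption, and it will miss some rarer Han code points). The function names and the default 0.65 ratio are just illustrative.

```python
import re

# Assumption: these code point ranges approximate \p{Han}. The stdlib `re`
# module has no Unicode script properties, so this is not an exact match.
HAN_RE = re.compile(
    "[\u3400-\u4DBF"          # CJK Unified Ideographs Extension A
    "\u4E00-\u9FFF"           # CJK Unified Ideographs
    "\uF900-\uFAFF"           # CJK Compatibility Ideographs
    "\U00020000-\U0002A6DF]"  # CJK Unified Ideographs Extension B
)


def count_han(text: str) -> int:
    """Count characters that fall inside the approximated Han ranges."""
    return len(HAN_RE.findall(text))


def estimate_english_words(han_count: int, ratio: float = 0.65) -> int:
    """Scale a Han character count to a rough English word estimate."""
    return round(han_count * ratio)


sample = "我爱北京天安门"
n = count_han(sample)
print(n, estimate_english_words(n))
```

The ratio is a parameter because the quoted range is 650 to 750 English words per 1,000 characters, so 0.65 is only the low end.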

Last edited by KevinH; Yesterday at 05:50 PM.
Old Yesterday, 05:58 PM   #18
DNSB
Bibliophagist
Posts: 52,624
Karma: 180945222
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
The repository one acquaintance of mine likes is bukun's fork:

https://github.com/bukun/jieba-py

However he has the advantage of being able to read the documentation.
Old Yesterday, 06:12 PM   #19
KevinH
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
Quote:
Originally Posted by DNSB View Post
The repository one acquaintance of mine likes is bukun's fork:

https://github.com/bukun/jieba-py

However he has the advantage of being able to read the documentation.
Well at least that fork is being actively maintained!

So a Sigil plugin to do this is then theoretically possible.
Old Yesterday, 10:01 PM   #20
icearch
Groupie
Posts: 165
Karma: 2000
Join Date: Nov 2025
Device: none
Yes, it is really hard, even impossible, to count "words" in a given Chinese text, so we usually just count the number of individual characters.

[Attached image: 01.png]

Like this in MS Word: it says "149149 characters", while Sigil counts 2982 for the same text.

As to how spelling and grammar work in Chinese, it's complicated.

In short, every character carries a certain meaning, so ancient Chinese tends not to have words with multiple characters.

When we needed new meanings, we just invented new characters, which soon proved to be a bad idea.

So middle and especially modern Chinese have compound words. For example, a "black horse" in modern Chinese would be "黑马", where 黑 is black and 马 is horse. But you can also say "骊", which is the ancient Chinese word for a black horse; a new character was invented to describe it.

But 骊 can also be used in "骊山", which is a mountain, or in "探骊得珠", where it means getting the pearl from under the chin of a black dragon, that is, going through a risky adventure and getting a huge reward.

So it's impossible to count "words" in Chinese, as it's impossible to identify how the author uses certain characters.

Using Calibre does not work well either: it does count compound words, but it makes a lot of mistakes; about half of them are wrong.

So I just need to count characters, not words.

As to the title of this thread, sorry, I chose the wrong word.

Last edited by icearch; Yesterday at 10:06 PM.
Old Yesterday, 10:18 PM   #21
KevinH
Sigil Developer
Posts: 9,740
Karma: 6774572
Join Date: Nov 2009
Device: many
Did you try Sigil's find and replace in regex mode with the Text box checked and Unicode property enabled? That should properly count characters.

Try one or the other as the Find:

\p{Script=Han}

or

\p{Han}

Make sure All HTML files are selected as the target set and hit the # button to get a count.
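
As a cross-check outside Sigil, the same idea can be sketched in Python: pull only the text out of each HTML file, then count Han characters across all of them. This assumes the Text checkbox restricts matching to text outside of tags (my reading of the option, not verified against Sigil's implementation). The file names and contents below are made-up stand-ins, and the ranges only approximate `\p{Han}`.

```python
import re
from html.parser import HTMLParser

# Assumption: these ranges approximate \p{Han}; stdlib `re` lacks script properties.
HAN_RE = re.compile("[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]")


class TextCollector(HTMLParser):
    """Collects character data only, so tag names and attributes are skipped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def count_han_in_html(markup: str) -> int:
    """Count Han characters in the text content of one HTML document."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return len(HAN_RE.findall("".join(collector.parts)))


# Hypothetical stand-ins for the book's XHTML files.
files = {
    "ch1.xhtml": '<p title="标题">我爱<b>北京</b></p>',
    "ch2.xhtml": "<p>天安门 &#x9A6C;</p>",
}
total = sum(count_han_in_html(markup) for markup in files.values())
print(total)
```

Note that this simple collector also picks up text inside style and script elements, so it may overcount on files that contain Chinese in those.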

Last edited by KevinH; Yesterday at 10:21 PM.
Old Yesterday, 11:00 PM   #22
icearch
Groupie
Posts: 165
Karma: 2000
Join Date: Nov 2025
Device: none
That works well enough for my purpose, thank you.

I use \p{Han}|\p{P} so that I can count punctuation as well, which is traditionally included in a Chinese character count.
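
A rough Python equivalent of that combined pattern, in case it is useful outside Sigil. The punctuation half uses `unicodedata.category`, where any category starting with "P" corresponds to `\p{P}`. The Han half is only approximated with code point ranges (an assumption, since the standard `re` module does not support `\p{Han}`). The function names are illustrative.

```python
import re
import unicodedata

# Assumption: these ranges approximate \p{Han}; stdlib `re` lacks script properties.
HAN_RE = re.compile("[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]")


def is_punctuation(ch: str) -> bool:
    """True if the character's general category is one of the P* categories."""
    return unicodedata.category(ch).startswith("P")


def count_han_and_punct(text: str) -> int:
    """Count characters that are Han ideographs or punctuation."""
    return sum(1 for ch in text if HAN_RE.match(ch) or is_punctuation(ch))


sample = "你好，世界！"
print(count_han_and_punct(sample))
```

Like `\p{P}`, this also counts ASCII punctuation such as commas and periods in any embedded Latin text.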
All times are GMT -4. The time now is 08:19 AM.

