Why does Chinese need word segmentation?

Chinese is written without spaces between words, so software cannot simply split on whitespace. A single string of characters must be broken into words using a dictionary or statistical model, a step known as word segmentation, before words can be counted.

How does the segmentation work?

The tool uses forward maximum matching. Starting at the current position, it looks for the longest sequence of characters that exists in its built-in dictionary, takes that as one word, and continues from the end of that word. This greedy, dictionary-only method is simpler than segmenters such as Jieba, which also start from a word dictionary but choose among candidate splits using word frequencies.
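The matching loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-word `DICTIONARY` and the function name `forward_max_match` are hypothetical, and the tool's real word list and code are not shown here.

```python
# Hypothetical mini-dictionary; the tool's actual word list is much larger.
DICTIONARY = {"我们", "喜欢", "学习", "中文"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def forward_max_match(text: str) -> list[str]:
    """Greedy left-to-right segmentation: longest dictionary word wins."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
        else:
            # No multi-character match: the lone character counts as a word.
            words.append(text[i])
            i += 1
    return words


segments = forward_max_match("我们都喜欢学习中文")
print(segments, len(segments))
# ['我们', '都', '喜欢', '学习', '中文'] 5
```

The character 都 is not in the sample dictionary, so it falls through to the single-character branch and is still counted as one word.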

What happens to words not in the dictionary?

Any Han character that is not part of a recognised dictionary word is counted as a single-character word. That keeps the total reasonable for arbitrary text, though a name or rare term missing from the dictionary will be split into individual characters and counted as several words rather than one.

Does it count English words and numbers?

Yes. A run of Latin letters or digits is grouped into a single token and counted as one word, while pure punctuation is ignored. That gives a sensible count for mixed Chinese and English text.

Is the dictionary comprehensive?

The built-in dictionary covers common everyday words and is bundled with the tool, so the counter works entirely offline and your text stays private. It is not exhaustive, so for specialised or literary text the count is an approximation. Review the displayed segmentation to judge accuracy for your content.

Chinese Simplified Word Counter

Because Simplified Chinese is written with no spaces between words, counting words means first deciding where one word ends and the next begins. This tool performs that segmentation in the browser and then counts the resulting words, showing you exactly how the text was split.

How it works

The counter uses forward maximum matching, a greedy dictionary-based approach. Starting from the current position in the text, it searches for the longest character sequence that appears in its built-in dictionary and takes that as a single word. If no multi-character word matches, it falls back to counting the lone character as a word. Segmenters such as Jieba also start from a dictionary, but they weigh alternative splits by word frequency instead of always taking the longest match.

Runs of Latin letters or digits are grouped into one token each, and standalone punctuation is skipped. The result is a list of word segments and their total count.

Example and notes

A common everyday sentence will typically segment into recognisable multi-character words, plus single characters wherever the dictionary has no matching word. Inspect the highlighted chips to see how ambiguous sequences were split; greedy matching occasionally prefers a longer word where a human might choose two shorter ones. The dictionary is bundled, so segmentation works offline and your text never leaves the browser.
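A well-known textbook case shows the kind of ambiguity mentioned above. The sketch below is illustrative only: the dictionary is a hypothetical four-word set picked to trigger the problem, and `longest_match_first` is an assumed name, not the tool's function.

```python
# Hypothetical dictionary chosen so that a longer word overlaps two shorter ones.
DICTIONARY = {"研究", "研究生", "生命", "起源"}


def longest_match_first(text: str, dictionary: set[str]) -> list[str]:
    """Forward maximum matching with a single-character fallback."""
    longest = max(len(word) for word in dictionary)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or piece in dictionary:
                out.append(piece)
                i += size
                break
    return out


print(longest_match_first("研究生命起源", DICTIONARY))
# ['研究生', '命', '起源']
```

A human reader would split the phrase as 研究 / 生命 / 起源 ("research the origin of life"), but the greedy pass takes 研究生 ("graduate student") first and leaves 命 stranded as a single character. Both readings give three words here, yet the segments differ, which is why checking the displayed split matters.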