Which characters does the tool count?

It keeps only Han characters: the CJK Unified Ideographs block (U+4E00 to U+9FFF), CJK Extension A (U+3400 to U+4DBF), and CJK Compatibility Ideographs (U+F900 to U+FAFF). Latin letters, ASCII digits, and punctuation, including Chinese full-width punctuation, are excluded so that only hanzi are tallied.

How is unique-character count different from total characters?

Total character count includes every repetition, while unique count is the size of the distinct character set. A 1,000-character article might use only 400 distinct hanzi. The ratio tells you how lexically varied the text is.

Why is this useful for learners?

Reading material is built from a surprisingly small set of high-frequency characters. Ranking the hanzi in something you want to read shows which characters to learn first, since the top few hundred cover the large majority of running text.

What is the code point for?

The Unicode code point uniquely identifies each character regardless of font rendering. It is useful for copying the character into source code, looking it up in a dictionary, or telling apart visually similar glyphs.

Is my text uploaded anywhere?

No. All extraction, counting, and ranking happen in your browser. The text never leaves your device.

What is the Chinese Unique Character Counter?

The Chinese Unique Character Counter extracts every unique CJK character from Chinese text and lists them ranked by frequency, each with its Unicode code point. It ignores Latin letters, digits, and punctuation, which makes it useful for vocabulary analysis. It runs free in your browser on Gera Tools, with nothing uploaded.

Chinese Unique Character Counter

Name: Chinese Unique Character Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Chinese text draws on a large character set, but any given passage uses only a fraction of it, often with a steep frequency curve. This tool extracts every unique Chinese character, ignores the Latin letters, digits, and punctuation around it, and ranks the characters by how often they appear.

How it works

The tool walks the text character by character and keeps only those whose Unicode code point lies in a Han (CJK) block:

U+3400 – U+4DBF   CJK Extension A
U+4E00 – U+9FFF   CJK Unified Ideographs (the common hanzi)
U+F900 – U+FAFF   CJK Compatibility Ideographs

Everything else — Latin letters, digits, ASCII and full-width punctuation — is skipped. The surviving characters are tallied, sorted by count, and each is shown with its U+XXXX code point and its share of all Chinese characters in the text.
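The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual source: the names HAN_RANGES, is_han, and rank_hanzi are assumptions made for this example, and tie-breaking between characters with equal counts (first seen wins here) may differ from what the tool does.

```python
from collections import Counter
from typing import List, Tuple

# Inclusive code point bounds of the three Han blocks listed above.
HAN_RANGES = (
    (0x3400, 0x4DBF),  # CJK Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_han(ch: str) -> bool:
    """Return True if the character's code point lies in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_RANGES)


def rank_hanzi(text: str) -> List[Tuple[str, int, str, float]]:
    """Return (character, count, code point, share) rows, most frequent first."""
    counts = Counter(ch for ch in text if is_han(ch))
    total = sum(counts.values())
    return [
        (ch, n, f"U+{ord(ch):04X}", n / total)
        for ch, n in counts.most_common()
    ]


for ch, n, cp, share in rank_hanzi("我爱中文，中文 is fun! 123"):
    print(f"{ch}  {n}  {cp}  {share:.1%}")
```

The share is computed against the number of Han characters only, so the Latin letters, digits, and full-width comma in the sample string do not affect the percentages.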

What the unique-to-total ratio tells you

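The most informative single number the tool produces is the ratio of unique characters to total characters. The ratio falls as a text gets longer, because common characters keep recurring, so compare only texts of similar length. As rough guides: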

Low ratio (around 30%): a small set of characters repeats heavily. Common in simple instructional text, children’s books, and other texts with very plain vocabulary. Usually easier for learners to read.
Moderate ratio (40–60%): typical of everyday news articles, popular fiction, and general non-fiction.
High ratio (70% and above): each character tends to appear only once or twice, which suggests complex vocabulary, technical writing, or classical Chinese. In a long text this is a strong signal of difficulty.

The phrase 我爱学习中文 gives a ratio of 100% because all 6 characters are distinct. A 1,000-character news article might use only 400 distinct characters (40%) because high-frequency particles and common verbs recur constantly.
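The ratio itself is simple arithmetic: distinct Han characters divided by all Han characters. A minimal sketch follows, assuming the three code point ranges from "How it works"; the function names are hypothetical and chosen for this example only.

```python
def is_han(ch: str) -> bool:
    """True for the three Han ranges described in 'How it works'."""
    cp = ord(ch)
    return (0x3400 <= cp <= 0x4DBF
            or 0x4E00 <= cp <= 0x9FFF
            or 0xF900 <= cp <= 0xFAFF)


def unique_ratio(text: str) -> float:
    """Unique Han characters divided by total Han characters (0.0 if none)."""
    hanzi = [ch for ch in text if is_han(ch)]
    if not hanzi:
        return 0.0  # avoid dividing by zero on text with no hanzi
    return len(set(hanzi)) / len(hanzi)


print(unique_ratio("我爱学习中文"))        # 6 distinct out of 6 -> 1.0
print(unique_ratio("我爱中文，我学中文"))  # 5 distinct out of 8 -> 0.625
```

The second call shows how quickly repetition pulls the ratio down: repeating just three characters in an eight-character sentence already drops it to 62.5%.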

Using the frequency list for study

The frequency-ranked output is directly useful for learners who want to build vocabulary coverage in a specific text or topic area:

Sort by frequency (descending) — the characters at the top are the ones that will immediately improve your reading comprehension of this text if you learn them.
Cross-reference with your existing vocabulary — characters you already know can be ticked off, leaving a gap list of what to study.
Focus on high-frequency unknowns — a character that appears 15 times in a text gives you more comprehension gain per hour of study than one that appears once.
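The cross-referencing step can be automated once you have your known characters as a set. The sketch below is one possible approach, not a feature of the tool: study_gap and the known set are hypothetical names, and the Han ranges are the ones listed under "How it works".

```python
from collections import Counter
from typing import List, Set, Tuple


def is_han(ch: str) -> bool:
    """True for the three Han ranges described in 'How it works'."""
    cp = ord(ch)
    return (0x3400 <= cp <= 0x4DBF
            or 0x4E00 <= cp <= 0x9FFF
            or 0xF900 <= cp <= 0xFAFF)


def study_gap(text: str, known: Set[str]) -> List[Tuple[str, int]]:
    """Characters in the text that are not yet known, most frequent first."""
    counts = Counter(ch for ch in text if is_han(ch))
    return [(ch, n) for ch, n in counts.most_common() if ch not in known]


known = set("我中")  # characters the learner already knows (example data)
for ch, n in study_gap("我爱中文，中文很好", known):
    print(ch, n)
```

Because the result keeps the frequency order, studying from the top of the list targets the unknown characters that recur most in that text.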

For a broader view than a single text can give, use a corpus-based character frequency list, or a graded list such as the HSK vocabulary levels, to see which characters are common across written Chinese in general.

Unicode code points in the output

Each character is shown with its U+XXXX code point for two reasons. First, some characters are visually almost identical in certain fonts — the code point removes any ambiguity. Second, if you are working with the characters in code (writing a regex, building a lookup table, or inserting them into a database), the code point is the unambiguous identifier you need.
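Both uses can be illustrated briefly in Python. This is a sketch under the assumption that the output uses the usual U+XXXX form with at least four uppercase hex digits; code_point, from_code_point, and HAN_PATTERN are names invented for this example.

```python
import re


def code_point(ch: str) -> str:
    """Format a single character as U+XXXX (uppercase hex, at least 4 digits)."""
    return f"U+{ord(ch):04X}"


def from_code_point(label: str) -> str:
    """Turn a label such as 'U+4E2D' back into the character."""
    return chr(int(label[2:], 16))


# Three characters that look very similar in many fonts
# but sit at different code points.
for ch in "己已巳":
    print(ch, code_point(ch))

# A regex character class built from the same three Han ranges
# that the tool keeps.
HAN_PATTERN = re.compile("[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]")
print(HAN_PATTERN.findall("abc中文123"))
```

Storing or comparing the code point rather than a rendered glyph avoids mix-ups when a font draws two different characters almost identically.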

Example

Paste the opening lines of a Simplified Chinese news article and you will typically see:

Total: ~200 characters
Unique: ~100–130 characters
Top 5: usually includes characters such as 的、是、在、了、中. These appear in virtually every piece of Mandarin prose and tend to dominate the top of a frequency ranking.