Chinese text draws on a large character set, but any given passage uses only a fraction of it, often with a steep frequency curve. This tool extracts every unique Chinese character, ignores the Latin letters, digits, and punctuation around them, and ranks the characters by how often they appear.
How it works
The tool walks the text character by character and keeps only those whose Unicode code point lies in a Han (CJK) block:
U+3400 – U+4DBF CJK Extension A
U+4E00 – U+9FFF CJK Unified Ideographs (the common hanzi)
U+F900 – U+FAFF CJK Compatibility Ideographs
Everything else (Latin letters, digits, ASCII and full-width punctuation) is
skipped. The surviving characters are tallied and sorted by count, most frequent
first, and each is shown with its U+XXXX code point and its share of all Chinese
characters in the text.
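The filter-and-tally steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the names HAN_RANGES, is_han, and rank_han are hypothetical, and the ordering of characters with equal counts (first appearance wins) is an assumption, since the description does not say how ties are broken.

```python
from collections import Counter

# The three ranges listed above. The constant and function names are
# illustrative; the tool's real implementation may differ.
HAN_RANGES = (
    (0x3400, 0x4DBF),  # CJK Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_han(ch):
    """Return True if the character's code point falls in a listed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_RANGES)


def rank_han(text):
    """Tally Han characters and return rows of (char, code point, count, share).

    Rows are sorted by count, highest first. Ties keep first-appearance
    order, which is an assumption of this sketch.
    """
    counts = Counter(ch for ch in text if is_han(ch))
    total = sum(counts.values())
    return [
        (ch, f"U+{ord(ch):04X}", n, n / total)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    for ch, code, n, share in rank_han("我爱中文，I love 中文!"):
        print(ch, code, n, f"{share:.1%}")
```

Keeping the ranges in one tuple makes it easy to add or drop a block without touching the counting logic.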
Example and tips
The phrase 我爱学习中文 contains 6 characters, all distinct, so the unique count
equals the total. In a longer document the gap widens fast: a few characters such
as 的, 是, 不, and 我 recur constantly while many others appear only once or twice.
Watch the unique-to-total ratio: a low ratio means the text reuses a small set of
characters, which usually makes it easier to read.
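The unique-to-total ratio is simple to compute by hand or in code. The sketch below assumes the same three Han ranges the tool filters on. The function name unique_to_total and the choice to return 0.0 for text with no Chinese characters are illustrative assumptions, not documented behaviour.

```python
def unique_to_total(text):
    """Return distinct Han characters divided by all Han characters.

    Uses the three ranges listed earlier. Returns 0.0 when the text has
    no Han characters (an assumption made here to avoid dividing by zero).
    """
    han = [
        ch for ch in text
        if 0x3400 <= ord(ch) <= 0x4DBF
        or 0x4E00 <= ord(ch) <= 0x9FFF
        or 0xF900 <= ord(ch) <= 0xFAFF
    ]
    if not han:
        return 0.0
    return len(set(han)) / len(han)


if __name__ == "__main__":
    print(unique_to_total("我爱学习中文"))  # 6 distinct out of 6 -> 1.0
    print(unique_to_total("我爱我"))        # 2 distinct out of 3
```

A ratio of 1.0 means no character repeats, as in the six-character phrase. Longer texts fall well below that as common characters recur.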