Chinese UTF-8 Byte Counter

Count the exact UTF-8 byte cost of Chinese text for SMS or database fields

Reports the exact UTF-8 byte length of Chinese text using real encoding, since most Chinese characters cost 3 bytes each (and rare ones cost 4). Useful for sizing byte-limited VARCHAR columns, JSON payloads, and SMS-related message fields. Runs in your browser.

Why does Chinese text use 3 bytes per character?

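Most Chinese characters in everyday use live in the CJK Unified Ideographs block (U+4E00 – U+9FFF), which sits in the Unicode Basic Multilingual Plane at or above U+0800. UTF-8 encodes every code point from U+0800 to U+FFFF as three bytes. ASCII letters and digits stay at one byte, so a mixed string averages somewhere between one and three bytes per character.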

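When you store or transmit Chinese text you usually care about bytes, not characters, because many database column limits, payload limits, and protocol size caps are measured in bytes. (SMS is a partial exception: carriers typically send Chinese text as UCS-2 at 2 bytes per character, so the UTF-8 count matters mainly for the API or gateway payload that carries the message.) This tool encodes your text as real UTF-8 and shows exactly how many bytes it occupies.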

How it works

UTF-8 is a variable-width encoding. The number of bytes a character takes depends on its Unicode code point:

U+0000 – U+007F   → 1 byte   (ASCII: A-Z, 0-9, punctuation)
U+0080 – U+07FF   → 2 bytes  (Latin accents, Greek, Cyrillic)
U+0800 – U+FFFF   → 3 bytes  (most CJK: Chinese, Japanese, Korean)
U+10000 and above → 4 bytes  (emoji, rare CJK extensions)
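
The ranges above can be turned directly into a small function. This is a minimal sketch for illustration, not the tool's actual code; the names utf8BytesForCodePoint and utf8ByteLengthByRange are made up here.

```javascript
// Bytes UTF-8 needs for one Unicode code point, following the ranges above.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin accents, Greek, Cyrillic
  if (codePoint <= 0xffff) return 3; // most CJK
  return 4;                          // emoji, rare CJK extensions
}

// Sum the cost over a whole string. for...of walks the string by code point,
// so a character stored as a surrogate pair is visited once, not twice.
function utf8ByteLengthByRange(text) {
  let total = 0;
  for (const ch of text) {
    total += utf8BytesForCodePoint(ch.codePointAt(0));
  }
  return total;
}

console.log(utf8BytesForCodePoint(0x41));    // "A"  → 1
console.log(utf8BytesForCodePoint(0x4e2d));  // "中" → 3
console.log(utf8BytesForCodePoint(0x1f600)); // "😀" → 4
console.log(utf8ByteLengthByRange("你好世界")); // → 12
```

Counting by range like this is handy for understanding the rule; for real measurements, encoding the text is the safer choice.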

The tool runs your text through TextEncoder, the browser’s standards-compliant UTF-8 encoder, so the byte total matches what a server or database would record when it stores the same text as UTF-8. It then groups the bytes by sequence length (1, 2, 3, or 4 bytes per character) to show where the weight is.
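
The encode-then-group step might look roughly like the sketch below. Only TextEncoder comes from the description above; the function name utf8Breakdown, the shape of the returned object, and the lead-byte classification are assumptions about one way to do the grouping, not the tool's real implementation.

```javascript
// Encode text as UTF-8 and count how many characters fall into each
// sequence length, by looking at the lead byte of every sequence.
function utf8Breakdown(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  const characters = { 1: 0, 2: 0, 3: 0, 4: 0 };

  for (const b of bytes) {
    if (b < 0x80) characters[1] += 1;        // 0xxxxxxx: single-byte character
    else if (b >= 0xf0) characters[4] += 1;  // 11110xxx: starts a 4-byte sequence
    else if (b >= 0xe0) characters[3] += 1;  // 1110xxxx: starts a 3-byte sequence
    else if (b >= 0xc0) characters[2] += 1;  // 110xxxxx: starts a 2-byte sequence
    // 10xxxxxx is a continuation byte; its sequence was already counted.
  }

  // Bytes contributed by each group = number of characters × sequence length.
  const bytesByLength = {};
  for (const len of [1, 2, 3, 4]) {
    bytesByLength[len] = characters[len] * len;
  }

  return { totalBytes: bytes.length, characters, bytesByLength };
}

const report = utf8Breakdown("Hi 你好世界 😀");
console.log(report.totalBytes);       // → 20
console.log(report.bytesByLength[3]); // → 12 (four Han characters)
```

Note that TextEncoder replaces unpaired surrogates with U+FFFD, so malformed input still produces valid UTF-8, at 3 bytes per replacement.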

Example and tips

The four-character phrase 你好世界 (“hello world”) is 4 characters but 12 bytes, because each Han character costs 3 bytes. Put the English word Hi and a space in front (Hi 你好世界) and you get 7 characters but 15 bytes. If you are designing a byte-limited field, budget roughly three bytes per Chinese character and remember that a single emoji can quietly cost four bytes or more.
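
To check these numbers yourself, or to trim text so it fits a byte-limited field, something like the following sketch may help. The helper truncateToUtf8Bytes is hypothetical and written for this example; it cuts on code-point boundaries so it never leaves half a character behind, though it can still split a multi-code-point emoji sequence.

```javascript
const encoder = new TextEncoder();

console.log(encoder.encode("你好世界").length);    // → 12
console.log(encoder.encode("Hi 你好世界").length); // → 15

// Keep as many whole characters as fit in maxBytes of UTF-8.
function truncateToUtf8Bytes(text, maxBytes) {
  let used = 0;
  let result = "";
  for (const ch of text) {                  // iterate by code point
    const cost = encoder.encode(ch).length; // UTF-8 bytes for this character
    if (used + cost > maxBytes) break;      // the next character would overflow
    used += cost;
    result += ch;
  }
  return result;
}

// A 10-byte budget holds "Hi " (3 bytes) plus two Han characters (6 bytes).
console.log(truncateToUtf8Bytes("Hi 你好世界", 10)); // → "Hi 你好"
```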