Why does Chinese text use 3 bytes per character?

Most Chinese characters live in the Unicode Basic Multilingual Plane above U+0800. UTF-8 encodes code points in that range as three bytes. ASCII letters and digits stay at one byte, so a mixed string costs somewhere in between.

How is this different from counting characters?

A character count tells you how many symbols there are; a byte count tells you how much storage or bandwidth they need. One Chinese character is one character but three bytes. Databases and protocols care about bytes, not glyphs.

Why does this matter for a VARCHAR column?

In some databases VARCHAR length limits are measured in bytes. A VARCHAR(255) byte-limited column holds only about 85 Chinese characters, since each takes three bytes. Counting bytes prevents truncation surprises.

How does the tool count bytes?

It uses the browser's built-in TextEncoder, which produces the same UTF-8 bytes your server and database would store, provided they also store text as UTF-8. This is the real encoding, not an estimate, so the figure matches production.

Do emoji or rare characters change the count?

Yes. Emoji and characters in supplementary planes (above U+FFFF) take four bytes each in UTF-8. The breakdown table shows these separately so you are not caught out by a stray emoji.

What is the Chinese UTF-8 Byte Counter?

It reports the exact UTF-8 byte length of Chinese text by actually encoding it rather than estimating, since most CJK characters cost 3 bytes each. This is useful when sizing VARCHAR columns, JSON payloads, and SMS segments. It runs free in your browser on Gera Tools, with nothing uploaded.

Chinese UTF-8 Byte Counter

Name: Chinese UTF-8 Byte Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

When you store or transmit Chinese text you usually care about bytes, not characters — because database column limits, SMS segments, and protocol size caps are measured in bytes. This tool encodes your text as real UTF-8 and shows exactly how many bytes it occupies.

How it works

UTF-8 is a variable-width encoding. The number of bytes a character takes depends on its Unicode code point:

U+0000 – U+007F   → 1 byte   (ASCII: A-Z, 0-9, punctuation)
U+0080 – U+07FF   → 2 bytes  (Latin accents, Greek, Cyrillic)
U+0800 – U+FFFF   → 3 bytes  (most CJK: Chinese, Japanese, Korean)
U+10000 and above → 4 bytes  (emoji, rare CJK extensions)

The tool runs your text through TextEncoder, the browser’s standards-compliant UTF-8 encoder, so the byte total is identical to what a server or database would record. It then groups the bytes by sequence length to show where the weight is.
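
The grouping step can be sketched in a few lines. This is an illustrative Python version, not the tool's own source: it uses Python's str.encode as a stand-in for TextEncoder, and the helper names utf8_width and byte_breakdown are hypothetical.

```python
from collections import Counter


def utf8_width(ch: str) -> int:
    """Bytes UTF-8 needs for one code point, following the range table above."""
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def byte_breakdown(text: str) -> dict:
    """Map sequence length (1-4) to the total bytes contributed by that group."""
    totals = Counter()
    for ch in text:
        width = utf8_width(ch)
        totals[width] += width
    return dict(totals)


sample = "Hi 你好世界😀"
breakdown = byte_breakdown(sample)
total = len(sample.encode("utf-8"))  # the real encoder, used as a cross-check

print(breakdown)  # {1: 3, 3: 12, 4: 4}
print(total)      # 19
```

The sum of the groups should always equal the length of the encoded bytes; if it does not, the input contains something that is not valid text, such as a lone surrogate.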

Worked example

The four-character phrase 你好世界 (“hello world”) is 4 characters but 12 bytes, because each Han character costs 3 bytes. Add the English word Hi and a space in front (Hi 你好世界) and you get 7 characters but 15 bytes: 3 ASCII bytes plus 12 Chinese bytes.

For a quick mental model: multiply your Chinese character count by 3, add 1 per ASCII character, and you get a close estimate. The breakdown table shows the exact split so you can verify this for any input.
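
To see where the mental model holds and where it drifts, here is a small Python sketch that compares the estimate with the real encoded length. The function names estimate_bytes and exact_bytes are hypothetical, and the estimate deliberately treats every non-ASCII character as 3 bytes, as the rule of thumb does.

```python
def estimate_bytes(text: str) -> int:
    """Rule of thumb: 1 byte per ASCII character, 3 bytes for everything else."""
    return sum(1 if ord(ch) < 0x80 else 3 for ch in text)


def exact_bytes(text: str) -> int:
    """Length of the text once it is really encoded as UTF-8."""
    return len(text.encode("utf-8"))


for text in ["你好世界", "Hi 你好世界", "café", "你好😀"]:
    print(text, estimate_bytes(text), exact_bytes(text))
```

The estimate is exact for the two examples above (12 and 15 bytes). It overshoots by one for each 2-byte character such as é, and undershoots by one for each 4-byte character such as an emoji.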

Where byte counts catch developers out

VARCHAR column sizing

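Whether a VARCHAR length is measured in characters or bytes depends on the database and the column definition. In MySQL and MariaDB, VARCHAR(255) counts characters, so a utf8mb4 column (the correct character set for Chinese and emoji) holds 255 Chinese characters, which is fine. The older utf8 character set (utf8mb3) stores at most 3 bytes per character, so it handles common Chinese characters but cannot hold emoji or other 4-byte characters. Byte-based limits still show up elsewhere: in databases or column definitions that use byte semantics, in row-size and index-key limits, and in some frameworks' validation rules. A 255-byte cap holds only about 85 Chinese characters. The byte counter lets you verify your actual payload before you hit a truncation error in production.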
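
If you have to enforce a byte cap in application code, cutting the encoded bytes at an arbitrary position can split a character in half. The following Python sketch shows one way to trim text to a byte limit on a character boundary; truncate_utf8 is a hypothetical helper, and the 255-byte limit is just the example figure from above.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting may leave a partial multi-byte sequence at the end;
    # errors="ignore" drops those dangling bytes instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


long_text = "汉" * 100                      # 100 characters, 300 bytes
trimmed = truncate_utf8(long_text, 255)
print(len(trimmed))                         # 85
print(len(trimmed.encode("utf-8")))         # 255
```

Silent truncation is rarely what users want, so in practice a check like this is more often used to reject or warn about oversized input than to cut it.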

SMS segments

SMS messages in the GSM-7 encoding hold 160 characters per segment, but Chinese text requires UCS-2, which limits each segment to 70 characters (67 per segment once a message is split into several parts), regardless of the UTF-8 byte count. However, some SMS APIs that bill by “characters” actually bill by bytes in their underlying protocol. Use the byte count alongside the character count when estimating API costs for Chinese SMS campaigns.
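
As a rough illustration of the segment arithmetic, the Python sketch below counts UCS-2 segments for a message that is already known to need UCS-2. It assumes the usual 70-unit single segment and 67 units per part for concatenated messages; ucs2_segments is a hypothetical helper, and real gateways may differ slightly, for example by refusing to split an emoji across two parts.

```python
import math


def ucs2_segments(text: str) -> int:
    """Estimate how many SMS segments a UCS-2 message needs."""
    # Each UTF-16 code unit is 2 bytes; emoji and other supplementary
    # characters take two code units.
    units = len(text.encode("utf-16-be")) // 2
    if units == 0:
        return 0
    if units <= 70:
        return 1
    return math.ceil(units / 67)


print(ucs2_segments("你" * 70))   # 1
print(ucs2_segments("你" * 71))   # 2
print(ucs2_segments("你" * 135))  # 3
```

Note that this count is driven by UTF-16 code units, not by the UTF-8 byte total, which is why both figures are worth checking.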

HTTP and JSON field limits

Some HTTP APIs impose a byte cap on request or response bodies, headers, or individual fields. A field that “accepts up to 500 characters” sometimes means 500 bytes at the transport layer. Checking byte length before submission avoids confusing 400 errors that only trigger for Chinese-language inputs.
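
The size of a JSON body also depends on how the serializer writes non-ASCII text. As a sketch, Python's json.dumps escapes such characters as \uXXXX by default, which costs 6 bytes per Chinese character instead of 3; the payload and the helper name json_body_bytes below are made up for illustration.

```python
import json


def json_body_bytes(payload: dict, ensure_ascii: bool) -> int:
    """Size in bytes of the payload once serialized and encoded as UTF-8."""
    body = json.dumps(payload, ensure_ascii=ensure_ascii)
    return len(body.encode("utf-8"))


payload = {"name": "你好世界"}  # hypothetical field

raw_bytes = json_body_bytes(payload, ensure_ascii=False)     # characters sent as UTF-8
escaped_bytes = json_body_bytes(payload, ensure_ascii=True)  # characters sent as \uXXXX

print(raw_bytes)      # 24
print(escaped_bytes)  # 36
```

Either form decodes to the same string on the receiving side, so a limit applied to the request body and a limit applied to the decoded field can disagree about the same input.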

Redis key and value limits

Redis allows keys up to 512 MB, but many deployments impose much lower soft limits for performance reasons. If Chinese product names or user-generated content become Redis keys, the byte count tells you how large each key really is, before Redis adds its own per-key overhead.

Big5 vs UTF-8

If you are migrating legacy data from a Big5-encoded system (common for Traditional Chinese from Taiwan) to a UTF-8 database, note that Big5 stores most characters in 2 bytes while UTF-8 uses 3. Column widths that were sized for Big5 may need to increase by up to 50% to accommodate the same text in UTF-8. This counter helps you size the new columns before running the migration.
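
The size difference is easy to check for a sample of your own data. This Python sketch uses the standard big5 codec to compare both encodings; encoded_sizes is a hypothetical helper, and text containing characters outside Big5 will raise UnicodeEncodeError rather than return a size.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (Big5 byte length, UTF-8 byte length) for the same text."""
    return len(text.encode("big5")), len(text.encode("utf-8"))


print(encoded_sizes("你好世界"))   # (8, 12)
print(encoded_sizes("Hi 你好"))    # (7, 9)
```

Pure Chinese text grows by 50%, while mixed text grows by less because ASCII stays at one byte in both encodings.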