Does Traditional Chinese cost more bytes than Simplified?

In UTF-8 both cost 3 bytes per Han character, so the per-character cost is the same. Differences come only from how many characters a phrase uses, not from the script being Traditional or Simplified.

How does UTF-8 compare with Big5?

Big5, the legacy Taiwan encoding, stores most Traditional characters in 2 bytes, while UTF-8 uses 3. UTF-8 is larger per character but covers all of Unicode and is the modern standard for web and databases. This tool measures UTF-8.

Why count bytes instead of characters?

Storage limits, SMS segment sizes, and protocol caps are measured in bytes. A field that holds 255 bytes fits at most 85 Traditional Chinese characters (255 ÷ 3), since each takes 3 bytes. Counting bytes avoids silent truncation.
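As a rough illustration of the 255-byte case, the sketch below trims text to a byte budget without splitting a character. The helper name truncateToUtf8Bytes is hypothetical and is not part of this tool; it only assumes the standard TextEncoder API.

```javascript
// Hypothetical helper: trim text so its UTF-8 encoding fits within maxBytes,
// cutting only on whole code points so no character is split in half.
function truncateToUtf8Bytes(text, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let result = "";
  for (const ch of text) {                  // for...of walks the string by code point
    const size = encoder.encode(ch).length; // 1 to 4 bytes per code point
    if (used + size > maxBytes) break;
    used += size;
    result += ch;
  }
  return result;
}

const longName = "臺".repeat(100);           // 100 Han characters = 300 bytes
const fitted = truncateToUtf8Bytes(longName, 255);
console.log([...fitted].length);                       // 85
console.log(new TextEncoder().encode(fitted).length);  // 255
```

Trimming on code point boundaries keeps the stored value valid UTF-8, although in most applications rejecting over-long input is safer than trimming it silently.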

How accurate is the byte count?

It uses the browser's TextEncoder, which produces standards-compliant UTF-8. These are the same bytes your server or database stores, provided it keeps the text as UTF-8 and does not normalize or transform it first, so the number matches production rather than approximating it.

What about rare Hong Kong characters?

Some Cantonese and Hong Kong Supplementary Character Set glyphs live in Unicode planes above U+FFFF and take 4 bytes in UTF-8. The breakdown table counts these in the 4-byte band so the total stays accurate.

What is the Traditional Chinese UTF-8 Byte Counter?

The Traditional Chinese UTF-8 Byte Counter counts the exact UTF-8 byte length of Traditional Chinese text (Taiwan/Hong Kong), where each Han character costs 3 bytes versus 1 for ASCII, so you can size VARCHAR columns and payloads precisely. It runs free in your browser on Gera Tools, with nothing uploaded.

Traditional Chinese UTF-8 Byte Counter

Name: Traditional Chinese UTF-8 Byte Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Traditional Chinese — used in Taiwan, Hong Kong, and Macau — costs the same three bytes per Han character in UTF-8 as Simplified Chinese. This tool encodes your text with the browser’s real UTF-8 encoder so you can size database columns, SMS segments, and JSON payloads with confidence.

How it works

UTF-8 encodes each character in one to four bytes depending on its Unicode code point:

U+0000 – U+007F   → 1 byte   (ASCII letters, digits)
U+0080 – U+07FF   → 2 bytes  (Latin accents, Greek, Cyrillic)
U+0800 – U+FFFF   → 3 bytes  (Traditional Han, CJK)
U+10000 and above → 4 bytes  (rare HK supplementary chars, emoji)

The tool feeds your text through TextEncoder for an exact byte total, then groups characters into bands so you can see where the weight comes from.
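The grouping step can be sketched roughly as follows. This is an illustrative approximation, not the tool's actual source: the function name utf8Breakdown and the shape of its return value are assumptions.

```javascript
// Hypothetical sketch of a per-band breakdown using the standard TextEncoder.
function utf8Breakdown(text) {
  const encoder = new TextEncoder();
  const bands = { 1: 0, 2: 0, 3: 0, 4: 0 };  // character count per UTF-8 width
  for (const ch of text) {                   // iterate by code point
    bands[encoder.encode(ch).length] += 1;
  }
  return { totalBytes: encoder.encode(text).length, bands };
}

// "Café 臺灣 😀" written with escapes so the code points are unambiguous.
const result = utf8Breakdown("Caf\u00e9 臺灣 \u{1F600}");
console.log(result.totalBytes); // 17
console.log(result.bands[1]);   // 5  (C, a, f and two spaces)
console.log(result.bands[2]);   // 1  (é)
console.log(result.bands[3]);   // 2  (臺, 灣)
console.log(result.bands[4]);   // 1  (emoji)
```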

Worked example

The phrase 臺灣 (“Taiwan”) is 2 characters but 6 bytes. Add a space and the English word Taipei (臺灣 Taipei) and you get 9 characters but 13 bytes, because the seven ASCII characters (the space plus six letters) cost 1 byte each.
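You can reproduce these figures in a browser console or Node.js with a few lines. The helper name countCharsAndBytes is made up for this sketch, and the second string includes a space, which is how the ASCII count reaches seven.

```javascript
const encoder = new TextEncoder();

// Hypothetical helper comparing the character count with the UTF-8 byte count.
function countCharsAndBytes(text) {
  return {
    characters: [...text].length,        // code points, not UTF-16 units
    bytes: encoder.encode(text).length,  // UTF-8 bytes
  };
}

console.log(countCharsAndBytes("臺灣"));         // characters: 2, bytes: 6
console.log(countCharsAndBytes("臺灣 Taipei"));  // characters: 9, bytes: 13
```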

Big5 to UTF-8 migration — a critical planning consideration

Taiwan’s legacy Big5 encoding stores most Traditional characters in 2 bytes, while UTF-8 requires 3. This means that a database or file that has historically been sized for Big5 content will not accommodate the same text after a migration to UTF-8 without column width increases. A column sized for 255 Big5 bytes (about 127 Traditional characters) must grow to at least 381 bytes to hold the same 127 characters in UTF-8 — an increase of roughly 50%.

Before running a Big5-to-UTF-8 migration:

Export a sample of the longest values from each column you plan to migrate.
Run them through this byte counter to confirm the actual UTF-8 byte lengths.
Resize columns in your target database before importing.
Re-check any application-level max_length validations that were sized for the Big5 era.

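Skipping these steps is a common cause of truncation errors that are particularly difficult to debug, because they only affect values whose Chinese content pushes the UTF-8 length past the column width. Short or ASCII-heavy rows migrate cleanly, so the problem often goes unnoticed in testing.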
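The checklist above can be approximated in code. The sketch below is only an illustration: the function names and sample values are hypothetical, and it checks strings you have already exported rather than talking to a database.

```javascript
const encoder = new TextEncoder();

// UTF-8 bytes needed for the Han characters a Big5 column could hold:
// each 2-byte Big5 character becomes 3 bytes. A leftover odd byte could
// still hold one ASCII character, which would add 1 more byte.
function utf8BytesForBig5Capacity(big5Bytes) {
  return Math.floor(big5Bytes / 2) * 3;
}

// Return the sample values whose UTF-8 encoding exceeds the target width.
function findOversizedValues(values, maxUtf8Bytes) {
  return values
    .map((value) => ({ value, bytes: encoder.encode(value).length }))
    .filter((row) => row.bytes > maxUtf8Bytes);
}

console.log(utf8BytesForBig5Capacity(255)); // 381

// Illustrative sample rows: 6 bytes, 18 bytes, and 381 bytes.
const samples = ["Taipei", "臺北市信義區", "臺".repeat(127)];
const offenders = findOversizedValues(samples, 255);
console.log(offenders.length);   // 1
console.log(offenders[0].bytes); // 381
```

Running a check like this against the longest exported values shows which columns need resizing before the import rather than after it.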

Hong Kong Supplementary Characters

Hong Kong traditionally uses several hundred characters from the Hong Kong Supplementary Character Set (HKSCS): characters used in Cantonese place names, personal names, and colloquial writing, many of which fall outside the basic Unicode CJK block. Some of these live in Unicode’s supplementary planes (above U+FFFF) and therefore cost 4 bytes in UTF-8 rather than the usual 3. The breakdown table in this tool shows 4-byte characters separately, so if your text includes uncommon Cantonese characters you can see their exact byte contribution. This is particularly relevant for systems that handle personal names or address data from Hong Kong, where a single unusual character in a name can unexpectedly push a field over a byte limit.
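If you want to flag such characters in your own validation code, a sketch along these lines may help. The function name findSupplementaryChars is hypothetical, and U+282E2 (a CJK Extension B code point) is used purely as a sample character.

```javascript
// Hypothetical helper: list the characters above U+FFFF, which take
// 4 bytes in UTF-8 instead of the usual 3 for Han characters.
function findSupplementaryChars(text) {
  const found = [];
  for (const ch of text) {                 // iterate by code point
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0xffff) {
      found.push({
        char: ch,
        codePoint: "U+" + codePoint.toString(16).toUpperCase(),
      });
    }
  }
  return found;
}

const text = "搭\u{282E2}";                // one BMP character plus one from plane 2
const hits = findSupplementaryChars(text);
console.log(hits.length);                            // 1
console.log(hits[0].codePoint);                      // U+282E2
console.log(new TextEncoder().encode(text).length);  // 7 (3 + 4 bytes)
```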

SMS and messaging with Traditional Chinese

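Traditional Chinese text sent via SMS or certain messaging APIs follows the same UCS-2 rule as Simplified Chinese: each character uses 2 bytes in UCS-2, and a single SMS segment holds 70 characters (versus 160 for GSM-7 ASCII). When a message is split into several concatenated segments, each part typically carries 67 characters, because the concatenation header uses part of the space. The byte counter helps you measure UTF-8 storage costs, but for SMS segment counting use the character count (not the UTF-8 byte count) and apply the per-segment limit. Characters above U+FFFF count as two units each.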
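A segment estimate can be sketched as below. Note the assumptions: the function name ucs2SmsSegments is hypothetical, the text is assumed to be sent as UCS-2 (true for any message containing Han characters), and the figure of 67 units per part for concatenated messages comes from the usual 6-byte concatenation header, which is not stated in the rule above. Carriers and messaging APIs may differ, so treat the result as an estimate.

```javascript
// Hypothetical estimator for UCS-2 encoded SMS.
// Assumptions: 70 units fit in a single message; concatenated messages
// carry 67 units per part because of the concatenation header.
function ucs2SmsSegments(text) {
  const units = text.length;   // String.length counts UTF-16 code units,
                               // so characters above U+FFFF count as 2
  if (units === 0) return 0;
  if (units <= 70) return 1;
  return Math.ceil(units / 67);
}

console.log(ucs2SmsSegments("臺".repeat(70)));   // 1
console.log(ucs2SmsSegments("臺".repeat(71)));   // 2
console.log(ucs2SmsSegments("臺".repeat(135)));  // 3
```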