Vietnamese Character Counter

Count Vietnamese characters with stacked-diacritic awareness

Counts user-perceived Vietnamese characters (graphemes), so a stacked letter like ộ counts as one character even though it may be stored as several Unicode code points. Also reports code-point and UTF-8 byte totals.

Why does ộ count as one character?

A reader sees ộ as a single letter, so it should count as one grapheme. Internally it may be stored as one precomposed code point (U+1ED9) or as the letter o followed by a combining circumflex and a combining dot below, which is three code points. Grapheme counting groups either form into the one character a person perceives.
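To see the two storage forms side by side, here is a minimal Python sketch using only the standard unicodedata module. The helper name describe_code_points is hypothetical and chosen for illustration; it is not part of this tool. Note that the normalized decomposed form (NFD) places the dot below before the circumflex; both orderings are canonically equivalent and render as the same letter.

```python
import unicodedata

def describe_code_points(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

precomposed = "\u1ED9"                                  # ộ as one code point (NFC)
decomposed = unicodedata.normalize("NFD", precomposed)  # ộ as base letter + marks

for line in describe_code_points(precomposed):
    print(line)
print("---")
for line in describe_code_points(decomposed):
    print(line)
```

The first listing has a single entry and the second has three, yet both strings display as the same letter.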

Vietnamese stacks tone marks and vowel modifiers onto base letters, producing characters like ộ, ự, and ằ. A naive length check, which typically counts code units or code points, can report one of these as several characters, which is wrong for anything a human reads. This counter measures user-perceived characters (graphemes) alongside code points and bytes.

How it works

The tool uses Unicode grapheme segmentation to group each visible character, including a base letter plus all its combining marks, into one unit:

ộ  = o + ◌̂ (circumflex) + ◌̣ (below dot)  → 3 code points → 1 grapheme
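The grouping step can be sketched in Python as follows. This is a simplified stand-in, not the tool's actual implementation and not full Unicode grapheme segmentation (UAX #29): it assumes every cluster is a base character followed by combining marks, which holds for Vietnamese letters but not for emoji sequences, Hangul jamo, or CRLF pairs. The function name split_clusters is hypothetical.

```python
import unicodedata

def split_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Assumption: base + combining-mark sequences only. This is a
    simplification of UAX #29 that is adequate for Vietnamese text.
    """
    clusters: list[str] = []
    for ch in text:
        # combining() returns a non-zero class for the Vietnamese tone
        # and vowel marks (circumflex, breve, horn, dot below, and so on).
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch       # attach the mark to the current cluster
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

stacked_o = "o\u0302\u0323"          # o + circumflex + dot below
print(len(stacked_o))                   # number of code points
print(len(split_clusters(stacked_o)))   # number of clusters
```

For production use, a library that implements the full segmentation rules is the safer choice; the sketch only illustrates why three code points collapse into one counted character.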

It reports three figures: graphemes (what a reader counts), code points (individual Unicode values), and UTF-8 bytes (storage size). For Vietnamese these often differ: in precomposed (NFC) form each accented letter is one code point but two or three UTF-8 bytes, and in decomposed (NFD) form the code-point count rises as well.
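As a rough illustration of the three figures, the Python sketch below counts them for the same word in both normalization forms. The function name count_text and the returned keys are assumptions made for this example. The grapheme figure is approximated by counting code points that are not combining marks, which matches reader-perceived characters for ordinary Vietnamese text but is not a general grapheme algorithm.

```python
import unicodedata

def count_text(text: str) -> dict[str, int]:
    """Return approximate grapheme, code-point, and UTF-8 byte counts."""
    # Approximation: every non-combining code point starts one grapheme.
    graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return {
        "graphemes": graphemes,
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }

word = "Vi\u1EC7t"  # "Việt"
for form in ("NFC", "NFD"):
    print(form, count_text(unicodedata.normalize(form, word)))
```

Under these assumptions, the word is 4 graphemes in both forms, but 4 code points and 6 bytes in NFC versus 6 code points and 8 bytes in NFD.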

Example and tips

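The word một (“one”) is three graphemes (m, ộ, t) even when it is stored in NFD as five code points. Use the grapheme count for word-length and display purposes, the byte count for storage limits that are defined in bytes, and the code-point count when debugging encoding. Check how each limit is defined before relying on a figure: some databases measure varchar lengths in characters rather than bytes, and SMS segments for Vietnamese text are measured in UTF-16 code units, not UTF-8 bytes. If two seemingly identical strings count differently, one is probably NFC and the other NFD, so normalize both to the same form before comparing.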
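The normalize-before-comparing tip can be sketched in Python like this. The helper name same_text is hypothetical, and choosing NFC as the common form is an assumption; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfc_word = "m\u1ED9t"            # một with a precomposed ộ (3 code points)
nfd_word = "mo\u0323\u0302t"     # một with o + dot below + circumflex (5 code points)

print(nfc_word == nfd_word)           # raw comparison: False
print(same_text(nfc_word, nfd_word))  # normalized comparison: True
print(len(nfc_word), len(nfd_word))   # code-point counts: 3 5
```

Normalization also reorders marks, so a string typed as o + circumflex + dot below compares equal to the other two once normalized.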