Vietnamese stacks tone marks and vowel modifiers onto base letters, producing characters like ộ, ự, and ằ. A naive length check can count one of these as several characters, which is wrong for anything a human reads. This counter measures user-perceived characters — graphemes — alongside code points and bytes.
How it works
The tool uses Unicode grapheme segmentation to group each visible character, including a base letter plus all its combining marks, into one unit:
ộ = o + ◌̂ (circumflex) + ◌̣ (dot below) → 3 code points when decomposed → 1 grapheme
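As a rough illustration (not the tool's own code), Python's standard unicodedata module can list the code points behind a decomposed character. The helper name describe is hypothetical. In canonical (NFD) order the dot below sorts before the circumflex; either order renders identically.

```python
import unicodedata


def describe(char: str) -> list[str]:
    """Return the Unicode names of the code points in the NFD form of `char`."""
    decomposed = unicodedata.normalize("NFD", char)
    return [unicodedata.name(cp) for cp in decomposed]


# "\u1ed9" is the precomposed letter o with circumflex and dot below.
for name in describe("\u1ed9"):
    print(name)
# LATIN SMALL LETTER O
# COMBINING DOT BELOW
# COMBINING CIRCUMFLEX ACCENT
```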
It reports three figures: graphemes (what a reader counts), code points (individual Unicode values), and UTF-8 bytes (storage size). For Vietnamese these often differ, especially when text is stored in decomposed (NFD) form rather than precomposed (NFC).
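A minimal sketch of the three figures, assuming Python and only the standard library. It is not the tool's implementation: instead of full Unicode grapheme segmentation (UAX #29) it uses a simpler approximation, counting every code point whose combining class is zero. That is enough for Vietnamese base letters with stacked marks but not for emoji sequences or Hangul jamo. The function name count_text is hypothetical.

```python
import unicodedata


def count_text(text: str) -> dict[str, int]:
    """Count approximate graphemes, code points, and UTF-8 bytes in `text`."""
    # Approximation: a code point with a non-zero canonical combining class
    # attaches to the preceding base letter, so only class-0 code points
    # start a new user-perceived character.
    graphemes = sum(1 for cp in text if unicodedata.combining(cp) == 0)
    return {
        "graphemes": graphemes,
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


precomposed = "m\u1ed9t"          # NFC: m, o-circumflex-dot-below, t
decomposed = "mo\u0323\u0302t"    # NFD: m, o, dot below, circumflex, t

print(count_text(precomposed))
# {'graphemes': 3, 'code_points': 3, 'utf8_bytes': 5}
print(count_text(decomposed))
# {'graphemes': 3, 'code_points': 5, 'utf8_bytes': 7}
```

The grapheme figure stays at 3 in both forms, while the code-point and byte figures change with the normalization form.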
Example and tips
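The word một (“one”) is three graphemes (m, ộ, t) even when it is stored as five code points in NFD. Use the grapheme count for word-length and display purposes, the byte count for storage limits that are defined in bytes, and the code-point count when debugging encoding. Check the unit before relying on the byte figure: SMS segments are counted in GSM-7 septets or UCS-2 code units rather than UTF-8 bytes, and some databases define varchar limits in characters while others use bytes. If two seemingly identical strings count differently, one is probably NFC and the other NFD; normalize both to the same form before comparing.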
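The normalization advice can be sketched in Python with the standard unicodedata module. The helper name same_text is hypothetical, and normalizing to NFC here is an assumption; NFD works equally well as long as both sides use the same form.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


nfc = "m\u1ed9t"          # precomposed: 3 code points
nfd = "mo\u0323\u0302t"   # decomposed: 5 code points

print(nfc == nfd)            # False: the raw code points differ
print(len(nfc), len(nfd))    # 3 5
print(same_text(nfc, nfd))   # True: identical once normalized
```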