Vietnamese is a tonal language written with a rich set of diacritics. It has six tones: level, falling, rising, dipping-rising, broken and heavy. The level tone is unmarked; the other five are written as a mark above or below the vowel, and they change a word’s meaning completely. A frequency counter that strips accents would wrongly merge ba (three), bà (grandmother) and bá (to embrace) into one count. This tool keeps every tone and vowel mark intact, so each Vietnamese word is counted as the distinct word it is.
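To make the merging problem concrete, here is a minimal sketch of the kind of accent folding described above. The helper name `fold_accents` is hypothetical and is not part of this tool; it only shows what a diacritic-stripping counter would do to the three example words.

```python
import unicodedata
from collections import Counter

def fold_accents(word: str) -> str:
    """Strip combining marks: the accent folding this tool avoids (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["ba", "bà", "bá"]

folded = Counter(fold_accents(w) for w in words)  # accents stripped before counting
kept = Counter(words)                             # accents left intact

print(folded)  # Counter({'ba': 3})
print(kept)    # Counter({'ba': 1, 'bà': 1, 'bá': 1})
```

With folding, three different words collapse into a single key; without it, each keeps its own count.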
How it works
- The text is lower-cased so that capitalisation at the start of a sentence does not split a word into two counts.
- It is then split on anything that is not a Vietnamese letter, which includes spaces, punctuation, digits and symbols. The full Vietnamese alphabet is preserved: the base Latin letters plus every accented vowel (à á ả ã ạ â ầ ấ ẩ ẫ ậ ă …, and the same for e, i, o, u, y) and the letter đ.
- Each surviving token (one whitespace-separated syllable) is tallied. Crucially, no accent folding happens: ba and bà remain separate keys in the tally.
The result is sorted from most to least frequent, with each word’s share of the total shown as a percentage.
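The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the tool's actual source: the names `count_words`, `VOWELS`, `LETTERS` and `NON_LETTER` are hypothetical, the letter set is assumed to be a to z, đ, and the 12 Vietnamese vowels in each of the six tones, and the NFC normalisation step is an added assumption so that a letter typed as base plus combining mark is treated as a single character.

```python
import re
import unicodedata
from collections import Counter

# Assumed letter set: a-z, the letter đ, and the 12 Vietnamese vowels
# combined with each of the 6 tones (the level tone adds no mark).
BASE_VOWELS = "aăâeêioôơuưy"
TONE_MARKS = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]
VOWELS = {
    unicodedata.normalize("NFC", vowel + tone)
    for vowel in BASE_VOWELS
    for tone in TONE_MARKS
}
LETTERS = set("abcdefghijklmnopqrstuvwxyz") | {"đ"} | VOWELS

# Any run of characters outside the letter set acts as a separator.
NON_LETTER = re.compile("[^" + "".join(sorted(LETTERS)) + "]+")

def count_words(text: str) -> list[tuple[str, int, float]]:
    """Return (word, count, percent of total) rows, most frequent first."""
    # Assumption: normalise to NFC so each accented letter is one code point.
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = [tok for tok in NON_LETTER.split(normalized) if tok]
    tally = Counter(tokens)  # no accent folding: "ba" and "bà" stay separate keys
    total = sum(tally.values())
    return [(word, n, 100 * n / total) for word, n in tally.most_common()]

for word, n, share in count_words("Bà ba mua ba quả. Bà rất vui."):
    print(f"{word} ×{n} ({share:.1f}%)")
```

Building the vowel set from base vowels and tone marks avoids typing all 72 accented vowels by hand. Words with equal counts keep the order in which they first appear in the text.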
Tips and example
Paste “Bà ba mua ba quả. Bà rất vui.” and the counter reports bà ×2, ba ×2, mua ×1, quả ×1, rất ×1, vui ×1, correctly keeping bà and ba apart even though they differ only by a tone mark.
Use the ranking to catch repetition in your writing, build vocabulary lists for language learners, or verify that a text-processing pipeline is treating Vietnamese diacritics correctly.