Vietnamese Word Frequency Counter

Count word frequencies in Vietnamese — each tonal syllable is a separate type

Ranks word frequencies in Vietnamese text and treats tonal variants such as ba, bà, bá, bả, bã and bạ as distinct words: diacritics are preserved and only letter case is folded.

Why are ba and bà counted as different words?

In Vietnamese the tone mark changes the meaning entirely — ba means three, bà means grandmother. They are genuinely different words, so the counter treats every tonal and diacritic variant as a distinct type rather than stripping accents.

Vietnamese is a tonal language written with a rich set of diacritics. Of the six tones (level, falling, rising, dipping-rising, broken and heavy), the level tone is unmarked and the other five are marked above or below the vowel, and they change a word’s meaning completely. A frequency counter that strips accents would wrongly merge ba (three), bà (grandmother) and bá (to embrace) into one count. This tool keeps every tone and vowel mark intact so each Vietnamese word is counted as the distinct word it truly is.
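To make the problem concrete, here is a minimal Python sketch of the lossy step this tool avoids. The helper name strip_accents is hypothetical and is not part of the tool; it simply removes combining marks after Unicode decomposition, which is a common way accent stripping is done.

```python
import unicodedata

def strip_accents(word):
    """Hypothetical helper: drop every combining mark (the lossy step to avoid)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Six distinct Vietnamese syllables that differ only by tone mark.
words = ["ba", "bà", "bá", "bả", "bã", "bạ"]

# After stripping, all six collapse into the single key "ba".
folded = {strip_accents(w) for w in words}
print(len(set(words)), "distinct words before,", len(folded), "after")
```

Six distinct words go in and a single key comes out, which is exactly the merge described above.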

How it works

  1. The text is lower-cased so that capitalisation at the start of a sentence does not split a word into two counts.
  2. It is then split on anything that is not a Vietnamese letter — this includes spaces, punctuation, digits and symbols. The full Vietnamese alphabet is preserved: the base Latin letters plus every accented vowel (à á ả ã ạ â ầ ấ ẩ ẫ ậ ă …, and the same for e, i, o, u, y) and the letter đ.
  3. Each surviving token (in practice one whitespace-separated syllable) is tallied. Crucially, no accent folding happens: ba and bà remain separate keys in the tally.
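The three steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the tool's actual source: the names vietnamese_letters, TOKEN_RE and tokenize are hypothetical, the alphabet is built by composing each vowel with each tone mark, and the input is first normalised to NFC (an added assumption) so that decomposed input is not split at its combining marks.

```python
import re
import unicodedata

# Five written tone marks: grave, acute, hook above, tilde, dot below.
TONE_MARKS = "\u0300\u0301\u0309\u0303\u0323"
# The twelve Vietnamese vowel letters, normalised so each is one code point.
VOWEL_BASES = unicodedata.normalize("NFC", "aăâeêioôơuưy")

def vietnamese_letters():
    """Build the lower-case alphabet: base Latin letters, đ, and every toned vowel."""
    letters = set("abcdefghijklmnopqrstuvwxyz") | {"\u0111"}  # U+0111 is đ
    for base in VOWEL_BASES:
        letters.add(base)
        for mark in TONE_MARKS:
            # NFC composes base + mark into a single precomposed letter.
            letters.add(unicodedata.normalize("NFC", base + mark))
    return letters

LETTERS = vietnamese_letters()
# One token is a maximal run of Vietnamese letters; everything else separates.
TOKEN_RE = re.compile("[" + "".join(sorted(LETTERS)) + "]+")

def tokenize(text):
    """Lower-case, then split on anything that is not a Vietnamese letter."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return TOKEN_RE.findall(normalized)

print(tokenize("Bà ba mua ba quả. Bà rất vui."))
```

Building the character class programmatically avoids hand-typing every accented vowel, and no step in tokenize removes a diacritic, so ba and bà come out as different strings.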

The result is sorted from most to least frequent, with each word’s share of the total shown as a percentage.
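A ranking of this kind could be produced as in the sketch below. The function name rank is hypothetical, and the tie-breaking rule (equal counts keep first-seen order, which is what Counter.most_common does) is an assumption, since the page does not say how ties are ordered.

```python
from collections import Counter

def rank(tokens):
    """Return (word, count, percent of all tokens), most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(word, n, 100 * n / total) for word, n in counts.most_common()]

# Tokens are assumed to be already lower-cased, with diacritics intact.
tokens = ["bà", "ba", "mua", "ba", "quả", "bà", "rất", "vui"]
for word, n, share in rank(tokens):
    print(f"{word} ×{n} ({share:.1f}%)")
```

With these eight tokens, bà and ba each account for 25.0% of the total and every other word for 12.5%.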

Tips and example

Paste "Bà ba mua ba quả. Bà rất vui." and the counter reports bà ×2, ba ×2, mua ×1, quả ×1, rất ×1, vui ×1, correctly keeping bà and ba apart even though they differ only by a tone mark.

Use the ranking to catch repetition in your writing, build vocabulary lists for language learners, or verify that a text-processing pipeline is treating Vietnamese diacritics correctly.