Why are ba and bà counted as different words?

In Vietnamese the tone mark changes the meaning entirely — ba means three, bà means grandmother. They are genuinely different words, so the counter treats every tonal and diacritic variant as a distinct type rather than stripping accents.

How is a word defined?

A word is a run of Vietnamese letters (including all accented vowels and đ) bounded by whitespace, punctuation or digits. Each whitespace-separated syllable is counted as one token, which matches how Vietnamese is written; multi-syllable compounds are therefore counted syllable by syllable.

Is counting case-sensitive?

No. Words are lower-cased before counting, so Nhà and nhà are merged. Tone and vowel diacritics are never folded away.

Does it handle Vietnamese punctuation?

Yes. Commas, periods, quotation marks and digits are treated as separators and discarded, leaving only the runs of Vietnamese letters that form words.

Is my text uploaded anywhere?

No. Tokenising and counting run entirely in your browser. Your text never leaves your device.

What is the Vietnamese Word Frequency Counter?

It ranks word frequencies in Vietnamese text and treats tonal variants such as ba, bà, bá, bả, bã and bạ as distinct words: case is folded, but diacritics are preserved. It runs free in your browser on Gera Tools, with nothing uploaded.

Vietnamese Word Frequency Counter

Name: Vietnamese Word Frequency Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Vietnamese is a tonal language written with a rich set of diacritics. The six tones (level, falling, rising, dipping-rising, broken and heavy) change a word’s meaning completely. Five of them are written as a mark above or below the vowel; the level tone is unmarked. A frequency counter that strips accents would wrongly merge ba (three), bà (grandmother) and bá (to embrace) into one count. This tool keeps every tone and vowel mark intact, so each Vietnamese word is counted as the distinct word it is.

How it works

The text is lower-cased so that capitalisation at the start of a sentence does not split a word into two counts.
It is then split on anything that is not a Vietnamese letter — this includes spaces, punctuation, digits and symbols. The full Vietnamese alphabet is preserved: the base Latin letters plus every accented vowel (à á ả ã ạ â ầ ấ ẩ ẫ ậ ă …, and the same for e, i, o, u, y) and the letter đ.
Each surviving token (one whitespace-separated syllable) is tallied. Crucially, no accent folding happens — ba and bà remain separate keys in the tally.

The result is sorted from most to least frequent, with each word’s share of the total shown as a percentage.
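The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual source: the function name count_words is hypothetical, "Vietnamese letter" is approximated as "any Unicode letter", and the text is normalised to NFC first so that decomposed accents are not mistaken for separators.

```python
import re
import unicodedata
from collections import Counter

# Assumption: any Unicode letter counts as a "Vietnamese letter".
# [^\W\d_] means "word character that is neither a digit nor an underscore".
LETTER_RUN = re.compile(r"[^\W\d_]+")

def count_words(text):
    """Return (word, count, percent) rows, most frequent first."""
    # Assumption: compose accents (NFC) so each accented vowel is one letter,
    # then lower-case. No accent folding is applied.
    normalised = unicodedata.normalize("NFC", text).lower()
    # Everything that is not a letter (spaces, punctuation, digits) separates tokens.
    tokens = LETTER_RUN.findall(normalised)
    tally = Counter(tokens)
    total = sum(tally.values())
    # most_common() sorts by count, highest first.
    return [(word, n, round(100 * n / total, 1)) for word, n in tally.most_common()]

for word, n, pct in count_words("Nhà nhà đều vui, 2 nhà rất vui."):
    print(f"{word}\t{n}\t{pct}%")
```

Because the tally is keyed on the accented string itself, nhà and nha would be separate keys; only capitalisation is merged.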

Why standard word counters fail on Vietnamese

Most word frequency tools are designed for European languages, where a “word” is simply any sequence of letters between spaces and diacritics (such as French accents or German umlauts) are often treated as safe to strip. Remove them and the result is usually still recognisable as the same word.

Vietnamese is fundamentally different in two ways:

Tone marks change meaning, not just sound. la (to shout), là (to be), lá (leaf), lả (to faint), lã (plain, as in nước lã, plain water), lạ (strange) are six completely different words that differ only in the tone mark on the vowel. A tool that normalises or strips diacritics would fold all six into a single count of la, producing meaningless results.
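To make the failure concrete, here is a minimal sketch of the accent-stripping step that many generic text tools apply. The helper name strip_diacritics is hypothetical; the technique shown (decompose to NFD, then drop combining marks) is one common way of folding accents, not a description of any particular tool.

```python
import unicodedata

def strip_diacritics(word):
    """Fold accents the way a generic normaliser might (the behaviour to avoid)."""
    # NFD splits each accented vowel into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Dropping the combining marks leaves only the bare base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["la", "là", "lá", "lả", "lã", "lạ"]

folded = {strip_diacritics(w) for w in words}
distinct = {unicodedata.normalize("NFC", w) for w in words}

print(len(distinct), "distinct words before folding")
print(len(folded), "key left after folding:", folded)
```

All six words collapse onto the single key la once the marks are removed, which is exactly the merge a Vietnamese-aware counter has to avoid.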

Vietnamese is written one syllable at a time. Unlike English, where a multi-syllable word is written as one unbroken string, every Vietnamese syllable is separated by a space and is generally a meaningful unit in its own right. The phrase sách giáo khoa (textbook) is written as three space-separated syllables. A naive space-split counter therefore works reasonably well, since each syllable token is meaningful, but compound words and multi-syllable proper nouns appear as separate tokens in the frequency count.
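A short sketch shows what syllable-level counting does to a compound. The sample phrase is made up for illustration, and plain whitespace splitting stands in for the tokeniser.

```python
from collections import Counter

# Illustrative phrase: "textbooks and workbooks".
text = "sách giáo khoa và sách bài tập"

# Each whitespace-separated syllable becomes one token.
syllables = text.lower().split()
tally = Counter(syllables)

print(len(syllables), "tokens")
print(tally.most_common(1))
# The compound never appears as a key, only its individual syllables do.
print("sách giáo khoa" in tally)
```

The syllable sách is counted twice because it occurs in both compounds, while neither compound is counted as a unit.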

Worked example

Paste Bà ba mua ba quả. Bà rất vui. and the counter reports eight tokens across six distinct words:

Word	Count	%
bà	2	25.0%
ba	2	25.0%
mua	1	12.5%
quả	1	12.5%
rất	1	12.5%
vui	1	12.5%

Note that bà (grandmother) and ba (three) are kept apart despite differing only in a tone mark. This is the critical distinction the tool preserves.

Use cases

Writing quality check. Paste an article or essay and scan the top-frequency words. Overused nouns or filler words stand out immediately.
Vocabulary list building for language learners. Paste a text at your study level and see which words appear most. High-frequency words are the ones worth learning first.
NLP pipeline testing. If you are building a Vietnamese text-processing pipeline, paste a sample and verify the counter returns the expected token-to-count mapping, confirming your pipeline preserves diacritics correctly throughout.
Comparative corpus analysis. Run two texts through the counter separately and compare their top-10 word lists to see how they differ in lexical focus.
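For the pipeline-testing use case, the comparison against an expected token-to-count mapping might look like the sketch below. The function name count_mismatches and the sample data are hypothetical. Both sides are normalised to NFC before comparing, on the assumption that a precomposed and a decomposed spelling of the same syllable should count as equal.

```python
import unicodedata
from collections import Counter

def nfc(word):
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", word)

def count_mismatches(pipeline_tokens, expected_counts):
    """Map each word whose count differs to an (expected, actual) pair."""
    actual = Counter(nfc(t) for t in pipeline_tokens)
    expected = Counter({nfc(w): n for w, n in expected_counts.items()})
    # A Counter returns 0 for missing keys, so absent words compare cleanly.
    words = set(actual) | set(expected)
    return {w: (expected[w], actual[w]) for w in words if expected[w] != actual[w]}

# Expected mapping for a sample, e.g. copied from the counter's output.
expected = {"bà": 2, "ba": 2}

# Hypothetical output of a pipeline that wrongly strips tone marks.
broken_pipeline_tokens = ["ba", "ba", "ba", "ba"]

print(count_mismatches(broken_pipeline_tokens, expected))
```

An empty result means the pipeline kept every diacritic; any entry points at a word whose marks were lost or altered along the way.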