Swahili Word Counter

Count words in agglutinative Swahili text with noun-class prefix detection

Count words, unique words, and sentences in Kiswahili text, and detect conjugated verbs by their subject-concord prefix chain (ni-, wa-, ki-, vi-…) plus tense markers. Runs locally in your browser.

Why does Swahili need special word counting?

Swahili is agglutinative: a single verb can pack the subject, tense, object and root into one word, such as ninakupenda (“I love you”). Words are still separated by spaces, so counting them is straightforward. Understanding the verbs, however, means reading the prefix chain, which this tool surfaces.

Swahili (Kiswahili) is an agglutinative Bantu language, so a great deal of grammar is packed into single words. A verb like ninakupenda (“I love you”) carries the subject (ni-), tense (-na-), object (-ku-) and root (-penda) in one token. This counter gives the usual word, sentence and character totals and additionally detects conjugated verbs by their subject-concord prefix.

How it works

Words are tokenised on whitespace, keeping only tokens that contain at least one letter, so a token made up purely of punctuation does not inflate the count. Unique words are found by lowercasing each token and stripping its non-letters before deduplicating. Sentences are counted from runs of ., ! and ?, so consecutive terminators such as ?! count once.
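The counting rules above can be sketched in a few lines of Python. This is only an illustration of the description, not the tool's own code: the function name count_text and the returned keys are hypothetical, and "letter" is assumed to mean whatever Python's str.isalpha accepts.

```python
import re

# A run of one or more terminators counts as a single sentence end.
SENTENCE_END = re.compile(r"[.!?]+")


def count_text(text: str) -> dict:
    """Return word, unique-word and sentence totals for `text`."""
    # Split on whitespace and keep only tokens with at least one letter,
    # so a lone "-" or "..." is not counted as a word.
    tokens = [t for t in text.split() if any(ch.isalpha() for ch in t)]

    # Normalise each token (drop non-letters, lowercase) before deduplicating.
    unique = {"".join(ch for ch in t if ch.isalpha()).lower() for t in tokens}

    return {
        "words": len(tokens),
        "unique_words": len(unique),
        "sentences": len(SENTENCE_END.findall(text)),
    }


if __name__ == "__main__":
    print(count_text("Habari! Habari gani?? - sawa."))
```

Under these assumptions, "Habari!" and "Habari" collapse into one unique word, the lone "-" is ignored, and "??" counts as a single sentence end.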

For verb detection the tool scans each word for a known subject-concord prefix (ni-, u-, a-, tu-, m-, wa-, ki-, vi-, li-, ya-, zi-, ku- and more), checked longest-first so wa is not mistaken for a. If that prefix is immediately followed by a tense or aspect marker (na, li, me, ta, hu, ku, nge, ki, ka), the word is treated as a conjugated verb and tallied under its prefix.
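A minimal Python sketch of this heuristic follows. It uses only the prefixes and markers named above (the "and more" entries are not filled in), and the names detect_prefix and tally_verbs are hypothetical. It also assumes that the first matching prefix decides the outcome, with no fallback to a shorter prefix if the marker check fails; the tool itself may behave differently.

```python
from collections import Counter

# Subject-concord prefixes named in the text, sorted longest-first so that
# a longer prefix is always tried before any shorter one it begins with.
SUBJECT_PREFIXES = sorted(
    ["ni", "u", "a", "tu", "m", "wa", "ki", "vi", "li", "ya", "zi", "ku"],
    key=len,
    reverse=True,
)

# Tense and aspect markers named in the text.
TENSE_MARKERS = ("na", "li", "me", "ta", "hu", "ku", "nge", "ki", "ka")


def detect_prefix(word: str):
    """Return the subject prefix if `word` looks like a conjugated verb, else None."""
    # Detection is case-insensitive and ignores attached punctuation.
    letters = "".join(ch for ch in word if ch.isalpha()).lower()
    for prefix in SUBJECT_PREFIXES:
        if letters.startswith(prefix):
            rest = letters[len(prefix):]
            # Assumption: only the first matching prefix is considered.
            return prefix if rest.startswith(TENSE_MARKERS) else None
    return None


def tally_verbs(text: str) -> Counter:
    """Count detected verbs per subject prefix."""
    hits = (detect_prefix(token) for token in text.split())
    return Counter(p for p in hits if p is not None)


if __name__ == "__main__":
    print(tally_verbs("Watoto wanacheza uwanjani."))
    print(detect_prefix("kitabu"))
```

Because the check is purely positional, the noun kitabu (“book”) is flagged under ki-, since ki- happens to be followed by ta. This is the kind of false positive a prefix-plus-marker heuristic produces.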

Example

The sentence:

Watoto wanacheza uwanjani.

counts as three words. The verb wanacheza begins with the class-2 prefix wa- followed by the present marker -na-, so it is detected as a conjugated verb under the cl.2 prefix.

Notes

  • Detection is a heuristic. A noun that coincidentally starts with a prefix plus marker may be flagged, and some forms may be missed.
  • The breakdown is most useful for spotting how many distinct noun classes a passage’s verbs agree with.
  • Prefix detection is case-insensitive, but the raw word total is taken from the original tokens as typed.