Swahili (Kiswahili) is an agglutinative Bantu language, so a great deal of grammar is packed into single words. A verb like ninakupenda (“I love you”) carries the subject (ni-), tense (-na-), object (-ku-) and root (-penda) in one token. This counter gives the usual word, sentence and character totals and additionally detects conjugated verbs by their subject-concord prefix.
How it works
Words are tokenised on whitespace, keeping only tokens that contain at least
one letter, so a token made up solely of punctuation does not inflate the
count. Unique words are found by lowercasing each token and stripping
non-letters before deduplicating. Sentences are counted as runs of ., ! and ?,
so a sequence such as ?! or ... counts once.
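The counting rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name text_counts and the dictionary keys are hypothetical, and the character total is assumed to be the length of the raw text including spaces and punctuation.

```python
import re


def text_counts(text: str) -> dict:
    """Hypothetical helper returning word, unique-word, sentence and character totals."""
    # Split on whitespace and keep only tokens containing at least one letter.
    tokens = [t for t in text.split() if any(ch.isalpha() for ch in t)]
    # Normalise for uniqueness: lowercase, then drop every non-letter character.
    unique = {"".join(ch for ch in t.lower() if ch.isalpha()) for t in tokens}
    # Each run of terminators is one match, so "?!" or "..." counts once.
    sentences = len(re.findall(r"[.!?]+", text))
    return {
        "words": len(tokens),
        "unique_words": len(unique),
        "sentences": sentences,
        # Assumption: characters means the raw length, spaces included.
        "characters": len(text),
    }


print(text_counts("Habari! Habari gani?!"))
```

For the input shown, the sketch reports three words, two unique words (habari, gani) and two sentences.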
For verb detection the tool scans each word for a known subject-concord
prefix (ni-, u-, a-, tu-, m-, wa-, ki-, vi-, li-, ya-, zi-, ku- and more),
checked longest-first so wa is not mistaken for a. If that prefix is
immediately followed by a tense or aspect marker (na, li, me, ta, hu, ku,
nge, ki, ka), the word is treated as a conjugated verb and tallied under its
prefix.
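A rough sketch of this heuristic, under stated assumptions: the two lists contain only the prefixes and markers named above (the tool's own lists may be longer), the function names subject_prefix and tally_verbs are hypothetical, and the first prefix that matches the start of the word is the only one tested against the markers.

```python
from collections import Counter

# Illustrative subsets taken from the description; the real lists may differ.
SUBJECT_PREFIXES = ["ni", "u", "a", "tu", "m", "wa", "ki", "vi", "li", "ya", "zi", "ku"]
TENSE_MARKERS = ["na", "li", "me", "ta", "hu", "ku", "nge", "ki", "ka"]

# Try longer prefixes before shorter ones.
PREFIXES_LONGEST_FIRST = sorted(SUBJECT_PREFIXES, key=len, reverse=True)


def subject_prefix(word: str):
    """Return the subject prefix if the word looks like a conjugated verb, else None."""
    # Case-insensitive matching on letters only.
    w = "".join(ch for ch in word.lower() if ch.isalpha())
    for prefix in PREFIXES_LONGEST_FIRST:
        if w.startswith(prefix):
            rest = w[len(prefix):]
            # The prefix must be immediately followed by a tense or aspect marker.
            if any(rest.startswith(marker) for marker in TENSE_MARKERS):
                return prefix
            return None  # assumption: only the first matching prefix is tested
    return None


def tally_verbs(text: str) -> Counter:
    """Count detected verbs per subject prefix."""
    counts = Counter()
    for token in text.split():
        prefix = subject_prefix(token)
        if prefix is not None:
            counts[prefix] += 1
    return counts


print(tally_verbs("Ninakupenda na watoto wanacheza"))
```

In this sketch ninakupenda is tallied under ni- and wanacheza under wa-, while na and watoto are ignored. A noun such as kitabu (ki- followed by ta) is also flagged, which illustrates the kind of false positive a purely prefix-based check produces.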
Example
The sentence:
Watoto wanacheza uwanjani.
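counts as three words. The verb wanacheza begins with the class-2 prefix
wa- followed by the present marker -na-, so it is detected as a conjugated
verb and tallied under the cl.2 prefix. Watoto and uwanjani also begin with
known prefixes (wa- and u-), but neither prefix is followed by a tense or
aspect marker, so they are not flagged.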
Notes
- Detection is a heuristic. A noun that coincidentally starts with a prefix plus marker may be flagged, and some forms may be missed.
- The breakdown is most useful for spotting how many distinct noun classes a passage’s verbs agree with.
- Prefix detection is case-insensitive, while the raw word total counts the tokens exactly as they appear in the text.