Swahili Character Counter

Count characters in Latin-script Swahili text with digraph awareness

Count Swahili (Kiswahili) characters, spaces, and UTF-8 bytes while flagging digraphs like ch, sh, ng, ny, mb and nd that are written with two letters but stand for a single sound. Runs in your browser.

What is a Swahili digraph?

A digraph is a pair of letters that together represent one sound. Swahili uses several, including ch, sh, ng, ng', ny, mb, nd, nj and th. Although each takes two characters on the page (three for ng', counting the apostrophe), it is treated as a single phoneme when spoken.

Swahili (Kiswahili) is written in the Latin alphabet, so an ordinary character count treats every letter separately, including the letters that make up a digraph such as ch, sh, ng, ny, mb or nd. This counter reports the usual character and byte totals and also detects those digraphs so you can see a phoneme-aware length.

How it works

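Characters are counted using Unicode-aware string handling, so each code point counts once. Bytes are the UTF-8 length from TextEncoder. Because Swahili uses plain Latin letters, the byte count equals the character count for text that is entirely ASCII; the two differ only when the text contains non-ASCII characters, such as a typographic apostrophe (’), which takes three bytes in UTF-8.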
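The counting described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the function name countBasics and the BasicCounts shape are hypothetical, and counting only U+0020 as a space is an assumption.

```typescript
// Sketch only: countBasics and BasicCounts are hypothetical names.
interface BasicCounts {
  characters: number; // Unicode code points
  spaces: number;     // U+0020 only (assumption; other whitespace is ignored here)
  utf8Bytes: number;  // length of the UTF-8 encoding
}

function countBasics(text: string): BasicCounts {
  // Array.from iterates a string by code point, so a surrogate pair counts once.
  const codePoints = Array.from(text);
  return {
    characters: codePoints.length,
    spaces: codePoints.filter((c) => c === " ").length,
    // TextEncoder always encodes to UTF-8.
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}
```

For an all-ASCII phrase such as "ngoma ya asili" this gives 14 characters, 2 spaces and 14 bytes. If the same text is typed with a typographic apostrophe, as in ng’ombe, the byte total exceeds the character total by two.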

For digraphs, the text is scanned left to right. At each position the tool checks the longest digraphs first (so ng' is matched before ng). When a digraph is found it is counted once and the scan jumps past all of its characters. The phoneme-aware length is the number of graphic units that result when every detected digraph collapses to one.
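A left-to-right, longest-match-first scan of this kind might look like the sketch below. The names scanDigraphs, ScanResult and DIGRAPHS are hypothetical, the digraph list is taken from the one given on this page, and matching only the straight apostrophe in ng' is an assumption.

```typescript
// Sketch only: names and the exact digraph list are assumptions.
// The longest entry comes first so that "ng'" wins over "ng".
const DIGRAPHS = ["ng'", "ch", "sh", "ng", "ny", "mb", "nd", "nj", "th"];

interface ScanResult {
  units: string[];      // graphic units after collapsing each digraph to one
  digraphCount: number; // how many digraphs were detected
}

function scanDigraphs(text: string): ScanResult {
  const chars = Array.from(text); // code points, not UTF-16 code units
  const units: string[] = [];
  let digraphCount = 0;
  let i = 0;
  while (i < chars.length) {
    // Compare case-insensitively against each candidate, longest first.
    const hit = DIGRAPHS.find(
      (d) => chars.slice(i, i + d.length).join("").toLowerCase() === d
    );
    if (hit !== undefined) {
      units.push(chars.slice(i, i + hit.length).join(""));
      digraphCount += 1;
      i += hit.length; // jump past every character of the digraph
    } else {
      units.push(chars[i]);
      i += 1;
    }
  }
  return { units, digraphCount };
}
```

With this sketch, scanDigraphs("ngoma") yields the units ng, o, m, a with one digraph detected, and scanDigraphs("ng'ombe") yields ng', o, mb, e with two. The phoneme-aware length is simply units.length.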

Example

The word:

ngoma

has five raw characters: n, g, o, m, a. Because ng is a digraph, the tool reports one digraph detected and a phoneme-aware length of four units (ng, o, m, a).

Notes

  • The apostrophe form ng' (the velar nasal) is recognised as its own unit, separate from plain ng. It spans three characters, counting the apostrophe.
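  • The UTF-8 byte count is the figure to check against limits defined in bytes, such as database columns sized in bytes. SMS segment limits depend on GSM-7 or UCS-2 encoding rather than UTF-8, so the byte count is not an SMS length. The character count is best for word-processing length checks.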
  • Counting is case-insensitive for digraph detection, so Ng and ng are treated the same.