Swahili (Kiswahili) is written in the Latin alphabet, but several of its sounds are spelled with two letters. These digraphs — such as ch, sh, ng, ny, mb and nd — look like two characters yet represent a single phoneme. This counter reports the usual character and byte totals and also detects those digraphs so you can see a phoneme-aware length.
How it works
Characters are counted using Unicode-aware string handling, so each code point
counts once. Bytes are the UTF-8 length from TextEncoder; because Swahili
uses plain Latin letters, each of which takes one byte in UTF-8, the byte count
equals the character count unless the text contains non-ASCII characters such
as curly apostrophes, accented letters or emoji.
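The two totals described above can be sketched as follows. This is a minimal illustration, not the tool's actual source: the function name countCharsAndBytes is hypothetical, and it assumes an environment where TextEncoder is available as a global (modern browsers and current Node.js).

```javascript
// Hypothetical helper: returns the code point count and the UTF-8 byte count.
function countCharsAndBytes(text) {
  // Spreading a string iterates by code point, so a surrogate pair counts once.
  const characters = [...text].length;
  // TextEncoder always produces UTF-8; the array length is the byte count.
  const bytes = new TextEncoder().encode(text).length;
  return { characters, bytes };
}

// ASCII-only Swahili text: both totals match.
console.log(countCharsAndBytes("ngoma"));     // { characters: 5, bytes: 5 }
// A non-ASCII letter (é, U+00E9) takes two bytes, so the totals diverge.
console.log(countCharsAndBytes("caf\u00e9")); // { characters: 4, bytes: 5 }
```

Counting with the spread operator rather than text.length matters only for characters outside the Basic Multilingual Plane, such as emoji, which text.length would count as two.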
For digraphs, the text is scanned left to right. At each position the tool
checks the longest digraphs first (so ng' is matched before ng). When a
digraph is found it is counted once and the scan jumps past all of its
characters. The phoneme-aware length is the number of graphic units that
remain when every detected digraph collapses to a single unit.
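A left-to-right, longest-match scan of this kind might look like the sketch below. The names DIGRAPHS and scanDigraphs are hypothetical, and the list contains only the digraphs named in this text plus ng'; the tool's full list is not given here, so treat it as an assumption.

```javascript
// Illustrative list only (assumption): the digraphs named in the text plus ng'.
// Sorted longest first so that ng' is tried before ng.
const DIGRAPHS = ["ch", "sh", "ng", "ny", "mb", "nd", "ng'"]
  .sort((a, b) => b.length - a.length);

// Hypothetical helper: returns the detected digraphs and the phoneme-aware length.
function scanDigraphs(text) {
  // Lowercase once so that Ng and ng match the same entry. This assumes
  // lowercasing does not change the string length, which holds for the
  // plain Latin letters Swahili uses.
  const lower = text.toLowerCase();
  const digraphs = [];
  let units = 0;
  let i = 0;
  while (i < lower.length) {
    const match = DIGRAPHS.find((d) => lower.startsWith(d, i));
    if (match) {
      digraphs.push(match);
      i += match.length; // jump past every character of the digraph
    } else {
      // Advance by one whole code point (two UTF-16 units for astral characters).
      i += lower.codePointAt(i) > 0xffff ? 2 : 1;
    }
    units += 1; // a digraph or a single character is one graphic unit
  }
  return { digraphs, phonemeAwareLength: units };
}

console.log(scanDigraphs("ngoma"));
// { digraphs: [ 'ng' ], phonemeAwareLength: 4 }
console.log(scanDigraphs("Ng'ombe"));
// { digraphs: [ "ng'", 'mb' ], phonemeAwareLength: 4 }
```

Because the scan is purely orthographic, it will also merge letter pairs that happen to straddle a syllable or word boundary; a stricter version would need linguistic rules that this sketch does not attempt.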
Example
The word:
ngoma
has five raw characters: n, g, o, m, a. Because ng is a digraph, the tool
reports one digraph detected and a phoneme-aware length of four units (ng, o, m,
a).
Notes
- The apostrophe form ng' (the velar nasal) is recognised as its own digraph.
- The byte count is the figure to use for SMS segments and database column limits; the character count is best for word-processing length checks.
- Counting is case-insensitive for digraph detection, so Ng and ng are treated the same.