Urdu is written in the Nastaliq style of the Arabic script but adds its own letters and layers optional diacritics on top of base letters. A naive character count mixes all of these together. This counter breaks the text down into the categories that actually matter.
How it works
The text is iterated one Unicode code point at a time. Each code point is sorted into a bucket:
- Combining diacritics (harakat / aerab) in ranges like U+064B–U+065F and the superscript alef U+0670 are counted as diacritics.
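- A curated set of letters that Urdu adds to the basic Arabic alphabet (ٹ ڈ ڑ ں ہ ھ ے گ پ چ ژ) is counted as Urdu-specific. Some of these, such as پ چ ژ گ, are also used in Persian, so the label means "not in the basic Arabic alphabet" rather than "exclusive to Urdu".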
- Any other Arabic-block letter (U+0600–U+06FF) is counted as shared.
The headline figure, characters excluding diacritics, is the total number of
code points minus the diacritics. It corresponds to the visible base characters;
spaces and punctuation are code points too, so they are included in it.
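The bucketing described above can be sketched in Python roughly as follows. This is a minimal illustration, not the counter's actual code: the names `classify` and `count_urdu`, the result keys, and the extra "other" bucket (spaces, digits, punctuation, non-Arabic characters) are assumptions, and only the two diacritic ranges named above are checked.

```python
import unicodedata
from collections import Counter

# Curated Urdu-specific letters from the list above, written as escapes
# so that look-alike code points cannot slip in.
URDU_SPECIFIC = frozenset(
    "\u0679\u0688\u0691\u06BA\u06C1\u06BE\u06D2\u06AF\u067E\u0686\u0698"
)


def is_diacritic(ch: str) -> bool:
    """True for U+064B to U+065F and the superscript alef U+0670."""
    cp = ord(ch)
    return 0x064B <= cp <= 0x065F or cp == 0x0670


def classify(ch: str) -> str:
    """Sort one code point into a bucket; the order of checks matters."""
    if is_diacritic(ch):
        return "diacritic"
    if ch in URDU_SPECIFIC:
        return "urdu_specific"
    # Any remaining letter (general category L*) in the Arabic block.
    if 0x0600 <= ord(ch) <= 0x06FF and unicodedata.category(ch).startswith("L"):
        return "shared"
    # Assumed catch-all: spaces, digits, punctuation, other scripts.
    return "other"


def count_urdu(text: str) -> dict:
    """Count code points per bucket; a Python str iterates by code point."""
    buckets = Counter(classify(ch) for ch in text)
    total = len(text)
    return {
        "total": total,
        "diacritics": buckets["diacritic"],
        "urdu_specific": buckets["urdu_specific"],
        "shared": buckets["shared"],
        "other": buckets["other"],
        "excluding_diacritics": total - buckets["diacritic"],
    }
```

Under these assumptions, a word with two pesh marks such as `"\u0627\u064F\u0631\u062F\u064F\u0648"` has 6 code points in total, 2 of them diacritics, so `excluding_diacritics` comes out as 4.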
Example and notes
For اردو ٹھیک ہے the counter reports the visible letters separately from any
aerab you add, and flags ٹ, ھ, ہ and ے as Urdu-specific. Note that the gol he (ہ) and
do-chashmi he (ھ) are distinct code points with different roles (gol he is the
consonant h, while do-chashmi he marks aspiration), so both are
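treated as Urdu-specific. If you are checking an SMS or username length limit,
do not rely on the diacritic-excluded count: such limits are usually measured in
code points, UTF-16 code units, or bytes, so every diacritic counts toward them.
Use the diacritic-excluded count when you want the number of visible base characters.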
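The two he forms mentioned above look similar in some fonts, so it can help to inspect the code points directly. A small sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical and not part of the counter.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, Unicode name) for each non-space code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isspace()
    ]


# Gol he and do-chashmi he, written as escapes to avoid look-alike confusion.
for code, name in describe("\u06C1\u06BE"):
    print(code, name)
# U+06C1 ARABIC LETTER HEH GOAL
# U+06BE ARABIC LETTER HEH DOACHASHMEE
```

Running the same helper over pasted text is a quick way to spot an Arabic he (U+0647) where a gol he was intended, which would change the bucket the letter lands in.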