Arabic text can carry tashkeel: the harakat vowel and pronunciation marks such as fatha, damma, kasra, shadda, and sukun. Because these are separate Unicode combining characters, the same sentence has a larger character count when it is fully vocalised than when it is written plain. This counter lets you count both ways and also reports UTF-8 bytes, the unit that byte-limited systems such as some SMS gateways and database columns enforce.
How it works
Characters are counted with JavaScript’s Unicode-aware string handling. When you enable Exclude tashkeel, the tool removes these code points before counting:
- Harakat U+064B–U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
- Superscript alef U+0670 and the Quranic annotation marks U+06D6–U+06ED.
- Tatweel / kashida U+0640, the stretching character used only for justification.
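The stripping step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the names `TASHKEEL`, `stripTashkeel`, and `countChars` are hypothetical, and counting by code point (via string spreading) is an assumption about what "Unicode-aware" means here. The ranges in the pattern are the ones listed above.

```javascript
// Hypothetical helpers; only the code point ranges come from the list above.
const TASHKEEL = /[\u064B-\u0652\u0670\u06D6-\u06ED\u0640]/g;

// Return a copy of the text with harakat, superscript alef,
// Quranic annotation marks, and tatweel removed.
function stripTashkeel(text) {
  return text.replace(TASHKEEL, "");
}

// Count Unicode code points (assumed), not UTF-16 code units.
function countChars(text, excludeTashkeel = false) {
  const source = excludeTashkeel ? stripTashkeel(text) : text;
  return [...source].length;
}

// One assumed spelling of the vocalised word used in the Example section.
const word = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F\u064C";
console.log(countChars(word));       // 9
console.log(countChars(word, true)); // 4
```

Because `replace` returns a new string, the original input is left untouched, which matches the note that stripping only changes the count.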
Bytes are computed as the UTF-8 length using TextEncoder, so you see the
real on-the-wire size. Most Arabic letters occupy two bytes in UTF-8, so a line
of Arabic is typically about twice as many bytes as characters.
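A short sketch of the byte calculation, assuming a runtime where `TextEncoder` is available as a global (modern browsers and recent Node.js). The helper name `utf8Bytes` is illustrative only.

```javascript
// Hypothetical helper: UTF-8 length of a string in bytes.
function utf8Bytes(text) {
  return new TextEncoder().encode(text).length;
}

const plain = "\u0645\u062D\u0645\u062F"; // the four unvocalised letters of the example word
console.log([...plain].length); // 4 characters
console.log(utf8Bytes(plain));  // 8 bytes, two per Arabic letter

// Spaces and ASCII digits take one byte each, so mixed text is not exactly 2x.
console.log(utf8Bytes(plain + " 1")); // 10 bytes for 6 characters
```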
Example
The vocalised word:
مُحَمَّدٌ
contains the four base letters م ح م د plus five harakat (a damma, two fathas,
a shadda, and a dammatan). With tashkeel included it counts as nine characters;
with Exclude tashkeel enabled it counts as the four base letters only. The byte
count reflects UTF-8 encoding either way.
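To see where the difference comes from, the word can be broken into its code points. The sketch below assumes one common spelling of the word (shadda before fatha on the second meem); the text on the page may order those two marks differently, which does not change the totals. The names `HARAKAT` and `breakdown` are illustrative.

```javascript
// Assumed code point sequence for the vocalised example word.
const word = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F\u064C";

// Harakat range from the list in "How it works" (no g flag, so test() is stateless).
const HARAKAT = /[\u064B-\u0652]/;

// Label each code point and flag the ones that are harakat.
const breakdown = [...word].map((ch) => ({
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  isHaraka: HARAKAT.test(ch),
}));

const marks = breakdown.filter((entry) => entry.isHaraka).length;
const letters = breakdown.length - marks;

console.log(breakdown.map((entry) => entry.codePoint).join(" "));
// U+0645 U+064F U+062D U+064E U+0645 U+0651 U+064E U+062F U+064C
console.log(letters, marks); // 4 5
```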
Notes
- Stripping never alters the text in the box — it only changes the count.
- Use the byte count for SMS segment planning and VARCHAR/NVARCHAR limits, and the diacritic-free character count for word-processing length checks.