What is tashkeel and why exclude it?

Tashkeel (harakat) are the small vowel and pronunciation marks placed above or below Arabic letters, such as fatha, damma, kasra, shadda, and sukun. They are separate Unicode combining characters, so vocalised text has more characters than the same text unvocalised. Excluding them counts the base letters only.

Which characters count as diacritics here?

The Arabic combining marks in the range U+064B to U+0652 (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun), the superscript alef U+0670, and the Quranic annotation marks U+06D6 to U+06ED are treated as tashkeel. The tatweel (kashida, U+0640) stretching character is removed too.

Does it count bytes as well as characters?

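Yes. It reports the UTF-8 byte length, which matters for byte-limited database columns and APIs. Arabic letters take two bytes each in UTF-8, so the byte count of Arabic text is usually about double the character count. SMS uses UCS-2 rather than UTF-8, so treat the figure as a guide there, not an exact segment size.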

Does it count spaces?

Both ways. The tool shows total characters including spaces and a separate count excluding all whitespace, so you can match whichever limit you are checking against.
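As a rough illustration of the two figures, here is a minimal sketch. The function name is hypothetical, and it assumes "whitespace" means whatever the JavaScript regex class \s matches; the tool's own definition may differ.

```javascript
// Hypothetical helper, not the tool's source: count code points with and
// without whitespace. Assumption: whitespace = anything matched by \s.
function countWithAndWithoutSpaces(text) {
  const total = Array.from(text).length; // code points, not UTF-16 units
  const withoutSpaces = Array.from(text.replace(/\s/gu, "")).length;
  return { total, withoutSpaces };
}

// Five letters, a space, two letters, and a trailing newline.
const greeting = "\u0645\u0631\u062D\u0628\u0627 \u0628\u0643\n";
console.log(countWithAndWithoutSpaces(greeting)); // { total: 9, withoutSpaces: 7 }
```

Note that \s does not match invisible directional marks such as RLM, so those still count in both figures.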

Does it modify my text?

No. Stripping is applied only to compute the diacritic-free counts; the text in the box is never changed. Everything runs locally in your browser and nothing is uploaded.

Why do two identical-looking Arabic texts give different counts?

Usually Unicode normalization or invisible characters. NFD-decomposed text carries hamza as a separate combining mark while NFC text uses precomposed letters like أ, and copied text often contains invisible directional marks (RLM U+200F, LRM U+200E, ALM U+061C) that count as characters.

What is the Arabic Character Counter?

It counts Arabic text characters and UTF-8 bytes, with an option to strip tashkeel (harakat) and tatweel (kashida) before counting so diacritics and stretching don't inflate your totals. It runs free in your browser on Gera Tools, with nothing uploaded.

Arabic Character Counter

Name: Arabic Character Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Arabic text can carry tashkeel: the harakat vowel and pronunciation marks like fatha, damma, kasra, shadda, and sukun. Because these are separate Unicode combining characters, the same sentence has a larger character count when it is fully vocalised than when it is written plain. This counter lets you count both ways and also reports UTF-8 bytes, the unit that byte-limited database columns and APIs enforce and a useful guide to SMS size.

When you need the diacritic-free count

Different use cases call for different figures. Knowing when to use which avoids mistakes when checking limits:

Social media character limits (Twitter/X, LinkedIn) count by Unicode code point including combining marks, so a vocalised Arabic post is longer than it looks visually.
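SMS gateways count bytes in GSM-7 or UCS-2 encoding. Arabic letters are not in the GSM-7 default alphabet, so messages encode in UCS-2 at 2 bytes per character, and the 140-byte payload gives 140 ÷ 2 = 70 characters per single message (the figure 160 applies only to 7-bit GSM-7 text). Harakat count toward that limit. For pure Arabic text the UTF-8 byte count equals the UCS-2 size, so it can be used to plan segment breaks; in mixed text, ASCII characters take 1 byte in UTF-8 but 2 in UCS-2.
Database columns declared as VARCHAR(n) in MySQL measure n in characters, whatever the character set. Under latin1 each character is one byte, so characters and bytes coincide; under utf8mb4 a character can take up to four bytes, and Arabic letters take two. Other systems and byte-based column types measure bytes, so check what your schema uses before assuming the limit.
Editorial length targets (for articles, subtitles, or translated content) are normally measured in base characters. Use the tashkeel-free count to match the editor’s expectation.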
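The SMS arithmetic in the list above can be sketched as follows. This is an estimate under stated assumptions, not a gateway's actual logic: the function and constant names are hypothetical, the message is assumed to be sent as UCS-2 (true whenever it contains an Arabic letter), a single message is assumed to hold 140 bytes, and concatenated messages are assumed to reserve the common 6-byte header per segment.

```javascript
// Hypothetical helper: estimate SMS segments for a message sent as UCS-2.
// Assumptions: 140-byte payload per message; a 6-byte concatenation header
// per segment once the text no longer fits in a single message.
const SINGLE_UCS2_LIMIT = 70; // 140 bytes / 2 bytes per character
const CONCAT_UCS2_LIMIT = 67; // (140 - 6) bytes / 2 bytes per character

function estimateUcs2Segments(text) {
  const units = text.length; // UTF-16 code units, 2 bytes each in UCS-2
  if (units === 0) return 0;
  if (units <= SINGLE_UCS2_LIMIT) return 1;
  return Math.ceil(units / CONCAT_UCS2_LIMIT);
}

const meem = "\u0645"; // Arabic letter meem
console.log(estimateUcs2Segments(meem.repeat(70)));  // 1
console.log(estimateUcs2Segments(meem.repeat(71)));  // 2
console.log(estimateUcs2Segments(meem.repeat(135))); // 3
```

Because every haraka is its own code unit, a fully vocalised message can need more segments than the same message written plain.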

How the character and byte counts are computed

Characters are counted with JavaScript’s Unicode-aware string handling. When you enable Exclude tashkeel, the tool removes these code points before counting:

Harakat U+064B–U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
Superscript alef U+0670 and the Quranic annotation marks U+06D6–U+06ED.
Tatweel / kashida U+0640, the stretching character used only for justification.

Bytes are computed as the UTF-8 length using TextEncoder, so you see the real on-the-wire size. Most Arabic letters occupy two bytes in UTF-8, so a line of Arabic is typically about twice as many bytes as characters.
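The counting steps described above can be approximated with a few lines of JavaScript. This is a sketch, not the tool's actual source: the names are hypothetical, and computing the byte length on the stripped string when the option is enabled is an assumption.

```javascript
// Sketch only. Ranges taken from the list above: harakat U+064B-U+0652,
// superscript alef U+0670, Quranic annotation marks U+06D6-U+06ED,
// tatweel U+0640.
const TASHKEEL_AND_TATWEEL = /[\u064B-\u0652\u0670\u06D6-\u06ED\u0640]/gu;

function countArabic(text, excludeTashkeel = false) {
  // Assumption: both figures are computed on the same (possibly stripped) string.
  const counted = excludeTashkeel ? text.replace(TASHKEEL_AND_TATWEEL, "") : text;
  return {
    characters: Array.from(counted).length,          // code points
    bytes: new TextEncoder().encode(counted).length, // UTF-8 byte length
  };
}

// seen + fatha + tatweel + lam + alef + meem
const sample = "\u0633\u064E\u0640\u0644\u0627\u0645";
console.log(countArabic(sample));       // { characters: 6, bytes: 12 }
console.log(countArabic(sample, true)); // { characters: 4, bytes: 8 }
```

The input string is never modified; replace returns a new string, which matches the tool's promise that stripping only affects the count.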

Example

The vocalised word:

مُحَمَّدٌ

contains the four base letters م ح م د plus five harakat (damma, fatha, shadda with fatha, dammatan), nine code points in all. With tashkeel included the character count is 9; with Exclude tashkeel enabled it counts the 4 base letters only. The byte count reflects UTF-8 encoding either way, and each of these code points takes two bytes.
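To make the example concrete, the sketch below spells the word out with escape sequences so each code point is visible. The helper names are hypothetical, and the order of shadda and fatha on the second meem is an assumption; either order gives the same counts.

```javascript
// meem+damma, hah+fatha, meem+shadda+fatha, dal+dammatan
const word = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F\u064C";

// Hypothetical predicate covering the code points the page lists as stripped.
const isStripped = (cp) =>
  (cp >= 0x064B && cp <= 0x0652) || cp === 0x0670 ||
  (cp >= 0x06D6 && cp <= 0x06ED) || cp === 0x0640;

const codePoints = Array.from(word, (ch) => ch.codePointAt(0));
const baseLetters = codePoints.filter((cp) => !isStripped(cp));
const bare = String.fromCodePoint(...baseLetters);
const utf8Length = (s) => new TextEncoder().encode(s).length;

console.log(codePoints.length, utf8Length(word));  // 9 18
console.log(baseLetters.length, utf8Length(bare)); // 4 8
```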

Notes

Stripping never alters the text in the box — it only changes the count.
Use the byte count for byte-measured column and API limits and as a guide for SMS segment planning, and the diacritic-free character count for word-processing length checks.
The vocalised word مُحَمَّدٌ contains 4 base letters plus 5 combining marks; with tashkeel excluded it counts as 4 characters rather than 9.
Mixed Arabic-Latin text is handled correctly: ASCII Latin letters count as 1 byte each in UTF-8, while Arabic letters count as 2 bytes.

The marks that get stripped, exactly

Code point(s)	Name	Role
U+064B–U+064D	Fathatan, dammatan, kasratan	Tanwin (nunation) endings
U+064E–U+0650	Fatha, damma, kasra	Short vowels
U+0651	Shadda	Consonant gemination
U+0652	Sukun	Absence of a vowel
U+0670	Superscript alef	Dagger alif (e.g. in الله)
U+06D6–U+06ED	Quranic annotation signs	Recitation and pause marks
U+0640	Tatweel (kashida)	Justification stretching only

Everything else — letters, digits, punctuation, spaces — is left untouched by the tashkeel filter.

Counting subtleties worth knowing

Lam-alef counts as two. The ligature لا is rendered as a single glyph, but in normal Unicode text it is encoded as two code points (lam + alef), so it counts as two characters. Only the legacy presentation-form code points encode ligatures as single characters: Arabic Presentation Forms-B (U+FE70–U+FEFF, e.g. U+FEFB for lam-alef) and the ligatures in Presentation Forms-A (U+FB50–U+FDFF). Modern keyboards and websites rarely produce those. If you paste text containing presentation forms, they are counted as the single code points they are.
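The difference is easy to check in a browser console. The snippet below is an illustration only; the NFKC step is an optional clean-up you could apply yourself, and nothing here implies the tool normalizes its input.

```javascript
const lamAlef = "\u0644\u0627";    // lam + alef, rendered as one glyph
const presentationForm = "\uFEFB"; // lam-alef isolated presentation form

const countCodePoints = (s) => Array.from(s).length;

console.log(countCodePoints(lamAlef));          // 2
console.log(countCodePoints(presentationForm)); // 1

// NFKC compatibility normalization expands the presentation form
// back to the ordinary two-letter sequence.
console.log(presentationForm.normalize("NFKC") === lamAlef); // true
```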

Hamza carriers are single letters. أ, إ, آ and ؤ/ئ are precomposed single code points in normal (NFC) text, so each counts as one character. Text that has been decomposed (NFD) instead carries the hamza as a separate combining mark (U+0654/U+0655), or the maddah of آ as U+0653, which changes the count. These marks fall outside the stripped ranges, so Exclude tashkeel does not remove them. If two visually identical strings give different totals, normalization is almost always the reason.
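A quick way to see the effect is JavaScript's built-in String.prototype.normalize. This is a small sketch; whether the tool normalizes input before counting is not stated here, so normalizing first is something you would do yourself.

```javascript
const alefHamzaAbove = "\u0623"; // precomposed alef with hamza above
const decomposed = alefHamzaAbove.normalize("NFD"); // alef U+0627 + hamza above U+0654

const countCodePoints = (s) => Array.from(s).length;

console.log(countCodePoints(alefHamzaAbove));              // 1
console.log(countCodePoints(decomposed));                  // 2
console.log(countCodePoints(decomposed.normalize("NFC"))); // 1
```

Normalizing both strings to NFC before comparing counts removes this source of mismatch.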

Arabic-Indic digits are ordinary characters. ٠١٢٣ (U+0660–U+0669) and the Eastern variants (U+06F0–U+06F9) count one character each and are never stripped — they are digits, not diacritics.

Right-to-left control characters count. Invisible directional marks such as RLM (U+200F), LRM (U+200E), and ALM (U+061C) are real code points and count as characters. They frequently sneak in when copying from word processors or chat apps, which is another way two “identical” texts can differ in length.
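If you suspect stray directional marks, a short scan like the one below can locate them. It is a sketch with a hypothetical function name, and it covers only the three marks named above.

```javascript
// The three invisible marks named above: LRM, RLM, ALM.
const DIRECTIONAL_MARKS = /[\u200E\u200F\u061C]/gu;

function findDirectionalMarks(text) {
  return Array.from(text.matchAll(DIRECTIONAL_MARKS), (m) => ({
    index: m.index, // offset in UTF-16 code units
    codePoint: "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

// RLM + four letters + LRM, as might arrive from a chat app.
const pasted = "\u200F\u0633\u0644\u0627\u0645\u200E";
console.log(findDirectionalMarks(pasted));
// [ { index: 0, codePoint: 'U+200F' }, { index: 5, codePoint: 'U+200E' } ]
console.log(Array.from(pasted).length); // 6
```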

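The byte multiplier is not always 2×. Arabic letters and harakat take two bytes each in UTF-8, but any ASCII letters, digits, punctuation, and spaces in the same text take one byte, so mixed Arabic and ASCII text lands between 1× and 2× the character count. A few characters push the other way: directional marks such as RLM (U+200F) and the presentation forms (U+FE70–U+FEFF) take three bytes each.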
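The following sketch shows how the ratio shifts once ASCII is mixed in. The function name is hypothetical; it relies only on TextEncoder, which the page already names.

```javascript
// Hypothetical helper: code point count and UTF-8 byte count side by side.
function byteProfile(text) {
  return {
    characters: Array.from(text).length,
    bytes: new TextEncoder().encode(text).length,
  };
}

console.log(byteProfile("\u0633\u0644\u0627\u0645"));      // { characters: 4, bytes: 8 }
console.log(byteProfile("\u0633\u0644\u0627\u0645 2024")); // { characters: 9, bytes: 13 }
console.log(byteProfile("2024"));                          // { characters: 4, bytes: 4 }
```

Four Arabic letters contribute 8 bytes, while the space and four ASCII digits contribute 5, giving 13 bytes for 9 characters.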

Sources and references

The code-point ranges used to identify diacritics and the encoding sizes are defined by the following standards:

Unicode Arabic block (U+0600–U+06FF) chart — the fatha/damma/kasra/shadda/sukun marks, superscript alef, and Quranic annotation marks stripped in tashkeel-free mode
3GPP TS 23.038 — SMS character sets (GSM 7-bit and UCS-2) — why Arabic SMS encodes in UCS-2 at ~70 characters per segment
MySQL character set support — how VARCHAR limits are measured under latin1 vs utf8mb4

Maintained by the Gera Tools editorial team. Counting runs entirely in your browser; the diacritic set and UTF-8 byte sizing follow the Unicode standard above. Last reviewed 2026-07-02.