Counting words in Arabic has a few wrinkles that a naive space split gets wrong.
Arabic text often contains tatweel (kashida) stretching characters inserted
purely for justification, and tashkeel vowel marks attached to letters.
Neither should affect the word count. Arabic also writes short particles such as
و (and), ف (so), ب (with), and ل (for) joined directly to the next word,
so each particle counts as part of that word, not as a separate one. This counter
handles all of that and reports word, character, and sentence totals.
How it works
The algorithm normalises the text, then splits it:
- Tatweel U+0640 and all tashkeel marks (U+064B–U+0652, U+0670, U+06D6–U+06ED) are removed first, so stretching and vowel marks never change the count.
- A word is then a maximal run of [\p{L}\p{N}] (Arabic or Latin letters and digits) with an internal hyphen or apostrophe allowed between two such characters.
- Prefixed particles like wa/fa/bi/la are written with no space before the stem, so they are naturally counted inside the same word rather than as a boundary.
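The normalise-then-split steps above can be sketched in Python. This is a minimal illustration, not the counter's actual code: the names `STRIP_RE`, `WORD_RE`, `normalise`, and `count_words` are hypothetical, and because Python's standard `re` module does not support `\p{L}` or `\p{N}`, the sketch uses `[^\W_]` (Unicode letters and digits) as a stand-in for `[\p{L}\p{N}]`.

```python
import re

# Tatweel (U+0640) plus the tashkeel ranges listed above.
STRIP_RE = re.compile("[\u0640\u064B-\u0652\u0670\u06D6-\u06ED]")

# Assumption: [^\W_] (any Unicode letter or digit) stands in for [\p{L}\p{N}].
# A single hyphen or apostrophe is allowed only between two such runs.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def normalise(text: str) -> str:
    """Remove tatweel and tashkeel so they cannot affect the count."""
    return STRIP_RE.sub("", text)


def count_words(text: str) -> int:
    """Count maximal letter/digit runs in the normalised text."""
    return len(WORD_RE.findall(normalise(text)))


print(count_words("ذهب الطالب إلى المدرسة… وكتب الدرس؟"))  # 6
```

Stripping has to happen before matching in this sketch: tashkeel marks are combining characters, not letters, so a vocalised word left unnormalised would be split into several matches. Prefixed particles need no special handling, because no space separates them from the stem.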
Characters are counted two ways: every character including spaces, and the
length with whitespace removed. Sentences are counted by collapsing each run of
terminal punctuation (the Arabic question mark ؟, the full stop ., the
exclamation mark !, the Latin question mark ?, and the ellipsis …)
into a single boundary.
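The character and sentence counts described here could look roughly like the following Python sketch. The function names are hypothetical. Two details are assumptions the text does not settle: the character counts are taken on the text as passed in, and trailing text with no terminal punctuation is not counted as a sentence.

```python
import re

# Runs of terminal punctuation: . ! ? plus the Arabic question mark (U+061F)
# and the ellipsis (U+2026). A whole run collapses into one boundary.
TERMINATORS_RE = re.compile("[.!?\u061F\u2026]+")


def char_counts(text: str) -> tuple[int, int]:
    """Return (characters including spaces, characters with whitespace removed)."""
    without_whitespace = re.sub(r"\s+", "", text)
    return len(text), len(without_whitespace)


def count_sentences(text: str) -> int:
    """Count sentence boundaries; assumption: unterminated trailing text is ignored."""
    return len(TERMINATORS_RE.findall(text))


print(count_sentences("ذهب الطالب إلى المدرسة… وكتب الدرس؟"))  # 2
```

Because a run such as ؟! matches as a single unit, stacked punctuation yields one boundary, not two.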
Example
The text:
ذهب الطالب إلى المدرسة… وكتب الدرس؟
contains the words ذهب, الطالب, إلى, المدرسة, وكتب, الدرس — six
words. The joined particle in وكتب (wa + kataba) is one word, and any tatweel
or tashkeel would have been removed before counting.
Notes
- Because tashkeel is stripped first, the vocalised and unvocalised versions of the same sentence return the same word count.
- Mixed Arabic-English text and Latin product names are counted sensibly because Latin letters are also word characters.