Arabic Word Counter

Accurate word count for right-to-left Arabic prose

Count words in right-to-left Arabic text. Strips tatweel (kashida) stretching and tashkeel before splitting, keeps prefixed particles like wa/fa/bi/la attached to their word, and shows character and sentence counts. Runs in your browser.

How does it decide what counts as a word?

After removing tatweel and tashkeel, a word is a run of Arabic or Latin letters and digits separated by whitespace or punctuation. Arabic prefixed particles such as wa (و), fa (ف), bi (ب), and la (ل) are written joined to the following word with no space, so they stay part of that single word.

Counting words in Arabic has a few wrinkles that a naive space split gets wrong. Arabic text often contains tatweel (kashida) stretching characters inserted purely for justification, and tashkeel vowel marks attached to letters. Neither should affect the word count. Arabic also writes short particles such as و (and), ف (so), ب (with), and ل (for) joined directly to the next word, so they belong to that single word. This counter handles all of that and reports accurate word, character, and sentence totals.

How it works

The algorithm normalises the text, then splits it:

  • Tatweel U+0640 and all tashkeel marks (U+064B to U+0652, U+0670, U+06D6 to U+06ED) are removed first, so stretching and vowel marks never change the count.
  • A word is then a maximal run of [\p{L}\p{N}] (Unicode letters and digits, which covers both Arabic and Latin) with an internal hyphen or apostrophe allowed between two such characters.
  • Prefixed particles like wa/fa/bi/la are written with no space before the stem, so they are naturally counted inside the same word rather than as a boundary.
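
The steps above can be sketched in a few lines of Python. This is a rough illustration, not the tool's actual code: the names STRIP_RE, WORD_RE, normalise, and words are hypothetical, only the ASCII hyphen and apostrophe are treated as internal joiners, and because Python's re module has no \p{L} or \p{N}, the class [^\W_] is used as a close stand-in.

```python
import re

# Tatweel (U+0640) plus the tashkeel ranges listed above.
STRIP_RE = re.compile("[\u0640\u064B-\u0652\u0670\u06D6-\u06ED]")

# Assumption: [^\W_] (a word character other than underscore) stands in for
# [\p{L}\p{N}]. A hyphen or apostrophe is accepted only between two runs.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def normalise(text: str) -> str:
    """Remove tatweel and tashkeel so they cannot affect the split."""
    return STRIP_RE.sub("", text)


def words(text: str) -> list[str]:
    """Return the words found in the normalised text."""
    return WORD_RE.findall(normalise(text))


sample = "ذهب الطالب إلى المدرسة… وكتب الدرس؟"
print(len(words(sample)))  # 6
```

No special handling is needed for prefixed particles: since وكتب contains no space or punctuation, the pattern matches it as a single run.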

Characters are counted two ways: every character including spaces, and the length with whitespace removed. Sentences are counted by collapsing each run of terminal punctuation (the Arabic question mark ؟, plus ., !, and ?) into a single boundary.
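
A minimal sketch of both counts follows, under several assumptions that the description does not settle: characters are counted on the text as typed (before any stripping), len() counts Unicode code points rather than visible glyphs, the terminator set is limited to the four marks ؟ . ! ?, and trailing text with no terminator still counts as a sentence. The names char_counts and sentence_count are hypothetical.

```python
import re

# Assumption: only these four marks end a sentence. A run such as "?!" or
# "..." is matched as one unit, so it yields a single boundary.
SENTENCE_END_RE = re.compile("[\u061F.!?]+")


def char_counts(text: str) -> tuple[int, int]:
    """Return (all characters, characters with whitespace removed)."""
    total = len(text)
    without_whitespace = len(re.sub(r"\s+", "", text))
    return total, without_whitespace


def sentence_count(text: str) -> int:
    """Count the non-empty segments between runs of terminal punctuation."""
    segments = SENTENCE_END_RE.split(text)
    return sum(1 for segment in segments if segment.strip())


print(char_counts("ab cd\n"))            # (6, 4)
print(sentence_count("Hello?! World."))  # 2
```

A simple split like this also treats the dot in a decimal number as a boundary, so a production counter would likely need extra rules for such cases.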

Example

The text:

ذهب الطالب إلى المدرسة… وكتب الدرس؟

contains the words ذهب, الطالب, إلى, المدرسة, وكتب, الدرس, for a total of six. The joined form وكتب (wa + kataba) counts as one word, and any tatweel or tashkeel would have been removed before counting.

Notes

  • Because tashkeel is stripped first, the vocalised and unvocalised versions of the same sentence return the same word count.
  • Mixed Arabic-English text and Latin product names are counted sensibly because Latin letters are also word characters.
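
Both notes can be checked with a short, self-contained sketch. As before, this is an illustration under assumptions rather than the tool's own code: count_words and its strip_marks switch are hypothetical, and [^\W_] approximates [\p{L}\p{N}] because Python's re module lacks \p{...} classes. Combining marks are written as \u escapes so they stay visible in the source.

```python
import re

STRIP_RE = re.compile("[\u0640\u064B-\u0652\u0670\u06D6-\u06ED]")
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def count_words(text: str, strip_marks: bool = True) -> int:
    """Count words, optionally skipping the tatweel/tashkeel removal step."""
    if strip_marks:
        text = STRIP_RE.sub("", text)
    return len(WORD_RE.findall(text))


# The same two words, without and with vowel marks (fatha, shadda, sukun).
plain = "\u0643\u062A\u0628 \u0627\u0644\u062F\u0631\u0633"
vocalised = (
    "\u0643\u064E\u062A\u064E\u0628\u064E "
    "\u0627\u0644\u062F\u0651\u064E\u0631\u0652\u0633\u064E"
)

print(count_words(plain))      # 2
print(count_words(vocalised))  # 2

# Without stripping, the marks are not letters, so they break each word
# into several fragments and inflate the count.
print(count_words(vocalised, strip_marks=False) > 2)  # True

# Latin letters and digits are word characters too.
print(count_words("اشتريت iPhone 15 من المتجر"))  # 5
```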