Arabic frequency analysis is complicated by optional vowel marks and by several interchangeable letter forms. This tool counts how often each word appears, optionally normalising those variants so the same word is tallied together, and can cluster inflected forms by an approximate three-letter root.
How it works
Text is split into tokens on whitespace and on Arabic and Latin punctuation. When normalisation is enabled, the tool applies two passes before counting:
diacritics → remove harakat U+064B–U+0652 (this range includes tanwin and
shadda), superscript alef U+0670, and tatweel U+0640
letters → أ إ آ ٱ → ا ; ى → ي ; ة → ه ; ؤ ئ → ء base
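The two passes and the counting step can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the names normalise, tokenise, and count_words, the normalise_text flag, and the punctuation set are all assumptions, and folding ؤ and ئ to a bare hamza is only one reading of the last mapping.

```python
import re
from collections import Counter

# Pass 1: harakat U+064B-U+0652 (includes tanwin and shadda),
# superscript alef U+0670, and tatweel U+0640.
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Pass 2: letter folding. Mapping the hamza carriers to a bare hamza
# is an assumption about what the mapping line intends.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # أ → ا
    "\u0625": "\u0627",  # إ → ا
    "\u0622": "\u0627",  # آ → ا
    "\u0671": "\u0627",  # ٱ → ا
    "\u0649": "\u064A",  # ى → ي
    "\u0629": "\u0647",  # ة → ه
    "\u0624": "\u0621",  # ؤ → ء
    "\u0626": "\u0621",  # ئ → ء
})

# Illustrative punctuation set (hypothetical, not exhaustive):
# common Latin marks plus Arabic comma, semicolon, question mark, full stop.
PUNCT = ".,;:!?()[]{}\"'«»" + "\u060C\u061B\u061F\u06D4"
TOKEN_SPLIT = re.compile("[\\s" + re.escape(PUNCT) + "]+")


def normalise(text: str) -> str:
    """Strip diacritics and tatweel, then fold letter variants."""
    return DIACRITICS.sub("", text).translate(LETTER_MAP)


def tokenise(text: str) -> list[str]:
    """Split on whitespace and punctuation, dropping empty pieces."""
    return [tok for tok in TOKEN_SPLIT.split(text) if tok]


def count_words(text: str, normalise_text: bool = True) -> Counter:
    """Count tokens, optionally normalising the text first."""
    if normalise_text:
        text = normalise(text)
    return Counter(tokenise(text))
```

With normalise_text=True, a vocalised spelling such as كَتَبَ and the bare كتب are tallied under one key; with normalise_text=False they remain separate entries.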
For root grouping, it strips frequent clitics (the definite article, prepositional and conjunction prefixes, and plural or possessive suffixes) and reduces what remains toward three consonants. The result is only an approximation of the Arabic triliteral root.
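A rough sketch of this kind of root approximation is shown below. The prefix and suffix lists, the minimum-length guard, and the rule of dropping the weak letters ا و ي until three letters remain are all assumptions chosen for illustration; the tool's own lists and heuristics may differ. Single-letter prepositions are only stripped here when attached to the article, so that a root-initial ك or ب is not removed by mistake.

```python
from collections import defaultdict

# Hypothetical clitic lists, longest forms first.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال", "و")
SUFFIXES = ("ون", "ين", "ات", "ها", "هم", "نا", "ه", "ي")
WEAK_LETTERS = "اوي"  # long vowels / weak consonants dropped during reduction


def approx_root(word: str) -> str:
    """Strip one prefix and one suffix, then drop weak letters toward 3 letters."""
    for prefix in PREFIXES:
        # Only strip if at least three letters would remain.
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    letters = list(word)
    i = 0
    # Remove weak letters left to right until three letters remain
    # or no weak letters are left.
    while len(letters) > 3 and i < len(letters):
        if letters[i] in WEAK_LETTERS:
            del letters[i]
        else:
            i += 1
    return "".join(letters)


def group_by_root(words: list[str]) -> dict[str, list[str]]:
    """Cluster surface forms under their approximate root."""
    groups: dict[str, list[str]] = defaultdict(list)
    for word in words:
        groups[approx_root(word)].append(word)
    return dict(groups)
```

Calling group_by_root(["كتب", "الكتاب", "كاتب"]) places all three forms under the single key كتب. Words with four strong consonants, or with a weak letter that belongs to the root, will be grouped imperfectly under this heuristic.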
Example and tips
In a sentence repeating the verb كتب (“he wrote”) alongside الكتاب (“the
book”) and كاتب (“writer”), surface-form counting keeps them separate, while
root grouping clusters them under an approximate كتب root, revealing that the
passage is dominated by the writing theme. Turn normalisation on for messy or
mixed-source text; turn it off when you need exact spelling-by-spelling counts.