How are Arabic words counted?

The tool splits the text on spaces, line breaks, and Arabic and Latin punctuation, then tallies identical surface forms. With normalisation on, it first strips short-vowel diacritics and unifies letter variants so that the same word written slightly differently is counted together.

What does diacritic normalisation do?

Arabic short vowels and other tashkeel marks are optional in writing. Normalisation removes harakat (fatha, kasra, damma, sukun, shadda, tanwin) and the tatweel elongation, so a word counts the same whether or not it carries vowel marks.

What does letter normalisation change?

It unifies common interchangeable forms: the hamza, madda, and wasla forms of alef (أ إ آ ٱ) become a plain alef, the alef maqsura becomes yaa, the taa marbuta becomes haa, and the hamza carried on waw or yaa (ؤ ئ) is unified as well. This merges spelling variants that readers treat as the same word.

How accurate is the root grouping?

Arabic roots normally need a morphological analyser. This tool uses a lightweight approximation: it strips common prefixes and suffixes and reduces a word toward its likely three strong consonants. It is a useful clustering heuristic, not a substitute for a full morphology engine.

Is my text sent anywhere?

No. All tokenising, normalising, and counting happen in your browser. The text is never uploaded.

What is the Arabic Word Frequency Counter?

The Arabic Word Frequency Counter tallies word frequencies in Arabic text, optionally normalising diacritics and letter variants, and can group inflected forms by an approximate three-letter root. It ranks words by count and runs free in your browser on Gera Tools, with nothing uploaded.

Arabic Word Frequency Counter

Name: Arabic Word Frequency Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Arabic frequency analysis is complicated by optional vowel marks and by several interchangeable letter forms. This tool counts how often each word appears, optionally normalising those variants so the same word is tallied together, and can cluster inflected forms by an approximate three-letter root.

How it works

Tokens are split on whitespace and on Arabic and Latin punctuation. When normalisation is enabled the tool applies two passes before counting:

diacritics → remove harakat U+064B–U+0652 (this range includes tanwin,
             shadda, and sukun), superscript alef U+0670, and tatweel U+0640
letters    → أ إ آ ٱ → ا   ;   ى → ي   ;   ة → ه   ;   ؤ ئ → ء base
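The two passes and the counting step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual source: the function names, the exact punctuation set used for splitting, and the reading of the last letter rule as "ؤ and ئ become a bare ء" are all assumptions.

```python
import re
from collections import Counter

# Diacritics pass: harakat U+064B-U+0652 (the range already covers tanwin,
# shadda, and sukun), superscript alef U+0670, and tatweel U+0640.
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Letter pass, following the mapping listed above. Mapping the waw and yaa
# hamza carriers to a bare hamza is one reading of that rule (assumption).
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0671": "\u0627",  # alef wasla -> alef
    "\u0649": "\u064A",  # alef maqsura -> yaa
    "\u0629": "\u0647",  # taa marbuta -> haa
    "\u0624": "\u0621",  # waw with hamza -> hamza
    "\u0626": "\u0621",  # yaa with hamza -> hamza
})

# Assumed delimiter set: whitespace, Arabic comma, semicolon, question mark
# and full stop, plus common Latin punctuation.
SPLIT = re.compile(r"[\s\u060C\u061B\u061F\u06D4.,;:!?\"'()\[\]{}«»-]+")


def normalise(token: str, letters: bool = True) -> str:
    """Strip diacritics, then optionally unify letter variants."""
    token = DIACRITICS.sub("", token)
    return token.translate(LETTER_MAP) if letters else token


def count_words(text: str, normalise_tokens: bool = True) -> Counter:
    """Split on whitespace and punctuation, then tally the tokens."""
    tokens = [t for t in SPLIT.split(text) if t]
    if normalise_tokens:
        tokens = [normalise(t) for t in tokens]
    # A token made only of diacritics becomes empty and is dropped.
    return Counter(t for t in tokens if t)
```

Calling `count_words(text).most_common()` gives the ranked list. Passing `normalise_tokens=False` counts raw surface forms instead.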

For root grouping it strips frequent clitics (the definite article, prepositional and conjunction prefixes, and plural or possessive suffixes) and reduces what remains toward three consonants — an approximation of the Arabic triliteral root.
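A rough sketch of this kind of root approximation is shown below. The clitic lists, the length thresholds, and the names `approx_root` and `group_by_root` are hypothetical choices made for illustration; the tool's own lists and rules may differ. The input is assumed to have had its diacritics stripped already.

```python
from collections import Counter, defaultdict

# Hypothetical clitic lists, longest first so that compound prefixes win.
PREFIXES = ["وال", "بال", "كال", "فال", "لل", "ال", "و", "ف", "ب", "ك", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "هم", "نا", "ه", "ي", "ك"]
WEAK_LETTERS = set("اوي")  # alef, waw, yaa: often pattern letters, not root letters


def approx_root(word: str) -> str:
    """Reduce a diacritic-free word toward three consonants (heuristic)."""
    for p in PREFIXES:
        # Single-letter prefixes need a longer remainder, so that a root
        # letter such as the kaf in كاتب is not mistaken for a clitic.
        min_rest = 3 if len(p) > 1 else 4
        if word.startswith(p) and len(word) - len(p) >= min_rest:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    # Drop weak letters from the right while more than three letters remain.
    chars = list(word)
    for i in range(len(chars) - 1, -1, -1):
        if len(chars) <= 3:
            break
        if chars[i] in WEAK_LETTERS:
            del chars[i]
    return "".join(chars)


def group_by_root(words):
    """Map each approximate root to a Counter of its surface forms."""
    groups = defaultdict(Counter)
    for w in words:
        groups[approx_root(w)][w] += 1
    return groups
```

For each root, `len(groups[root])` is the number of distinct surface forms and `sum(groups[root].values())` is the total frequency. Because the stripping is purely string-based, words with weak root letters or with clitic-like root letters will sometimes be grouped wrongly.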

Worked example

Consider a short passage containing the three words: كَتَبَ (he wrote), الكِتَاب (the book), and كَاتِب (a writer). Without normalisation these count as three distinct types with frequency 1 each.

With diacritic normalisation: the three words strip to كتب, الكتاب, and كاتب. They are still three types, but a vowelled and an unvowelled spelling of the same word would now be counted together.

With root grouping: the tool strips the definite article from الكتاب (giving كتاب), then drops the alef that belongs to the word pattern in كتاب and كاتب, leaving the approximate root كتب. All three forms cluster under that root, which shows that the passage centres on writing.

When to use each normalisation mode

No normalisation (surface forms): Use when you need to count exact spellings — for proofreading consistency (does the author spell a word differently across chapters?), for character-level language modelling data, or when the distinction between vowelled and unvowelled forms matters to your task.

Diacritic normalisation only: The most common choice for general corpus analysis. Merges the same word whether or not it was typed with harakat, which is the biggest source of spurious duplicates in real-world Arabic text. Does not merge words that differ in their letters, such as كتاب and كاتب. Words that differ only in their vowel marks, such as كَتَبَ (he wrote) and كُتُب (books), do become indistinguishable.

Diacritic + letter normalisation: Adds the unification of alef variants, alef maqsura, taa marbuta, and hamza-bearing letters. Use when your source text is messy or comes from multiple editors or systems with different spelling and encoding conventions.

Root grouping: Use when you want a thematic summary of the text rather than a word-for-word tally. Useful for identifying the dominant topic of a passage, comparing the thematic focus of two texts, or generating a rough keyword cloud. Not suitable for precise linguistic analysis because the root approximation introduces errors.

Interpreting the ranked list

The output ranks words or roots by descending frequency. A few things to watch for:

Function words dominate. As in every language, the most frequent forms in Arabic are particles, prepositions, and pronouns (في “in”, من “from”, هذا “this”, على “on”). If you are doing content analysis, consider filtering out the 50–100 most common Arabic words (a stop-word list) to reveal the content-bearing words.
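Filtering can be done on the ranked list after counting. The sketch below assumes the counts are held in a Python `Counter`; the four-word stop list only repeats the examples above and stands in for a fuller list, and the name `content_words` is hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word set; a practical list would hold 50-100 entries.
STOP_WORDS = {"في", "من", "هذا", "على"}


def content_words(counts: Counter, stop_words=STOP_WORDS):
    """Return (word, count) pairs in descending order, minus stop words."""
    return [(w, n) for w, n in counts.most_common() if w not in stop_words]
```

If normalisation is enabled, the stop-word list should be normalised in the same way so that its entries match the counted forms.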

Low-frequency words reveal vocabulary range. Words that appear only once (hapax legomena) indicate domain-specific vocabulary or unique names. Their share of the total is a rough indicator of lexical richness.
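The hapax share can be computed from the same counts. Because "share of the total" can mean a share of distinct words or a share of all running words, this sketch returns both; the function name and the return shape are assumptions made for illustration.

```python
from collections import Counter


def hapax_share(counts: Counter):
    """Return (share of distinct words, share of all tokens) that occur once."""
    if not counts:
        return 0.0, 0.0
    hapaxes = sum(1 for n in counts.values() if n == 1)
    types = len(counts)            # number of distinct words
    tokens = sum(counts.values())  # number of running words
    return hapaxes / types, hapaxes / tokens
```

Both figures depend strongly on text length, so they are best compared between passages of similar size.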

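Root-group sizes reveal topic focus. A root that groups ten surface forms is likely to be more central to the text than one that groups two, even if the raw frequency counts are similar. Because the grouping is approximate, check large groups for words that were merged by mistake.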