Word Frequency List Generator

Generate a Zipf-distributed word frequency list of any size.


Generate realistic frequency data

This tool builds a synthetic word frequency list that follows Zipf’s law, the empirical rule that word frequency is roughly inversely proportional to rank. It is handy for stress-testing search ranking, building NLP preprocessing fixtures, and demoing text-analysis tools when you do not want to ship a real corpus.

How it works

For a vocabulary of size N, each word at rank r (from 1 to N) is given an unnormalized weight:

weight(r) = 1 / r^s

where s is the Zipf exponent. Each weight is divided by the sum of all weights, multiplied by the chosen total count, and rounded to an integer:

frequency(r) = round( totalCount * weight(r) / sum_of_weights )
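The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source; the function name `zipf_frequencies` and its parameter names are assumptions chosen for readability.

```python
def zipf_frequencies(vocab_size: int, exponent: float, total_count: int) -> list[int]:
    """Return integer frequencies for ranks 1..vocab_size (hypothetical helper)."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    # weight(r) = 1 / r^s for every rank r
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    sum_of_weights = sum(weights)
    # frequency(r) = round(totalCount * weight(r) / sum_of_weights)
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [round(total_count * w / sum_of_weights) for w in weights]


if __name__ == "__main__":
    # Three words, exponent 1.0, 1100 tokens in total.
    print(zipf_frequencies(3, 1.0, 1100))  # [600, 300, 200]
```

With exponent 1.0 the weights are 1, 1/2, and 1/3, so the three words receive 6/11, 3/11, and 2/11 of the total.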

Words themselves are generated as short pronounceable tokens, so the tool works for any size without a dictionary. The output gives you rank, word, frequency, and each word’s percentage share of the corpus.
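The page does not say how the tokens are built, so the sketch below is only one plausible approach: it assumes each word is a sequence of consonant-vowel syllables derived from the rank, which keeps tokens pronounceable and unique for any vocabulary size. The names `token_for_rank` and `build_rows` are hypothetical, and computing the percentage against the sum of the rounded frequencies is likewise an assumption.

```python
CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"
# 14 consonants x 5 vowels = 70 two-letter syllables: "ba", "be", ..., "zu"
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]


def token_for_rank(rank: int) -> str:
    """Map a 1-based rank to a unique token made of consonant-vowel syllables."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    index = rank - 1
    parts = []
    while True:
        index, digit = divmod(index, len(SYLLABLES))
        parts.append(SYLLABLES[digit])
        if index == 0:
            break
        index -= 1  # bijective numbering: every syllable sequence is used exactly once
    return "".join(reversed(parts))


def build_rows(frequencies: list[int]) -> list[tuple[int, str, int, float]]:
    """Return (rank, word, frequency, percent) rows for a list of frequencies."""
    total = sum(frequencies)
    return [
        (rank, token_for_rank(rank), freq, 100.0 * freq / total if total else 0.0)
        for rank, freq in enumerate(frequencies, start=1)
    ]


if __name__ == "__main__":
    for row in build_rows([600, 300, 200]):
        print(row)
```

Ranks 1 through 70 get one-syllable tokens, ranks 71 through 4970 get two syllables, and so on, so no dictionary is required.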

Tips and notes

  • An exponent near 1.0 mimics natural English; raise it to concentrate more of the total in the top-ranked words, or lower it for a flatter distribution.
  • Larger vocabularies produce a long, thin tail, which is the shape real text exhibits.
  • The percentages sum to roughly 100 percent rather than exactly 100; each frequency is rounded independently, so small discrepancies are expected, especially at large N.
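
The effects described in these notes can be checked numerically. The sketch below is illustrative only: `head_share` and `rounding_drift` are hypothetical helper names, and the choice of the top 10 ranks as the "head" is an arbitrary assumption.

```python
def zipf_counts(vocab_size: int, exponent: float, total_count: int) -> list[int]:
    """Rounded Zipf frequencies, following the formulas on this page."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    norm = sum(weights)
    return [round(total_count * w / norm) for w in weights]


def head_share(vocab_size: int, exponent: float, top: int = 10) -> float:
    """Fraction of the total weight held by the `top` highest-ranked words."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[:top]) / sum(weights)


def rounding_drift(vocab_size: int, exponent: float, total_count: int) -> int:
    """Difference between the sum of rounded frequencies and the requested total."""
    return sum(zipf_counts(vocab_size, exponent, total_count)) - total_count


if __name__ == "__main__":
    for s in (0.8, 1.0, 1.2):
        print(f"s={s}: top-10 share = {head_share(1000, s):.3f}")
    print("drift:", rounding_drift(100_000, 1.0, 10_000))
    # With many more words than tokens, the far tail rounds down to zero.
    print("last count:", zipf_counts(100_000, 1.0, 10_000)[-1])
```

A higher exponent gives the top ranks a larger share. When the vocabulary is much larger than the total count, many tail words round to zero, which is one source of the discrepancy in the totals.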