Generate realistic frequency data
This tool builds a synthetic word frequency list that follows Zipf’s law, the empirical rule that word frequency is roughly inversely proportional to rank. It is handy for stress-testing search ranking, building NLP preprocessing fixtures, and demoing text-analysis tools when you do not want to ship a real corpus.
How it works
For a vocabulary of size N, each word at rank r (from 1 to N) is given an unnormalized weight:
weight(r) = 1 / r^s
where s is the Zipf exponent. The weights are summed, and each weight is scaled so that the frequencies add up to a chosen total count. Each scaled value is then rounded to an integer:
frequency(r) = round( totalCount * weight(r) / sum_of_weights )
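As a minimal sketch of the two formulas above (not the tool's actual source), the Python below computes the rounded frequencies for a small vocabulary. The function name zipf_frequencies and its parameter names are assumptions chosen for illustration. Python's built-in round() rounds exact halves to the nearest even integer, so a different implementation may differ by one on values that land exactly on .5.

```python
def zipf_frequencies(n, s=1.0, total_count=1_000_000):
    """Return integer frequencies for ranks 1..n using weight(r) = 1 / r**s.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Unnormalized weight for each rank r = 1..n.
    weights = [1.0 / (r ** s) for r in range(1, n + 1)]
    sum_of_weights = sum(weights)
    # Scale each weight to the requested total, then round to an integer.
    return [round(total_count * w / sum_of_weights) for w in weights]


freqs = zipf_frequencies(5, s=1.0, total_count=1000)
print(freqs)  # [438, 219, 146, 109, 88]
```

In this small case the rounded values happen to sum to exactly 1000. That is not guaranteed in general, because each frequency is rounded independently.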
The words themselves are generated as short pronounceable tokens, so the tool works for any vocabulary size without needing a dictionary. Each output row lists the rank, the word, its frequency, and its percentage share of the corpus.
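The text does not say how the tokens are built, so the following is only one possible approach, sketched in Python: each rank is mapped to a unique string of consonant-vowel syllables. The letter sets, the names token_for_rank and build_rows, and the choice to compute each share against the sum of the rounded frequencies are all assumptions made for this example.

```python
from itertools import product

# Assumed letter sets; 14 consonants x 5 vowels = 70 syllables.
CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"
SYLLABLES = [c + v for c, v in product(CONSONANTS, VOWELS)]


def token_for_rank(rank):
    """Map rank 1, 2, 3, ... to a unique token made of CV syllables."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    index = rank - 1
    base = len(SYLLABLES)
    parts = []
    # Bijective base-70 numbering: one syllable for the first 70 ranks,
    # two syllables for the next 70 * 70 ranks, and so on.
    while True:
        index, digit = divmod(index, base)
        parts.append(SYLLABLES[digit])
        if index == 0:
            break
        index -= 1
    return "".join(reversed(parts))


def build_rows(frequencies):
    """Return (rank, word, frequency, percent) tuples for a frequency list."""
    total = sum(frequencies)
    return [
        (rank, token_for_rank(rank), freq, 100.0 * freq / total if total else 0.0)
        for rank, freq in enumerate(frequencies, start=1)
    ]


rows = build_rows([438, 219, 146, 109, 88])
for rank, word, freq, percent in rows:
    print(f"{rank}\t{word}\t{freq}\t{percent:.2f}%")
```

Because every syllable is exactly two characters long, no two ranks can produce the same token, and no word list has to be shipped with the code.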
Tips and notes
- An exponent near 1.0 approximates natural English; raise it to concentrate more of the total count in the top-ranked words, or lower it to spread frequency more evenly across the vocabulary.
- Larger vocabularies produce a long, thin tail, similar to the shape seen in real text.
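- The percentages sum to approximately 100 percent. Because each frequency is rounded to an integer, small deviations are expected, especially at large N, and low-ranked words can round down to a frequency of zero when the total count is small relative to N.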
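To see these effects in numbers, the short Python sketch below compares three exponents at a fixed vocabulary size and total count. It re-applies the weight and frequency formulas from "How it works"; the helper name rounded_frequencies and the chosen values of N, s, and the total count are illustrative assumptions, not the tool's defaults.

```python
def rounded_frequencies(n, s, total_count):
    """Apply weight(r) = 1 / r**s, scale to total_count, round to integers."""
    weights = [1.0 / (r ** s) for r in range(1, n + 1)]
    sum_of_weights = sum(weights)
    return [round(total_count * w / sum_of_weights) for w in weights]


N = 1000
TOTAL = 100_000
results = {}
for s in (0.8, 1.0, 1.2):
    freqs = rounded_frequencies(N, s, TOTAL)
    rounded_total = sum(freqs)
    # Share of the corpus taken by the rank-1 word.
    top_share = 100.0 * freqs[0] / rounded_total
    # Tail words whose scaled frequency rounded down to zero.
    zero_count = sum(1 for f in freqs if f == 0)
    results[s] = (top_share, rounded_total, zero_count)
    print(f"s={s}: top word {top_share:.1f}%, "
          f"rounded total {rounded_total}, zero-frequency words {zero_count}")
```

The top word's share grows as s increases. The rounded total can drift from the requested count, but by no more than N / 2, since each rounding step changes a value by at most 0.5.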