This counter is tuned for Turkish, an agglutinative language where suffixes
stack onto a root to express meaning that other languages spread across several
words. Because a long form like evlerimizden (“from our houses”) is still one
word, the tool counts tokens rather than inferring word boundaries from length.
How it works
The counter matches runs of letters and digits, allowing an internal apostrophe
or hyphen between word characters. This keeps suffixed proper nouns like
Türkiye'de together as one word. Turkish letters such as ç, ğ, ı, İ,
ö, ş, and ü are treated as word characters. Sentences are detected from
terminal punctuation, and unique words are compared after a Turkish-aware
lowercasing that maps capital dotted İ to i and capital dotless I to ı.
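The matching and lowercasing rules described above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual source: the names WORD_RE, turkish_lower, tokenize, and unique_words are hypothetical, and accepting the typographic apostrophe (’) and normalizing input to NFC are assumptions added for the example.

```python
import re
import unicodedata

# Runs of letters and digits, optionally joined by a single internal
# apostrophe or hyphen. [^\W_] is "any word character except underscore",
# which in Python 3 covers ç, ğ, ı, İ, ö, ş, and ü.
# Assumption: the typographic apostrophe (’) is accepted as well.
WORD_RE = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules: İ -> i and I -> ı."""
    # Handle the two capital i's first; the default str.lower() would
    # turn I into i and İ into i followed by a combining dot.
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize(text: str) -> list[str]:
    """Return the words in text, keeping forms like Türkiye'de whole."""
    # Assumption: compose accents first so decomposed input still matches.
    return WORD_RE.findall(unicodedata.normalize("NFC", text))


def unique_words(text: str) -> set[str]:
    """Return the distinct words after Turkish-aware lowercasing."""
    return {turkish_lower(word) for word in tokenize(text)}


print(tokenize("Türkiye'de e-posta kullanıyoruz."))
print(sorted(unique_words("Işık IŞIK ışık İyi iyi")))  # → ['iyi', 'ışık']
```

With the default lowercasing, Işık and ışık would be counted as two different words, so the two replacements run before the generic lower() call.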
Tips and example
Take “Evlerimizden çıktık.” It contains two words: the long agglutinative form
evlerimizden and the verb çıktık. The longest-word statistic highlights how
much grammatical information a single Turkish word can carry, which is useful when
estimating reading time or comparing Turkish text length against translations.
Note that Turkish sometimes uses a semicolon mid-sentence. The counter does not
treat it as a sentence boundary: only periods, question marks, exclamation marks,
and ellipses end a sentence.
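The example and the sentence rule above can be reproduced with a short Python sketch. It is an approximation under stated assumptions rather than the counter's real implementation: text_stats, WORD_RE, and SENTENCE_END_RE are hypothetical names, a run of terminators such as ?! counts as one boundary, and trailing text with no terminator still counts as a sentence.

```python
import re

# Words: runs of letters and digits with an optional internal apostrophe or hyphen.
WORD_RE = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")

# Sentence terminators: period, question mark, exclamation mark, and the
# ellipsis character. A semicolon is deliberately absent from this set.
SENTENCE_END_RE = re.compile(r"[.!?…]+")


def text_stats(text: str) -> dict:
    """Count words and sentences and report the longest word."""
    words = WORD_RE.findall(text)
    # Split on terminators and keep only the chunks that contain a word,
    # so the empty piece after the final period is not counted.
    chunks = SENTENCE_END_RE.split(text)
    sentences = sum(1 for chunk in chunks if WORD_RE.search(chunk))
    longest = max(words, key=len, default="")
    return {"words": len(words), "sentences": sentences, "longest_word": longest}


print(text_stats("Evlerimizden çıktık."))
# → {'words': 2, 'sentences': 1, 'longest_word': 'Evlerimizden'}

stats = text_stats("Geldim; gördüm. Yendim!")
print(stats["words"], stats["sentences"])  # → 3 2
```

A splitter this simple would also break on the period inside a decimal such as 3.5, so a production counter would likely need an extra rule for numbers and abbreviations.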