Finnish Word Counter

Word count for Finnish agglutinative text

Count words in Finnish text, where heavily suffixed forms like talossanikaan and long compounds each stay one word. Handles ä, ö, å, and hyphenated forms, and also reports character and sentence counts. Runs in your browser.

How does it decide what counts as a word?

A word is a run of letters and digits, with internal hyphens and apostrophes kept inside. The counter splits on whitespace and punctuation, so a long suffixed form like talossanikaan counts as one word.

Finnish is strongly agglutinative: a single word can carry several case and possessive suffixes, as in talossanikaan (“not even in my house”), and can combine multiple stems into one long compound, such as the textbook lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas. Each of these is one orthographic word. The counter splits only at whitespace and punctuation, never inside a run of letters, so these forms stay whole and the word, character, and sentence totals match the written text.

How it works

The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:

  • It matches [\p{L}\p{N}] runs, permitting an internal - or ' between two such characters.
  • The Finnish vowels ä and ö, plus å, are Unicode letters and count as word characters.
  • A hyphen inside a word, as in EU-maa, keeps the form as one word; a spaced dash used as punctuation separates words.
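The rules above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's actual source: the names WORD_PATTERN, tokenize, and countWords are hypothetical, and the pattern is one plausible reading of "a run of [\p{L}\p{N}] with an internal - or ' allowed".

```typescript
// Hypothetical sketch; names and the exact pattern are assumptions.
// One or more letters/digits, optionally followed by groups of
// (hyphen or apostrophe + one or more letters/digits).
const WORD_PATTERN = "[\\p{L}\\p{N}]+(?:[-'][\\p{L}\\p{N}]+)*";

function tokenize(text: string): string[] {
  // The "u" flag enables \p{...} classes; "g" returns every match.
  // A fresh RegExp per call avoids shared lastIndex state.
  const re = new RegExp(WORD_PATTERN, "gu");
  return text.match(re) || [];
}

function countWords(text: string): number {
  return tokenize(text).length;
}

// The hyphen in EU-maa and 2020-luku sits between two word characters,
// so each stays whole; the spaced dash is never part of a match.
console.log(tokenize("EU-maa – 2020-luku")); // two tokens: EU-maa and 2020-luku
console.log(countWords("Hän asuu talossanikaan… Eikö niin?")); // 5
```

Because the hyphen or apostrophe must be followed by another letter or digit, a trailing hyphen or a dash surrounded by spaces is left outside the word.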

Characters are counted two ways: every character including spaces, and the length with whitespace removed. Sentences are counted by collapsing runs of terminal punctuation (., !, ?, …), so an ellipsis or a ?! combo counts as one boundary.
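The character and sentence counts can be sketched the same way. The TextStats shape and the textStats function are hypothetical names, and the sketch assumes the terminal set is ., !, ?, and the single-character ellipsis. It also assumes precomposed (NFC) text, where string length equals the number of visible characters for ä, ö, and å.

```typescript
// Hypothetical sketch; names and the terminal-punctuation set are assumptions.
interface TextStats {
  charsWithSpaces: number;
  charsNoSpaces: number;
  sentences: number;
}

function textStats(text: string): TextStats {
  // .length counts UTF-16 code units. Assumption: the text is precomposed,
  // so each Finnish letter is one unit (emoji would count as two).
  const charsWithSpaces = text.length;
  const charsNoSpaces = text.replace(/\s/g, "").length;

  // Each maximal run of terminal punctuation is one sentence boundary,
  // so "..." or "?!" contributes one, not three or two.
  const runs = text.match(/[.!?\u2026]+/g) || [];

  return { charsWithSpaces, charsNoSpaces, sentences: runs.length };
}

console.log(textStats("Hän asuu talossanikaan… Eikö niin?"));
// charsWithSpaces: 34, charsNoSpaces: 30, sentences: 2
```

A simple run-based count like this also treats the period in a decimal number or an abbreviation as a boundary, so it is best read as an estimate.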

Example

The text:

Hän asuu talossanikaan… Eikö niin?

contains five words: Hän, asuu, talossanikaan, Eikö, and niin. The long form talossanikaan is counted as a single word because it has no internal spaces or punctuation. The ellipsis and the question mark each mark one boundary, so the text has two sentences.

Notes

  • Mixed Finnish-English text and Latin-script product names are counted by the same rules, because all Latin letters are word characters.
  • Numbers like 2026 count as one word; a number glued to a suffix with a hyphen, such as 2020-luku, stays one compound word.