Finnish is strongly agglutinative: a single word can stack a case ending, a
possessive suffix, and a clitic, as in talossanikaan (“not even in my house”),
or join many stems into one long compound, as in the textbook
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas. Each of these
is one orthographic word. This counter never breaks a word inside a run of
letters, so suffixed forms and compounds stay whole and you get accurate word,
character, and sentence totals.
How it works
The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:
- It matches [\p{L}\p{N}] runs, permitting an internal - or ' between two such characters.
- The Finnish vowels ä and ö, plus å, are Unicode letters and count as word characters.
- A hyphen inside a word, as in EU-maa, keeps the form as one word; a spaced dash used as punctuation separates words.
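The word rule above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the counter's actual code: Python's built-in re module has no \p{L} or \p{N} classes, so the sketch substitutes [^\W_] (Unicode word characters minus the underscore) as a close stand-in. The names WORD_RE and find_words are hypothetical, and the input is assumed to use precomposed characters (ä as one code point).

```python
import re

# Stand-in for [\p{L}\p{N}]: Unicode letters and digits, without "_".
# (Assumption: close to, but not identical with, the \p{..} classes.)
_ALNUM = r"[^\W_]"

# A run of letters/digits, optionally continued by a single hyphen or
# apostrophe that is immediately followed by more letters/digits.
WORD_RE = re.compile(rf"{_ALNUM}+(?:[-']{_ALNUM}+)*")


def find_words(text: str) -> list[str]:
    """Return every word matched by the rule, in order of appearance."""
    return WORD_RE.findall(text)


if __name__ == "__main__":
    print(find_words("EU-maa on sana - tai kaksi"))
    print(len(find_words("Hän asuu talossanikaan… Eikö niin?")))
```

Because a hyphen or apostrophe only joins when it has a letter or digit on both sides, a spaced dash falls outside every match and acts as a separator.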
Characters are counted two ways: every character including spaces, and the
length with whitespace removed. Sentences are counted by collapsing runs of
terminal punctuation (., !, ?, …) so an ellipsis or a ?! combo counts
as one boundary.
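A minimal Python sketch of the two character counts and the sentence rule, again as an illustration rather than the tool's real implementation. The function names are hypothetical. Characters are counted as Unicode code points, which assumes precomposed letters. The sketch makes no attempt to handle abbreviations or decimal points, and text with no terminal punctuation yields zero sentences here.

```python
import re

# One or more terminal marks in a row (., !, ?, …) form a single boundary,
# so "..." or "?!" is counted once.
SENTENCE_END_RE = re.compile(r"[.!?…]+")


def count_characters(text: str) -> tuple[int, int]:
    """Return (all characters, characters with whitespace removed)."""
    total = len(text)
    # str.split() with no arguments splits on any whitespace run.
    without_whitespace = len("".join(text.split()))
    return total, without_whitespace


def count_sentences(text: str) -> int:
    """Count collapsed runs of terminal punctuation."""
    return len(SENTENCE_END_RE.findall(text))


if __name__ == "__main__":
    sample = "Hän asuu talossanikaan… Eikö niin?"
    print(count_characters(sample))
    print(count_sentences(sample))
```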
Example
The text:
Hän asuu talossanikaan… Eikö niin?
contains five words: Hän, asuu, talossanikaan, Eikö, and niin. The suffixed
form talossanikaan is counted as a single word because it contains no spaces
or other separators. The ellipsis and the question mark each end a sentence,
so the text has two sentences.
Notes
- Mixed Finnish-English text and Latin product names are counted sensibly because Latin letters are also word characters.
- Numbers like 2026 count as one word; a number glued to a suffix with a hyphen, such as 2020-luku, stays one compound word.