Counting words in Russian needs more care than a naive space split. Russian
typography uses the em-dash heavily as punctuation, while hyphens join genuine
compound words like кто-то and по-русски. This counter applies the correct
boundary rules so compounds stay whole, dashes-as-punctuation split words, and
you get accurate word, character, and sentence totals.
How it works
The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:
- It matches [\p{L}\p{N}] runs, permitting an internal hyphen (-) or apostrophe (') between two such characters.
- A hyphen inside a word (no surrounding spaces) keeps the compound as one word: что-то, нью-йоркский.
- An em-dash (—) or en-dash (–), which in Russian is almost always set off with spaces, falls outside the word pattern and therefore separates the words on either side.
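
The word rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the counter's actual code: Python's built-in re module has no \p{L} or \p{N} classes, so [^\W_] (any Unicode word character except the underscore) stands in for [\p{L}\p{N}], and the names WORD_RE and find_words are hypothetical.

```python
import re

# Assumption: [^\W_] approximates [\p{L}\p{N}], because the standard
# re module does not support \p{...} property classes.
ALNUM = r"[^\W_]"

# One or more letters/digits, optionally followed by groups made of a
# single hyphen or apostrophe plus more letters/digits. The separator is
# only accepted when it sits between two letter/digit characters.
WORD_RE = re.compile(rf"{ALNUM}+(?:[-']{ALNUM}+)*")


def find_words(text: str) -> list[str]:
    """Return the words in text, keeping hyphenated compounds whole."""
    return WORD_RE.findall(text)


print(find_words("что-то — нью-йоркский"))  # ['что-то', 'нью-йоркский']
```

Because the spaced dash never sits directly between two letters, it cannot match the optional separator group, so the words on either side come out as separate matches.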
Characters are counted two ways: the total including spaces, and the total
with whitespace removed. Sentences are counted by collapsing each run of
terminal punctuation (., !, ?, …) into one boundary, so that an ellipsis or a
?! combination counts as a single sentence boundary.
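
The character and sentence counts can be approximated as follows. This is a minimal sketch: the function names are hypothetical, "character" is assumed to mean a Unicode code point (what Python's len reports), and a sentence is assumed to be any run of terminal punctuation. Text without terminal punctuation, or abbreviations containing periods, would need extra handling that is not shown here.

```python
import re

# Assumption: a sentence boundary is one or more of . ! ? … in a row.
SENTENCE_END_RE = re.compile(r"[.!?…]+")


def char_counts(text: str) -> tuple[int, int]:
    """Return (all characters, characters with whitespace removed)."""
    total = len(text)  # counts code points, spaces included
    without_whitespace = len(re.sub(r"\s+", "", text))
    return total, without_whitespace


def count_sentences(text: str) -> int:
    """Count runs of terminal punctuation, so '...' or '?!' count once."""
    return len(SENTENCE_END_RE.findall(text))


print(char_counts("Да? Нет!"))                      # (8, 7)
print(count_sentences("Что?! Не знаю... Ладно."))   # 3
```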
Example
The text:
Кто-то сказал: «Привет» — и ушёл.
contains five words: Кто-то, сказал, Привет, и, ушёл. The compound Кто-то
stays as one word, while the spaced em-dash separates Привет and и instead of
merging them.
Notes
- Because \p{L} matches letters in any script, Latin included, mixed Russian-English text and Latin product names are also counted sensibly.
- Numbers like 2026 count as one word; a number glued to a unit by a hyphen, such as 5-летний, stays a single compound word as Russian expects.
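
The two notes can be checked with a short, self-contained sketch. As an assumption, [^\W_] again stands in for [\p{L}\p{N}], since Python's standard re module lacks \p{...} classes; count_words is a hypothetical helper and the sample strings are made up for illustration.

```python
import re

# Assumption: [^\W_] approximates [\p{L}\p{N}] in Python's re module.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def count_words(text: str) -> int:
    """Count words, treating digits and Latin letters like any other."""
    return len(WORD_RE.findall(text))


print(count_words("В 2026 году"))           # 3: the number is one word
print(count_words("5-летний план"))         # 2: the compound stays whole
print(count_words("Новый роутер Wi-Fi 6"))  # 4: Latin text counts too
```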