Czech orthography is rich in diacritics: the háček (caron) gives č, š, ž,
ř, ě, ň, ď, ť, the kroužek gives ů, and acute accents give á, é,
í, ó, ú, ý. A counter that classifies characters poorly can split a word
at a diacritic. This tool relies on full Unicode letter classification so every
Czech letter stays inside its word, and reports accurate word, character, and
sentence totals.
How it works
The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:
- It matches [\p{L}\p{N}] runs, permitting an internal - or ' between two such characters.
- Every Czech diacritic letter is a Unicode letter and counts as a word character: č š ž ř ě ň ď ť ů á é í ó ú ý and their uppercase forms.
- A hyphen inside a word, as in česko-slovenský, keeps the compound as one word; a spaced dash used as punctuation separates words.
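The matching rule above can be sketched in Python. This is a minimal illustration, not the tool's actual code: the names WORD_CHAR, WORD_RE, find_words, and count_words are hypothetical, and because Python's standard re module does not support \p{L} or \p{N}, the sketch assumes [^\W_] (any Unicode word character except the underscore) is a close enough stand-in for [\p{L}\p{N}].

```python
import re

# Assumption: the standard `re` module has no \p{L} / \p{N} classes, so
# [^\W_] (a Unicode word character other than "_") approximates
# [\p{L}\p{N}] in this sketch.
WORD_CHAR = r"[^\W_]"

# A run of word characters, optionally continued by a single hyphen or
# apostrophe that is immediately followed by another run. A hyphen or
# apostrophe next to a space therefore never joins two words.
WORD_RE = re.compile(rf"{WORD_CHAR}+(?:[-']{WORD_CHAR}+)*")


def find_words(text: str) -> list[str]:
    """Return the words found in `text`, in order of appearance."""
    return WORD_RE.findall(text)


def count_words(text: str) -> int:
    """Return the number of words in `text`."""
    return len(find_words(text))
```

With these definitions, find_words("česko-slovenský") returns the compound as a single item, while find_words("ano - ne") returns two words because the spaced dash has no word character directly on either side.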
Characters are counted two ways: every character including spaces, and the
length with whitespace removed. Sentences are counted by collapsing runs of
terminal punctuation (., !, ?, …) so an ellipsis or a ?! combo counts
as one boundary.
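The two character counts and the sentence rule can be sketched as follows. The function names are hypothetical, and the sketch makes three assumptions that the description above does not settle: a "character" is one Unicode code point, text with no terminal punctuation counts as zero sentences, and every run of terminal punctuation counts as a boundary, even a period inside a number.

```python
import re

# One or more terminal punctuation marks in a row form a single boundary,
# so "..." , "…" and "?!" each count once.
SENTENCE_END_RE = re.compile(r"[.!?…]+")


def count_characters(text: str) -> tuple[int, int]:
    """Return (all characters, characters with whitespace removed).

    Assumption: a character is one Unicode code point, as measured by len().
    """
    total = len(text)
    # str.split() with no arguments splits on any run of whitespace.
    without_whitespace = len("".join(text.split()))
    return total, without_whitespace


def count_sentences(text: str) -> int:
    """Count runs of terminal punctuation as sentence boundaries."""
    return len(SENTENCE_END_RE.findall(text))
```

For instance, count_sentences("Opravdu?! Ano... Ne…") returns 3, since each run of terminal punctuation collapses into one boundary.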
Example
The text:
Žluťoučký kůň… Příliš ano? Ne-li.
contains the words Žluťoučký, kůň, Příliš, ano, and Ne-li: five words in total.
Every diacritic stays inside its word, and the hyphen keeps Ne-li whole.
Notes
- Mixed Czech-English text and Latin product names are counted sensibly because Latin letters are also word characters.
- Numbers like 2026 count as one word; a number glued to a suffix by a hyphen, such as 90-tých, stays one compound word.