Romanian uses several diacritic letters — ă, â, î, and the comma-below
ș and ț. A frequent pitfall is that older documents encode ș/ț with a
cedilla (ş/ţ) instead. Both encodings are valid Unicode letters, so a robust
counter must treat every variant as part of a word rather than as a separator.
This tool does exactly that and reports accurate word, character, and sentence
totals.
How it works
The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:
- It matches [\p{L}\p{N}] runs, permitting an internal - or ' between two such characters.
- All Romanian diacritics are Unicode letters: comma-below ș/ț (U+0219/U+021B) and cedilla ş/ţ (U+015F/U+0163) alike, plus ă, â, î. Every variant counts as a word character.
- A hyphen inside a word, as in într-un or s-a, keeps the elided form as one word; a spaced dash used as punctuation separates words.
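The matching rule above can be sketched in Python. This is a minimal illustration, not the tool's actual source: the standard `re` module has no `\p{L}` or `\p{N}` classes, so the sketch assumes `[^\W_]` (any Unicode word character except the underscore) is a close enough stand-in. The names `WORD_RE` and `find_words` are hypothetical, and the NFC normalization step is an added assumption so that letters typed with combining marks are treated like their precomposed forms.

```python
import re
import unicodedata

# Stand-in for [\p{L}\p{N}]: \w without the underscore (an approximation).
WORD_CHAR = r"[^\W_]"

# One or more word characters, optionally followed by groups of
# "a single hyphen or apostrophe + more word characters".
WORD_RE = re.compile(rf"{WORD_CHAR}+(?:[-']{WORD_CHAR}+)*")


def find_words(text: str) -> list[str]:
    """Return the words in text as maximal letter/digit runs."""
    # NFC composes base letter + combining mark into one code point,
    # e.g. "s" + U+0326 becomes U+0219, so the run is not split.
    text = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(text)


sample = "Într-o țară frumoasă, copiii s-au jucat… Nu-i așa?"
print(find_words(sample))
# ['Într-o', 'țară', 'frumoasă', 'copiii', 's-au', 'jucat', 'Nu-i', 'așa']
```

Because the hyphen is only accepted between two word characters, a spaced dash such as the one in "da - nu" never joins its neighbours.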
Characters are counted two ways: every character including spaces, and the
length with whitespace removed. Sentences are counted by collapsing runs of
terminal punctuation (., !, ?, …) so an ellipsis or a ?! combo counts
as one boundary.
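A rough sketch of both counts, under stated assumptions: the function names are hypothetical, whitespace means whatever `str.split()` treats as whitespace, and the sentence counter simply counts runs of terminal punctuation. That simplification means text with no closing punctuation contributes no sentence, and a decimal point or abbreviation would count as a boundary; how the tool handles those cases is not stated here.

```python
import re

# A run of one or more terminal marks counts as a single boundary.
TERMINATORS = re.compile(r"[.!?…]+")


def count_characters(text: str) -> tuple[int, int]:
    """Return (all characters, characters with whitespace removed)."""
    total = len(text)
    without_whitespace = len("".join(text.split()))
    return total, without_whitespace


def count_sentences(text: str) -> int:
    """Count sentence boundaries as runs of ., !, ? or the ellipsis."""
    return len(TERMINATORS.findall(text))


print(count_characters("a b  c\n"))      # (7, 3)
print(count_sentences("Ce?! Nu... Da."))  # 3
```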
Example
The text:
Într-o țară frumoasă, copiii s-au jucat… Nu-i așa?
contains the words Într-o, țară, frumoasă, copiii, s-au, jucat,
Nu-i, așa — eight words. The elided forms Într-o, s-au, and Nu-i each
count as one word because the hyphen has no surrounding spaces.
Notes
- Because comma-below and cedilla letters belong to the same word-character class, a document that mixes both encodings still counts correctly.
- Numbers like 2026 count as one word; a number joined to a word by a hyphen, such as anii-1990, stays one compound word.
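Both notes can be checked with a short script. It is only a sketch: it reuses the same assumed `[^\W_]` approximation of [\p{L}\p{N}], the helper names are hypothetical, and the letters are written as escapes so the code points are unambiguous.

```python
import re
import unicodedata

# Same approximate word pattern: letter/digit runs with internal - or '.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")

COMMA_BELOW = "\u0219\u021b"      # ș ț
CEDILLA = "\u015f\u0163"          # ş ţ
OTHERS = "\u0103\u00e2\u00ee"     # ă â î


def letter_categories() -> dict[str, str]:
    """Map each Romanian diacritic letter to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in COMMA_BELOW + CEDILLA + OTHERS}


def count_words(text: str) -> int:
    return len(WORD_RE.findall(text))


modern = "\u0219tiin\u021b\u0103 \u0219i \u021bar\u0103"   # comma-below spelling
legacy = "\u015ftiin\u0163\u0103 \u015fi \u0163ar\u0103"   # cedilla spelling

print(letter_categories())                    # every value is 'Ll'
print(count_words(modern), count_words(legacy))  # 3 3
print(count_words(modern + " " + legacy))        # 6
print(WORD_RE.findall("anii-1990 2026"))         # ['anii-1990', '2026']
```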