Romanian Word Counter

Word count for Romanian text that handles both ș/ț diacritic variants

Count words in Romanian text with correct Unicode handling of both comma-below (ș, ț) and the legacy cedilla variants (ş, ţ), plus ă, â, î. Shows words, characters, and sentences. Runs in your browser.

How does it decide what counts as a word?

A word is a run of Romanian letters and digits with internal hyphens and apostrophes kept inside. The counter splits on whitespace and punctuation, so words like și, țară, and într-un are handled correctly.

Romanian uses several diacritic letters — ă, â, î, and the comma-below ș and ț. A frequent pitfall is that older documents encode ș/ț with a cedilla (ş/ţ) instead. Both encodings are valid Unicode letters, so a robust counter must treat every variant as part of a word rather than as a separator. This tool does exactly that and reports accurate word, character, and sentence totals.
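The claim that both encodings are ordinary Unicode letters can be checked directly. The snippet below is a minimal sketch using Python's standard unicodedata module; the VARIANTS table and the letter_info helper are names invented for this illustration and are not part of the tool.

```python
import unicodedata

# Hypothetical lookup table for this illustration: the four lowercase
# s/t variants found in Romanian text.
VARIANTS = {
    "comma-below s": "\u0219",
    "comma-below t": "\u021b",
    "cedilla s": "\u015f",
    "cedilla t": "\u0163",
}

def letter_info(ch):
    """Return (code point, general category, Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

if __name__ == "__main__":
    for label, ch in VARIANTS.items():
        code_point, category, name = letter_info(ch)
        # A category starting with "L" means the character is a letter.
        print(f"{label}: {code_point} {category} {name}")
```

All four characters report a letter category, so a matcher built on the letter class accepts either encoding without a special case.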

How it works

The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:

  • It matches [\p{L}\p{N}] runs, permitting an internal - or ' between two such characters.
  • All Romanian diacritics are Unicode letters: comma-below ș/ț (U+0219/U+021B) and cedilla ş/ţ (U+015F/U+0163) alike, plus ă, â, î. Every variant counts as a word character.
  • A hyphen inside a word, as in într-un or s-a, keeps the elided form as one word; a spaced dash used as punctuation separates words.
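The matching rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual source: Python's standard re module has no \p{L} or \p{N} classes, so the sketch substitutes [^\W_] (any Unicode word character except the underscore) as a close stand-in. The names WORD_RE and find_words are hypothetical.

```python
import re

# Stand-in for [\p{L}\p{N}]: \w minus the underscore, i.e. Unicode letters
# and digits. This is an approximation chosen because the stdlib re module
# does not support \p{...} property classes.
WORD_CHAR = r"[^\W_]"

# One or more word characters, optionally followed by groups of
# "a single hyphen or apostrophe + more word characters".
# The joiner is only consumed when a word character follows it.
WORD_RE = re.compile(rf"{WORD_CHAR}+(?:[-']{WORD_CHAR}+)*")

def find_words(text):
    """Return the list of words found in text."""
    return WORD_RE.findall(text)

if __name__ == "__main__":
    print(find_words("s-a dus într-un sat"))  # ['s-a', 'dus', 'într-un', 'sat']
    print(find_words("da - nu"))              # ['da', 'nu']
```

Because the joiner must sit between two word characters, a spaced dash never matches and simply falls between two separate words. The sketch assumes precomposed (NFC) text; a letter stored as a base character plus a combining mark would need normalizing first.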

Characters are counted two ways: the full length including spaces, and the length with whitespace removed. Sentences are counted by collapsing each run of terminal punctuation (., !, ?, …) into a single boundary, so an ellipsis or a ?! combination counts once.
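The two character totals and the sentence rule can be sketched as follows. This is a rough illustration, not the tool's code: it assumes the terminal set is ., !, ? and the ellipsis character (U+2026), counts characters as Unicode code points, and does not count trailing text that lacks terminal punctuation. The function names are hypothetical.

```python
import re

# A run of one or more terminal marks counts as a single boundary.
# Including U+2026 (the ellipsis character) is an assumption of this sketch.
TERMINAL_RUN = re.compile(r"[.!?\u2026]+")

def count_characters(text):
    """Return (length with spaces, length with all whitespace removed)."""
    total = len(text)
    # str.split() with no arguments splits on any Unicode whitespace.
    without_whitespace = len("".join(text.split()))
    return total, without_whitespace

def count_sentences(text):
    """Count sentence boundaries as runs of terminal punctuation."""
    return len(TERMINAL_RUN.findall(text))

if __name__ == "__main__":
    sample = "Ce faci?! Bine... Da."
    print(count_characters(sample))
    print(count_sentences(sample))  # 3
```

A simple rule like this treats the dot in a decimal such as 3.14 or in an abbreviation as a boundary too; whether the tool special-cases those is not stated here.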

Example

The text:

Într-o țară frumoasă, copiii s-au jucat… Nu-i așa?

contains eight words: Într-o, țară, frumoasă, copiii, s-au, jucat, Nu-i, and așa. The elided forms Într-o, s-au, and Nu-i each count as one word because the hyphen sits directly between two letters, with no surrounding spaces.

Notes

  • Because comma-below and cedilla letters both belong to the Unicode letter class that the matcher accepts, a document that mixes the two encodings still counts correctly.
  • Numbers like 2026 count as one word; a word and a number joined by a hyphen, such as anii-1990, stay together as one compound word.