Ukrainian Cyrillic poses a counting subtlety that trips up naive tools: the
apostrophe is part of the word, not punctuation. In words like м'яч, об'єкт, and
п'ять it separates a consonant from an iotated vowel and belongs inside the
word. A counter that splits on the apostrophe would wrongly report м'яч as two
words. This tool keeps the apostrophe inside words and gives accurate word,
character, and sentence totals.
How it works
The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:
- It matches [\p{L}\p{N}] runs, permitting an internal hyphen (-), straight apostrophe ('), or typographic apostrophe (’) between two such characters.
- An apostrophe inside a word, as in п'ять, keeps the word whole. Both the straight ASCII apostrophe and the right single quotation mark are recognised.
- A hyphen inside a word, as in будь-який or по-українськи, keeps the compound as one word; a spaced dash used as punctuation separates words.
- Ukrainian-specific letters ї, і, є, and ґ are Unicode letters and count as word characters.
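The word rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual source: the names WORD_RE, find_words, and count_words are hypothetical, and because the standard re module does not support \p{L} or \p{N}, the sketch uses [^\W_] (Unicode letters and digits, with the underscore excluded) as a close stand-in.

```python
import re

# Assumption: [^\W_] approximates [\p{L}\p{N}] in Python's stdlib "re",
# which has no \p{...} classes. A word is one or more runs of such
# characters, joined by a single hyphen, straight apostrophe, or
# right single quotation mark (U+2019).
WORD_RE = re.compile(r"[^\W_]+(?:[-'\u2019][^\W_]+)*")


def find_words(text: str) -> list[str]:
    """Return every word in the text, in order of appearance."""
    return WORD_RE.findall(text)


def count_words(text: str) -> int:
    """Return the number of words in the text."""
    return len(find_words(text))


print(find_words("будь-який об'єкт"))  # ['будь-який', "об'єкт"]
print(count_words("так \u2014 ні"))     # 2: the spaced dash separates words
```

Because a joiner only matches when a letter or digit follows it, a trailing apostrophe or a dash surrounded by spaces is left outside the word.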
Characters are counted two ways: the total including spaces, and the total
with all whitespace removed. Sentences are counted by collapsing runs of
terminal punctuation (., !, ?, …) so an ellipsis or a ?! combination counts
as one boundary.
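A minimal Python sketch of these two counts follows. The function names and the regular expression are assumptions made for illustration, and the sketch simply counts boundaries: how the tool treats text with no terminal punctuation, or a period inside a number, is not stated here and is not modelled.

```python
import re

# A run of one or more terminal marks counts as a single boundary.
# \u2026 is the one-character ellipsis.
SENTENCE_END_RE = re.compile(r"[.!?\u2026]+")


def count_characters(text: str) -> tuple[int, int]:
    """Return (total characters, characters with all whitespace removed)."""
    total = len(text)
    # str.split() with no arguments splits on any Unicode whitespace.
    without_whitespace = len("".join(text.split()))
    return total, without_whitespace


def count_sentences(text: str) -> int:
    """Count sentence boundaries as collapsed runs of terminal punctuation."""
    return len(SENTENCE_END_RE.findall(text))


print(count_characters("а б\tв\n"))   # (6, 3)
print(count_sentences("Що?! Так..."))  # 2
```

Lengths here are measured in Unicode code points, so text stored with combining marks rather than precomposed letters would give a larger total.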
Example
The text:
П'ять синьо-жовтих м'ячів… Чи не так?
contains the words П'ять, синьо-жовтих, м'ячів, Чи, не, так — six
words. The apostrophe keeps П'ять and м'ячів whole, and the hyphen keeps
синьо-жовтих as a single compound.
Notes
- Mixed Ukrainian-English text and Latin product names are counted by the same rules, because Latin letters are also word characters.
- Numbers like 2026 count as one word; a number glued to a suffix by a hyphen, such as 90-ті, stays one compound word.