Counting words in Hungarian needs awareness of its agglutinative grammar. A
single Hungarian word can stack many suffixes onto a stem, producing long forms
like házaitokban or the famous megszentségteleníthetetlenségeskedéseitekért.
Because these contain no spaces, each is exactly one orthographic word. This
counter applies boundary rules that keep long suffixed forms and hyphenated
compounds whole, so the word, character, and sentence totals stay accurate.
How it works
The algorithm treats a word as a maximal run of letters and digits with internal hyphens and apostrophes allowed:
- It matches [\p{L}\p{N}] runs, permitting an internal hyphen (-) or apostrophe (') between two such characters.
- Every Hungarian accented letter (á é í ó ö ő ú ü ű and uppercase) is a Unicode letter and counts as a word character.
- A hyphen inside a word, as in dél-afrikai, keeps the form as one word; a spaced dash used as punctuation separates words.
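The matching rule above can be sketched in Python. This is a minimal illustration and not the counter's actual source: the names WORD_RE and find_words are hypothetical, and because the standard-library re module has no \p{...} classes, the sketch assumes [^\W_] (Unicode word characters minus the underscore) is a close enough stand-in for [\p{L}\p{N}].

```python
import re

# Assumption: [^\W_] approximates [\p{L}\p{N}]. For str patterns, \w is
# Unicode-aware, so accented letters such as ő and ű are word characters.
# A single - or ' is allowed only between two letter/digit runs.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def find_words(text: str) -> list[str]:
    """Return every maximal word match in reading order."""
    return WORD_RE.findall(text)


print(find_words("A dél-afrikai őszibarack - úgy hírlik - édes."))
# ['A', 'dél-afrikai', 'őszibarack', 'úgy', 'hírlik', 'édes']
```

The hyphen in dél-afrikai sits between two letters, so it stays inside the match. The spaced dashes have no letter on either side, so they match nothing and simply separate the neighbouring words.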
Characters are counted two ways: every character including spaces, and the
length with whitespace removed. Sentences are counted by collapsing runs of
terminal punctuation (., !, ?, …) so an ellipsis or a ?! combo counts
as one boundary.
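A rough Python sketch of both counts follows. The function names are hypothetical, and the sketch makes assumptions the description above does not settle: text without any terminal punctuation yields zero sentences, and a period inside a number or abbreviation would also count as a boundary.

```python
import re

# One or more terminal marks in a row form a single run, so "...", "?!"
# and "…" each count once.
SENTENCE_END_RE = re.compile(r"[.!?…]+")


def count_characters(text: str) -> tuple[int, int]:
    """Return (length including whitespace, length with whitespace removed)."""
    without_whitespace = "".join(text.split())  # split() drops all whitespace
    return len(text), len(without_whitespace)


def count_sentences(text: str) -> int:
    """Count runs of terminal punctuation as sentence boundaries."""
    return len(SENTENCE_END_RE.findall(text))


print(count_characters("Jó napot!"))                       # (9, 8)
print(count_sentences("Tényleg?! Nem hiszem... Na jó."))   # 3
```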
Example
The text:
A nagymama megszentségteleníthetetlenségeskedéseitekért aggódott.
contains four words: A, nagymama,
megszentségteleníthetetlenségeskedéseitekért, and aggódott. The very long
suffixed form is counted as a single word because it has no internal spaces.
Notes
- Mixed Hungarian-English text and Latin product names are counted sensibly because Latin letters are also word characters.
- Numbers like 2026 count as one word; a number glued to a suffix with a hyphen, such as 2026-ban, stays one compound word as Hungarian expects.
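These cases can be checked with a short Python sketch. It is illustrative only: split_words is a hypothetical name, the sample sentence is invented, and [^\W_] is assumed to approximate the [\p{L}\p{N}] class described under "How it works".

```python
import re

# Assumption: [^\W_] (Unicode letters and digits) stands in for [\p{L}\p{N}].
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def split_words(text: str) -> list[str]:
    """Split text into words; digits and Latin letters are word characters."""
    return WORD_RE.findall(text)


for sample in ["2026", "2026-ban", "Az iPhone 15-öt 2026-ban vettem."]:
    words = split_words(sample)
    print(f"{sample!r}: {len(words)} word(s) -> {words}")
```

The last sample mixes a Latin product name with suffixed numbers and yields five words: Az, iPhone, 15-öt, 2026-ban, and vettem.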