This word counter is built for Portuguese, where pronouns frequently attach to the verb with a hyphen. A naive counter that splits on every hyphen would over-count those forms, so this tool keeps clitic chains together as single words to match how Portuguese grammar treats them.
How it works
The counter scans the text for runs of Latin letters, allowing hyphens and
apostrophes inside a run as long as they sit between letters. This means a form
like fazê-la or the mesoclitic dir-se-ia is matched as one token rather than two
or three. Letters with diacritics such as á, ã, ê, and ç are treated as word
characters, so a word like coração is counted as a single word. Sentences are
counted from terminal punctuation, and unique words are compared
case-insensitively.
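The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the names WORD_RE, tokenize, count_unique, and count_sentences are hypothetical, the pattern accepts any Unicode letter rather than Latin letters only, and the set of terminal punctuation marks (., !, ?) is a guess.

```python
import re
import unicodedata

# Assumed pattern: one or more letters, optionally followed by groups of
# "a single hyphen or apostrophe plus more letters". [^\W\d_] matches any
# Unicode letter, which is broader than "Latin letters" in the description.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

def tokenize(text):
    """Return word tokens, keeping hyphenated clitic forms together."""
    # Normalise to NFC so a letter plus a combining accent becomes one
    # precomposed character that the letter class can match.
    return WORD_RE.findall(unicodedata.normalize("NFC", text))

def count_unique(tokens):
    """Count distinct words, ignoring case."""
    return len({token.casefold() for token in tokens})

def count_sentences(text):
    """Count runs of terminal punctuation (assumed to be ., ! and ?)."""
    return len(re.findall(r"[.!?]+", text))
```

With this sketch, tokenize("dir-se-ia") yields a single token, and count_unique treats Casa and casa as the same word. Counting sentences by punctuation runs alone will over-count text that contains abbreviations, so a real tool may handle those differently.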
Tips and example
Consider the sentence “Não consigo dar-lhe a resposta.” Here dar-lhe is one
word, giving a count of five words rather than the six a hyphen-splitting
counter would report. The same applies to other enclitic forms like vê-los and
dá-lo, and to contractions written with an apostrophe such as d'água. The
clitic/compound counter shows how many tokens contain a hyphen, which is a quick
way to confirm that hyphenated forms in your text are being kept together.
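The worked example can be checked with a short Python sketch. It is hypothetical and self-contained: WORD_RE and count_words are illustrative names, and the pattern is an assumed approximation of the rule described earlier, not the tool's own code.

```python
import re

# Assumed word pattern: letters, with single hyphens or apostrophes
# allowed between letters.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

def count_words(text):
    """Return (total tokens, tokens containing a hyphen)."""
    tokens = WORD_RE.findall(text)
    hyphenated = [token for token in tokens if "-" in token]
    return len(tokens), len(hyphenated)

sentence = "Não consigo dar-lhe a resposta."
total, hyphenated = count_words(sentence)

# A naive counter that treats every hyphen as a word boundary.
naive_total = len(re.findall(r"[^\W\d_]+", sentence))

print(total, hyphenated, naive_total)  # 5 1 6
```

The second value plays the role of the clitic/compound counter: it is 1 here because only dar-lhe contains a hyphen.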