Why is counting Thai words hard?

Thai is written with no spaces between words; spaces mark phrase or sentence breaks instead. A program must guess where one word ends and the next begins, which is an ambiguous segmentation problem rather than a simple space-split.

How does longest-match segmentation work?

Starting at each position, the tool looks for the longest entry in its Thai dictionary that matches from that point, emits it as a word, and moves past it. This greedy maximal-matching is a standard, fast Thai tokenization heuristic.
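The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: `longest_match` and `TOY_DICTIONARY` are hypothetical names, the five-word dictionary is made up for the example, and unmatched text falls back to a single character for simplicity.

```python
def longest_match(text, dictionary):
    """Greedy maximal matching: at each position take the longest dictionary hit."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        step = 1  # simplification: consume one character if nothing matches
        # Try the longest candidate first, then shrink until a word is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                step = length
                break
        tokens.append(text[i:i + step])
        i += step
    return tokens


# Hypothetical five-word dictionary, for illustration only.
TOY_DICTIONARY = {"ฉัน", "รัก", "ภาษา", "ภาษาไทย", "ไทย"}

tokens = longest_match("ฉันรักภาษาไทย", TOY_DICTIONARY)
# tokens == ["ฉัน", "รัก", "ภาษาไทย"]: "ภาษาไทย" wins over the shorter "ภาษา"
```

Trying candidates from longest to shortest is what makes the match "maximal"; capping the candidate length at the longest dictionary entry keeps each step cheap.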

Longest-match works well for common vocabulary, but it is a heuristic: it can mis-split rare words, names, or compounds that are not in the dictionary. Treat the count as a close estimate, and review the segmented output when the text is important.

What about characters not in the dictionary?

When no dictionary word matches, the tool consumes one cluster (a base character plus its stacked marks) as an unknown token, so the scan never stalls. Latin words, numbers, and punctuation marks are each split off as their own tokens.
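One way to picture the cluster fallback is the sketch below. It assumes a cluster is a base character followed by any Unicode non-spacing marks (category `Mn`), which covers the Thai vowels written above or below a consonant and the tone marks. The tool's exact cluster rule is not documented here, so treat `cluster_end` and `split_clusters` as hypothetical helpers.

```python
import unicodedata


def cluster_end(text, start):
    """Return the index just past the cluster that begins at `start`.

    Assumption: a cluster is one base character plus any following
    non-spacing marks (Unicode general category "Mn").
    """
    end = start + 1
    while end < len(text) and unicodedata.category(text[end]) == "Mn":
        end += 1
    return end


def split_clusters(text):
    """Split text into consecutive clusters, never separating a mark from its base."""
    clusters = []
    i = 0
    while i < len(text):
        j = cluster_end(text, i)
        clusters.append(text[i:j])
        i = j
    return clusters


clusters = split_clusters("ที่นี่")
# clusters == ["ที่", "นี่"]: each consonant keeps its vowel and tone mark
```

Advancing by at least one character on every step is what guarantees the scan always terminates, and keeping marks attached to their base avoids emitting a dangling tone mark as a token.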

Is my text sent to a server?

No. Segmentation runs entirely in your browser against a built-in dictionary, so nothing is sent to any server.

Thai Word Counter — Gera Tools

Thai is written without spaces between words; a space marks a phrase or sentence break, not a word boundary. That makes word counting a segmentation problem rather than a simple split on whitespace. This free tool splits continuous Thai text into words using client-side longest-match dictionary lookup and counts them.

How it works

The tool uses greedy maximal matching. Starting at each position in the text, it searches its built-in Thai dictionary for the longest word that matches from that point. It emits that word as one token and advances past it, then repeats from the new position. When no dictionary word matches, it consumes a single grapheme cluster — a base consonant plus any stacked vowels and tone marks — as an unknown token so the scan never stalls.

Latin runs, digit runs, and punctuation are each tokenised separately, and explicit spaces act as hard boundaries. The word count is the number of dictionary and unknown Thai tokens plus any Latin/numeric tokens.
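The counting rule above can be sketched end to end as follows. Several details are assumptions made for the example: the toy `DICTIONARY`, the regular expression used to classify runs, treating the whole Thai Unicode block (U+0E00 to U+0E7F) as Thai text, and reading the rule as excluding punctuation from the word count.

```python
import re
import unicodedata

# Hypothetical toy dictionary, for illustration only.
DICTIONARY = {"ฉัน", "รัก", "ภาษา", "ภาษาไทย", "ไทย"}
MAX_LEN = max(len(word) for word in DICTIONARY)

# One named alternative per kind of run; the last one catches any other character.
TOKEN_RE = re.compile(
    r"(?P<thai>[\u0E00-\u0E7F]+)"
    r"|(?P<latin>[A-Za-z]+)"
    r"|(?P<number>[0-9]+)"
    r"|(?P<space>\s+)"
    r"|(?P<punct>.)",
    re.DOTALL,
)


def segment_thai(run):
    """Longest-match over one Thai run, with a one-cluster fallback."""
    tokens = []
    i = 0
    while i < len(run):
        for length in range(min(MAX_LEN, len(run) - i), 0, -1):
            if run[i:i + length] in DICTIONARY:
                tokens.append(("thai", run[i:i + length]))
                i += length
                break
        else:
            # No dictionary hit: take a base character plus its non-spacing marks.
            j = i + 1
            while j < len(run) and unicodedata.category(run[j]) == "Mn":
                j += 1
            tokens.append(("unknown", run[i:j]))
            i = j
    return tokens


def tokenize(text):
    """Return (kind, token) pairs; spaces only separate runs and are dropped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "space":
            continue
        if kind == "thai":
            tokens.extend(segment_thai(match.group()))
        else:
            tokens.append((kind, match.group()))
    return tokens


def count_words(text):
    """Count Thai (dictionary and unknown), Latin, and numeric tokens."""
    return sum(1 for kind, _ in tokenize(text) if kind != "punct")


total = count_words("ฉันรักภาษาไทย 100% OK!")
# total == 5: three Thai words, "100", and "OK"; "%" and "!" are not counted
```

Because Thai segmentation only ever sees one uninterrupted Thai run at a time, a space, digit, or Latin letter can never end up inside a Thai token, which is the sense in which those boundaries are "hard".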

Tips and notes

Longest-match is fast and works well for everyday vocabulary, but it is a heuristic: proper names, technical terms, and compounds that aren’t in the dictionary may be mis-split, and a single greedy choice can occasionally pick the wrong boundary. For important documents, scan the segmented word list shown below the count to confirm the split looks right. Everything runs locally in your browser, so your text is never uploaded.
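The "wrong boundary" risk is easy to reproduce with a classic ambiguous string. ตากลม can be read as ตา + กลม ("round eyes") or as ตาก + ลม ("exposed to the wind"), and a greedy matcher always commits to the second reading because ตาก is longer than ตา. The sketch below is illustrative only: `greedy_split`, `all_splits`, and the four-word `WORDS` set are hypothetical and not taken from the tool.

```python
def greedy_split(text, dictionary):
    """Greedy longest-match; falls back to one character when nothing matches."""
    longest = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        step = 1
        for length in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                step = length
                break
        tokens.append(text[i:i + step])
        i += step
    return tokens


def all_splits(text, dictionary):
    """Enumerate every way to cover the text using dictionary words only."""
    if not text:
        return [[]]
    results = []
    for length in range(1, len(text) + 1):
        head = text[:length]
        if head in dictionary:
            for rest in all_splits(text[length:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical four-word dictionary, for illustration only.
WORDS = {"ตา", "ตาก", "กลม", "ลม"}

greedy = greedy_split("ตากลม", WORDS)
# greedy == ["ตาก", "ลม"]
candidates = all_splits("ตากลม", WORDS)
# candidates == [["ตา", "กลม"], ["ตาก", "ลม"]]
```

Both candidate splits contain two words, so the count is unaffected in this case, but the words themselves differ. This is the kind of discrepancy that a quick look at the segmented word list will catch.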