Why is Japanese word counting only an estimate?

Japanese is written without spaces between words, and true word boundaries require a morphological dictionary. This tool uses script-boundary segmentation, a fast approximation that splits where the writing system changes, so the count is close but not exact.

How does script-boundary segmentation work?

The text is split wherever it switches between kanji, hiragana, katakana, and Latin runs. Because content words are often kanji or katakana and grammatical particles are hiragana, these boundaries line up reasonably well with real word breaks.
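
As a rough illustration of that splitting step, here is a minimal Python sketch. The names script_of and script_runs are hypothetical, and the Unicode ranges are a simplifying assumption (basic hiragana, katakana, and CJK Unified Ideographs blocks only); this is not the tool's actual code.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify one character by writing system (simplified ranges, an assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "latin"
    return "other"  # punctuation, whitespace, symbols


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group adjacent characters of the same script into (script, run) pairs."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]
```

Under these assumptions, script_runs("東京タワーへ行く") gives a kanji run 東京, a katakana run タワー, a hiragana run へ, a kanji run 行, and a hiragana run く.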

Why split hiragana runs further?

A single hiragana run frequently contains a content word plus particles, so the tool also splits long hiragana runs into shorter chunks. This brings the estimate closer to what a dictionary-based tokenizer would produce.
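
The page does not say how the tool picks its chunk sizes, so the following Python sketch assumes the simplest possible rule: cut a run into fixed-size pieces. The function name split_long_run and the default limit of 3 characters are hypothetical.

```python
def split_long_run(run: str, max_len: int = 3) -> list[str]:
    """Cut a run into consecutive chunks of at most max_len characters.

    Fixed-size chunking and the default max_len are assumptions for
    illustration; the real tool may use a different rule.
    """
    return [run[i:i + max_len] for i in range(0, len(run), max_len)]
```

With this rule, a six-character run such as これはとても becomes two chunks, これは and とても, while a short run such as へ is left whole.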

Will the count match a tool like MeCab?

Not exactly. MeCab uses a dictionary and a statistical model, so it is more accurate. This browser tool trades that accuracy for speed and privacy, and is meant for quick estimates rather than linguistic analysis.

Is my text uploaded anywhere?

No. Segmentation and counting run locally in your browser. Nothing you paste leaves your device.

Japanese Word Counter

Count words in Japanese text

Japanese has no spaces between words, so ordinary word counters report one giant “word” per line. This tool segments your text at script boundaries (the points where kanji, hiragana, katakana, and Latin runs meet) to produce a fast, reasonable estimate of the word count.

How it works

The text is scanned character by character and a boundary is inserted whenever the writing system changes:

東京  →  東京     (kanji run = one token)
へ行きました  →  へ / 行き / ました (split at script + run length)

Adjacent characters of the same script form a run; a script change starts a new token.
Long hiragana runs are split further, since they often pack a content word together with grammatical particles.
Punctuation and whitespace act as hard separators and are not counted.
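
The three rules above can be combined into a single counting pass. The Python sketch below does this with one regular expression; the name estimate_word_count, the character ranges, and the MAX_HIRAGANA threshold are all assumptions, so its output will not necessarily match the tool's count.

```python
import re

# One alternative per script. Punctuation and whitespace match none of them,
# so they separate tokens without being counted.
RUN = re.compile(
    r"(?P<hiragana>[\u3040-\u309F]+)"  # hiragana run
    r"|[\u30A0-\u30FF]+"               # katakana run
    r"|[\u4E00-\u9FFF]+"               # kanji run (CJK Unified Ideographs)
    r"|[A-Za-z0-9]+"                   # Latin letters and digits
)

MAX_HIRAGANA = 3  # hypothetical chunk size for long hiragana runs


def estimate_word_count(text: str) -> int:
    """Estimate the word count from script runs (a sketch, not the tool's code)."""
    count = 0
    for match in RUN.finditer(text):
        if match.group("hiragana"):
            # A hiragana run counts as ceil(length / MAX_HIRAGANA) chunks.
            count += -(-len(match.group()) // MAX_HIRAGANA)
        else:
            count += 1
    return count
```

For plain space-separated Latin text this reduces to an ordinary word count: estimate_word_count("Hello, world") returns 2.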

Example and notes

Because content words tend to be kanji or katakana and particles tend to be hiragana, script boundaries approximate real word breaks well for everyday prose. It is an estimate, not a dictionary tokenizer: heavily inflected verbs or compound terms may be over- or under-split. For exact morphological counts use a MeCab-class tool. All processing happens in your browser.