What is a Unicode general category?

Every Unicode code point is assigned exactly one general category — a two-letter code such as Lu (uppercase letter) or Nd (decimal digit). The first letter names the broad group (L letters, N numbers, P punctuation, and so on) and the second narrows it down.

How do I match a category in a regular expression?

In engines that support Unicode property escapes, use \p{Code}. For example \p{Lu} matches any uppercase letter and \p{Nd} matches any decimal digit. You can also match a whole group with \p{L} for all letters.
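
As a rough illustration in JavaScript, the three escapes above behave as follows. The sample string and variable names are invented for this sketch, and it assumes an engine that supports property escapes (ES2018 or later):

```javascript
// Unicode property escapes need the u (or v) flag in JavaScript.
// Hypothetical sample: U+00EB is "e with diaeresis", U+0663 is Arabic-Indic digit three.
const sample = "Zo\u00EB has \u0663 cats and 2 DOGS";

// \p{Lu}: uppercase letters only
const uppercase = sample.match(/\p{Lu}/gu) ?? [];

// \p{Nd}: decimal digits in any script (here U+0663 and ASCII 2)
const digits = sample.match(/\p{Nd}/gu) ?? [];

// \p{L}+: runs of characters from the whole Letter group (Lu, Ll, Lt, Lm, Lo)
const letterRuns = sample.match(/\p{L}+/gu) ?? [];

console.log(uppercase, digits, letterRuns);
```

Note that the digit match picks up the Arabic-Indic digit as well as the ASCII one, which [0-9] would not.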

What is the difference between Nd, Nl and No?

Nd covers decimal digits, the characters that combine positionally into base-ten numbers (0–9 and their equivalents in other scripts). Nl covers letter-like numerals such as Roman numerals. No covers other numeric characters, such as fractions and superscripts, that are not decimal digits.
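
To make the distinction concrete, here is a small JavaScript sketch that reports which Number subcategory a character falls into. The function name numberKind is made up for this example:

```javascript
// Report which Number subcategory (if any) a single character belongs to.
// numberKind is a hypothetical helper, not part of any library.
function numberKind(ch) {
  if (/^\p{Nd}$/u.test(ch)) return "Nd";
  if (/^\p{Nl}$/u.test(ch)) return "Nl";
  if (/^\p{No}$/u.test(ch)) return "No";
  return null; // not a Number character at all
}

// "7" ASCII digit, U+0967 Devanagari digit one, U+216B Roman numeral twelve,
// U+00BD vulgar fraction one half, U+00B2 superscript two, "x" a plain letter
const kinds = ["7", "\u0967", "\u216B", "\u00BD", "\u00B2", "x"].map(numberKind);
console.log(kinds); // ["Nd", "Nd", "Nl", "No", "No", null]
```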

Why do some categories like Cs and Co matter?

Cs marks surrogate code points used by UTF-16 and Co marks private-use characters with no standard meaning. Knowing them helps you filter out code points that should not appear in normal text or that carry app-specific meaning.
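
One way to check for these code points in JavaScript is sketched below. The helper names are hypothetical, and the sketch relies on the fact that, with the u flag, a valid surrogate pair is read as a single code point, so \p{Cs} only matches lone surrogates:

```javascript
// Lone surrogates (Cs) only show up in ill-formed UTF-16. With the u flag a
// valid surrogate pair is treated as one code point and does not match \p{Cs}.
function hasLoneSurrogate(text) {
  return /\p{Cs}/u.test(text);
}

// Private-use code points (Co) have no standard meaning; list them as U+XXXX
// so they can be reviewed or filtered.
function privateUseCodePoints(text) {
  return (text.match(/\p{Co}/gu) ?? []).map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(hasLoneSurrogate("ok \u{1F600}"));            // false: valid pair
console.log(hasLoneSurrogate("bad \uD83D"));              // true: lone high surrogate
console.log(privateUseCodePoints("a\uE000b\u{F0000}"));   // ["U+E000", "U+F0000"]
```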

What is the Unicode General Categories reference?

A searchable reference to the Unicode general categories, showing the two-letter code, full name and example characters for each one. Look up Lu, Ll, Nd, Po and every other category, plus the regex property that matches it. It runs free in your browser on Gera Tools, with nothing uploaded.

Unicode General Categories

Name: Unicode General Categories
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Look up any Unicode category

The Unicode standard tags every code point with a general category — a compact two-letter code that says what kind of character it is. Lu is an uppercase letter, Nd a decimal digit, Po other punctuation, Zs a space separator. The first letter is the major class (Letter, Mark, Number, Punctuation, Symbol, Separator, Other) and the second letter refines it. This reference lists all 30 categories with names, groups, examples and the regex property that matches them.

The seven major classes and their subcategories

L — Letter (5 subcategories)

Lu Uppercase letter: A, Б, Α, Ж
Ll Lowercase letter: a, б, α, ж
Lt Titlecase letter: ǅ (U+01C5), ǈ (U+01C8), ǋ (U+01CB) (rare mixed-case digraphs)
Lm Modifier letter: spacing modifier letters used phonetically
Lo Other letter: CJK ideographs, Arabic letters, Korean syllables

M — Mark (3 subcategories)

Mn Nonspacing mark: combining accents, diacritics (U+0301 combining acute)
Mc Spacing combining mark: South Asian vowel signs that take space
Me Enclosing mark: enclosing circles, squares around base characters

N — Number (3 subcategories)

Nd Decimal digit: 0–9 in any script (Arabic-Indic, Devanagari digits, etc.)
Nl Letter number: Roman numerals Ⅰ–Ⅻ, ancient Greek acrophonic numerals
No Other number: fractions ½ ¼, superscript ², subscript ₂, enclosed numbers

P — Punctuation (7 subcategories): Pc connector, Pd dash, Ps open, Pe close, Pi initial quote, Pf final quote, Po other

S — Symbol (4 subcategories): Sm math, Sc currency, Sk modifier, So other

Z — Separator (3 subcategories): Zs space, Zl line separator, Zp paragraph separator

C — Other (5 subcategories): Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned

How it works

The general category is a property assigned to each code point in the Unicode Character Database. When you ask a regex engine for \p{Lu}, it consults its own copy of that data and matches every code point whose category is Lu, so newly added characters are recognised only once the engine updates its Unicode version. Major-class escapes work too: \p{L} matches Lu, Ll, Lt, Lm and Lo together. The categories are mutually exclusive (a character belongs to exactly one), which is why they are reliable building blocks for tokenisers, validators and text filters.
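
The mutual exclusivity can be checked directly. The following JavaScript sketch tries all 30 category escapes against one code point and returns every match; GENERAL_CATEGORIES and categoriesOf are names invented for this example, and the input is assumed to be a non-empty string:

```javascript
// The 30 general category codes listed in this reference.
const GENERAL_CATEGORIES = [
  "Lu", "Ll", "Lt", "Lm", "Lo",
  "Mn", "Mc", "Me",
  "Nd", "Nl", "No",
  "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
  "Sm", "Sc", "Sk", "So",
  "Zs", "Zl", "Zp",
  "Cc", "Cf", "Cs", "Co", "Cn",
];

// Return every category whose \p{...} escape matches the first code point
// of a non-empty string (hypothetical helper for illustration).
function categoriesOf(text) {
  const ch = String.fromCodePoint(text.codePointAt(0));
  return GENERAL_CATEGORIES.filter((code) =>
    new RegExp(`^\\p{${code}}$`, "u").test(ch)
  );
}

console.log(categoriesOf("A"));      // ["Lu"]
console.log(categoriesOf("\u0301")); // ["Mn"] (combining acute accent)
console.log(categoriesOf("\u20AC")); // ["Sc"] (euro sign)
```

Because each code point has exactly one category, the returned array should always contain a single entry. Building 30 regexes per call is slow, so this is only a way to explore the classification, not something to run in a hot path.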

Practical regex patterns using categories

# Match any letter in any script (replaces [A-Za-z] for international text)
\p{L}

# Match any decimal digit in any script (replaces [0-9])
\p{Nd}

# Match any space separator (tab and line breaks are Cc, so they are not included)
\p{Zs}

# Strip nonspacing combining marks such as accents (use after NFD normalisation)
# by replacing every match with the empty string; \p{M} also covers Mc and Me
\p{Mn}

# Match any punctuation
\p{P}

# Match currency symbols in any currency
\p{Sc}

# Detect control, format, surrogate, private-use and unassigned code points
# (tab and newline are Cc, so allow them separately if your text needs them)
\p{C}

In JavaScript, Unicode property escapes (\p{...}) only work when the regex has the u or v flag, as in /\p{L}+/u.
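
As a hedged example of the mark-stripping pattern and the flag requirement together, the JavaScript sketch below decomposes text with the built-in String.prototype.normalize and then removes every \p{Mn} match. The function name stripAccents is invented here:

```javascript
// Decompose to NFD so accents become separate Mn code points, then drop them.
// stripAccents is a hypothetical helper name.
function stripAccents(text) {
  return text.normalize("NFD").replace(/\p{Mn}/gu, "");
}

console.log(stripAccents("caf\u00E9"));            // "cafe"
console.log(stripAccents("\u00C5ngstr\u00F6m"));   // "Angstrom"

// Without the u flag, \p is not a property escape: the pattern matches the
// literal characters "p{L}" instead of any letter.
const withFlag = /\p{L}/u.test("\u00E9");    // true
const withoutFlag = /\p{L}/.test("\u00E9");  // false
console.log(withFlag, withoutFlag);
```

This suits accent-insensitive matching of Latin text. In scripts where nonspacing marks carry essential meaning (Devanagari, Arabic, Hebrew), removing them changes the words, so apply it selectively.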

Tips for avoiding common mistakes

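When validating “letters and digits”, prefer \p{L} and \p{Nd} over the ASCII-only [A-Za-z0-9] so you accept international names and CJK usernames. For scripts that write vowels and accents as combining marks, such as Devanagari, allow \p{M} as well, or words will be rejected at the first mark. Strip layout noise by excluding \p{C} (the Other group), but remember that tab and newline are Cc and that format characters (Cf) such as the zero-width joiner are required by some emoji sequences and scripts. Nd covers only decimal digits: Roman numerals are Nl and fractions are No, so a “numbers” filter using only Nd will miss them; use \p{N} to include all three. Surrogate code points (Cs) should never appear in well-formed text, so their presence usually indicates encoding corruption or a bug in string handling. Unassigned code points (Cn) can point to the same problems, or to text that uses characters newer than the Unicode version your engine knows.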
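
A minimal JavaScript sketch of such a validator follows. The pattern and the names WORD_CHARS and isInternationalAlnum are assumptions made for this example; adjust the character class to your own rules:

```javascript
// Letters, combining marks and decimal digits from any script.
const WORD_CHARS = /^[\p{L}\p{M}\p{Nd}]+$/u;

// Hypothetical validator: true when the whole string is letters, marks, digits.
function isInternationalAlnum(text) {
  return WORD_CHARS.test(text);
}

// U+0928 U+092E U+0938 U+094D U+0924 U+0947: a Devanagari word with two marks
const devanagari = "\u0928\u092E\u0938\u094D\u0924\u0947";

console.log(isInternationalAlnum("Jos\u00E9"));        // true
console.log(isInternationalAlnum("\u6771\u4EAC123"));  // true (CJK plus digits)
console.log(isInternationalAlnum(devanagari));         // true
console.log(/^[\p{L}\p{Nd}]+$/u.test(devanagari));     // false: marks rejected
console.log(isInternationalAlnum("user_1"));           // false: _ is Pc
```

Including \p{M} is the design choice that matters here: without it, any word containing a combining mark fails validation.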