What is a Unicode code point?

A code point is the abstract numeric identity of a character, written like U+0041 for the letter A. It is independent of how the character is stored in bytes. The Unicode codespace runs from U+0000 to U+10FFFF, which gives 1,114,112 possible code points; only a fraction of them are assigned to characters so far.

Why does an emoji count as one code point but two UTF-16 units?

Code points above U+FFFF live in the astral planes and are stored in UTF-16 as a surrogate pair, two 16-bit units. This tool iterates the string by code point, so a single emoji is one row even though the UTF-16 column shows two units.
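The arithmetic behind a surrogate pair can be sketched in a few lines. This is only an illustration: the helper name toSurrogatePair is made up for this example and is not claimed to be part of the tool.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its two UTF-16 units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const grin = "\u{1F600}"; // U+1F600, an emoji in the Emoticons block
const pair = toSurrogatePair(grin.codePointAt(0));
console.log(pair.map((u) => u.toString(16).toUpperCase()).join(" ")); // D83D DE00
console.log(grin.length);      // 2 (UTF-16 units)
console.log([...grin].length); // 1 (code point)
```

Spreading a string, like for...of, walks it by code point, which is why the last line reports 1 while length reports 2.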

What does the general category mean?

Every Unicode character has a two-letter category such as Lu (uppercase letter), Nd (decimal digit), or Sc (currency symbol). It classifies the character's role and is used by regular expressions, word breaking, and validation rules.
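One way to look up a category in JavaScript is to probe a character against property escapes one at a time. The sketch below assumes a short, hand-picked list of categories (the full set has 30) and a hypothetical helper named generalCategory.

```javascript
// A few two-letter general categories to probe; this list is deliberately incomplete.
const CATEGORIES = ["Lu", "Ll", "Nd", "Sc", "So", "Zs", "Cc", "Cf", "Mn", "Po"];

// Hypothetical helper: return the first listed category that matches, or null.
function generalCategory(ch) {
  for (const gc of CATEGORIES) {
    // The "u" flag is required for \p{...} property escapes.
    if (new RegExp(`^\\p{${gc}}$`, "u").test(ch)) return gc;
  }
  return null;
}

console.log(generalCategory("A"));      // Lu
console.log(generalCategory("7"));      // Nd
console.log(generalCategory("\u20AC")); // Sc (euro sign)
```

Each character has exactly one general category, so the order of the list does not change the result; it only affects how many patterns are tried.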

How is the UTF-8 encoding shown?

The tool computes the actual UTF-8 byte sequence for each code point. ASCII characters take one byte; accented Latin letters, Greek, Cyrillic, and Hebrew take two; most other scripts in the Basic Multilingual Plane, including CJK, take three; and most emoji take four. The bytes are shown in hexadecimal.

Can it identify the Unicode block?

Yes. The code point is matched against the standard block ranges, so you can see whether a character belongs to Basic Latin, Cyrillic, Arabic, CJK Unified Ideographs, the Emoticons block, and so on.
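A block lookup of this kind amounts to a range search. The following is a minimal sketch with a small excerpt of the block table; the array BLOCKS and the function blockOf are illustrative names, and the tool's own table is presumably much longer.

```javascript
// A tiny excerpt of the Unicode block ranges; the full list has more than 300 entries.
const BLOCKS = [
  [0x0000, 0x007f, "Basic Latin"],
  [0x0080, 0x00ff, "Latin-1 Supplement"],
  [0x0370, 0x03ff, "Greek and Coptic"],
  [0x0400, 0x04ff, "Cyrillic"],
  [0x0600, 0x06ff, "Arabic"],
  [0x4e00, 0x9fff, "CJK Unified Ideographs"],
  [0x1f600, 0x1f64f, "Emoticons"],
];

// Hypothetical helper: find the block whose range contains the code point.
function blockOf(codePoint) {
  const hit = BLOCKS.find(([start, end]) => codePoint >= start && codePoint <= end);
  return hit ? hit[2] : "(not in this excerpt)";
}

console.log(blockOf("A".codePointAt(0)));         // Basic Latin
console.log(blockOf("\u0430".codePointAt(0)));    // Cyrillic
console.log(blockOf("\u{1F600}".codePointAt(0))); // Emoticons
```

With the full table, a binary search over the sorted ranges would be the natural replacement for the linear find.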

What is the Unicode Code Point Inspector?

Inspect any string character by character: see each Unicode code point in U+ notation, its general category, its Unicode block, its UTF-8 bytes, and its UTF-16 units. It handles emoji and astral characters, and it runs free in your browser on Gera Tools, with nothing uploaded.

Unicode Code Point Inspector

Name: Unicode Code Point Inspector
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Mystery characters in text — an invisible control byte, a look-alike Cyrillic letter, or an emoji that breaks a database column — are easy to misread. This inspector breaks any string into its individual Unicode code points and shows the full identity of each one.

How it works

The tool iterates the string by code point rather than by UTF-16 unit, so emoji and other astral characters are treated as single characters. For each one it reports:

the code point in U+XXXX notation via codePointAt,
the general category (such as Lu, Nd, or So), derived from the browser’s Unicode property escapes like \p{Lu},
the Unicode block, matched against the standard range table,
the UTF-8 bytes, computed directly from the code point,
the UTF-16 units that make up the JavaScript string.
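The iteration described above could look roughly like the following. It is a sketch under assumptions, not the tool's source: the function name inspect and the row shape are invented here, and only the code point and UTF-16 columns are filled in.

```javascript
// Hypothetical row builder: one entry per code point, not per UTF-16 unit.
function inspect(str) {
  const rows = [];
  // for...of on a string walks code points, so a surrogate pair arrives as one item.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const utf16 = [];
    for (let i = 0; i < ch.length; i++) {
      utf16.push(ch.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
    }
    rows.push({
      char: ch,
      codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      utf16,
    });
  }
  return rows;
}

console.log(inspect("A\u{1F600}"));
```

For the two-character input in the last line, the result has two rows: one with a single UTF-16 unit and one with a surrogate pair.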

When to reach for this tool

Debugging “same string, different comparison result”: Two strings that look identical in a text editor can have different byte sequences. The classic case is composed vs decomposed Unicode: é as a single precomposed U+00E9 versus e + U+0301 (combining acute). This inspector shows each code point, revealing the discrepancy immediately. Copy both strings into the inspector and compare the code-point lists line by line.
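The composed versus decomposed case is easy to reproduce. The snippet below uses the standard String.prototype.normalize method as one common way to make such strings comparable; whether normalization is appropriate depends on the application.

```javascript
const composed = "\u00E9";    // é as one precomposed code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(composed === decomposed); // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true

// Listing code points makes the difference visible.
const codePoints = (s) =>
  [...s].map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

console.log(codePoints(composed).join(" "));   // U+00E9
console.log(codePoints(decomposed).join(" ")); // U+0065 U+0301
```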

Finding invisible characters: zero-width spaces (U+200B), word joiners (U+2060), and left-to-right marks (U+200E) are invisible in editors but affect string operations, sorting, and rendering. The inspector lists each one with its code point and general category (Cf, format), so they cannot hide.
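A scripted version of the same check might scan for format (Cf) and control (Cc) characters. This is a rough sketch; findInvisible is a hypothetical name, and note that it also flags tabs and newlines, which are category Cc.

```javascript
// Hypothetical helper: report format (Cf) and control (Cc) characters with their positions.
function findInvisible(str) {
  const hits = [];
  let index = 0; // position counted in code points, not UTF-16 units
  for (const ch of str) {
    if (/^[\p{Cf}\p{Cc}]$/u.test(ch)) {
      hits.push({
        index,
        codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
      });
    }
    index++;
  }
  return hits;
}

console.log(findInvisible("pay\u200Bpal")); // one hit: index 3, U+200B
```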

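Database column truncation: whether VARCHAR(255) means 255 characters or 255 bytes depends on the database and its configuration. MySQL counts characters, but its legacy utf8 (utf8mb3) character set cannot store 4-byte characters at all, and byte-based limits still apply to index keys and row size. When a string containing 4-byte emoji triggers a truncation or encoding error, the UTF-8 byte count for each character makes it easy to see which characters are consuming the most storage.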
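To compare character count and byte count outside the tool, the standard TextEncoder API (which always produces UTF-8) is enough. The helper name utf8Budget below is made up for this sketch.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: bytes used by each code point, plus the totals.
function utf8Budget(str) {
  const perChar = [...str].map((ch) => ({ char: ch, bytes: encoder.encode(ch).length }));
  const total = perChar.reduce((sum, row) => sum + row.bytes, 0);
  return { characters: perChar.length, total, perChar };
}

const report = utf8Budget("na\u00EFve \u{1F600}"); // "naïve" plus a space and one emoji
console.log(report.characters, report.total); // 7 11
```

Here seven characters need eleven bytes: the ï takes two and the emoji takes four.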

Detecting look-alike homoglyphs: the Latin letter a (U+0061) and the Cyrillic letter а (U+0430) look identical in many fonts. The inspector shows their distinct code points and blocks, which is the first step in detecting homoglyph phishing or spoofed domain names.
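As a crude first pass at homoglyph detection, a string can be tested for mixed scripts using Script property escapes. The function name below is hypothetical, and a real check would need more care, since legitimate text can mix scripts.

```javascript
// Hypothetical check: does a string contain both Latin and Cyrillic letters?
function mixesLatinAndCyrillic(str) {
  const hasLatin = /\p{Script=Latin}/u.test(str);
  const hasCyrillic = /\p{Script=Cyrillic}/u.test(str);
  return hasLatin && hasCyrillic;
}

console.log(mixesLatinAndCyrillic("paypal"));      // false
console.log(mixesLatinAndCyrillic("p\u0430ypal")); // true, the second letter is U+0430
```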

Understanding emoji sequences: a single rendered emoji can be a sequence of multiple code points: a base character, a variation selector, a skin-tone modifier, and zero-width joiners linking multiple emoji into one glyph. The inspector lists every component, making it possible to understand why string.length reports 11 for what appears to be a single family emoji (four emoji of two UTF-16 units each, plus three zero-width joiners).
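The family emoji count can be verified directly. The string below is written with escapes so that the individual components stay visible in the source.

```javascript
// MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);      // 11 UTF-16 units (4 surrogate pairs + 3 joiners)
console.log([...family].length); // 7 code points

const parts = [...family].map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase());
console.log(parts.join(" "));
// U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
```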

UTF-8 byte encoding reference

Code point range	UTF-8 bytes	Examples
U+0000–U+007F	1 byte	ASCII characters
U+0080–U+07FF	2 bytes	Latin accents, Greek, Cyrillic, Hebrew
U+0800–U+FFFF	3 bytes	CJK, most of the BMP
U+10000–U+10FFFF	4 bytes	Emoji, historic scripts, rare CJK
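The ranges in the table translate into a short encoder. The following is a minimal sketch of UTF-8 encoding from a code point, with a hypothetical function name utf8Bytes; it is not claimed to be the tool's own code.

```javascript
// Sketch of UTF-8 encoding from a single code point, following the four ranges above.
function utf8Bytes(cp) {
  // Surrogates (U+D800 to U+DFFF) are not valid scalar values and cannot be encoded.
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not an encodable code point");
  }
  if (cp <= 0x7f) return [cp];
  if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp <= 0xffff) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(utf8Bytes(0x41)));    // 41
console.log(hex(utf8Bytes(0xe9)));    // C3 A9
console.log(hex(utf8Bytes(0x20ac)));  // E2 82 AC
console.log(hex(utf8Bytes(0x1f600))); // F0 9F 98 80
```

The leading byte encodes the sequence length in its high bits, and every continuation byte carries six payload bits behind a 10 prefix.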

Tips

Use the UTF-8 column to debug encoding bugs: a character that should take two bytes but shows up as two characters and four bytes (é appearing as Ã©) often means the text was double-encoded, that is, its UTF-8 bytes were interpreted as Latin-1 and then re-encoded as UTF-8.

The category column helps when writing regular expressions: \p{Nd} (with the u flag in JavaScript) matches any decimal digit across all scripts, not just 0–9.

Watch for control characters (category Cc), which display as a ctrl marker here because they have no visible glyph but can corrupt files and break parsers.

All inspection runs locally in your browser.
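The double-encoding case from the tips can be reproduced in three steps. This sketch treats each byte as a Latin-1 character via String.fromCharCode, which is an assumption about how the misreading happened; real pipelines often use Windows-1252 instead, which differs for bytes 0x80 to 0x9F.

```javascript
const encoder = new TextEncoder();

// Step 1: the correct UTF-8 bytes for é (U+00E9).
const once = encoder.encode("\u00E9"); // C3 A9

// Step 2: the bug. Each byte is misread as one Latin-1 character.
const misread = String.fromCharCode(...once); // "Ã©"

// Step 3: the misread text is encoded as UTF-8 again.
const twice = encoder.encode(misread); // C3 83 C2 A9

const hex = (bytes) =>
  [...bytes].map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(once), "->", hex(twice)); // C3 A9 -> C3 83 C2 A9
```

Pasting the misread string into the inspector would show U+00C3 and U+00A9 where a single U+00E9 was expected.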