What is a Unicode code point?

A code point is the numeric identity Unicode assigns to a character, written like U+0041 for the letter A. It is independent of how the character is stored in bytes, which is what UTF-8 or UTF-16 decide.

Does it handle emoji and astral characters?

Yes. The tool iterates real code points, so a character above U+FFFF such as an emoji is reported as a single code point rather than as the two UTF-16 surrogate halves.

What is the difference between a code point and a code unit?

A code unit is a fixed-size storage piece — 16 bits in UTF-16, 8 bits in UTF-8. A single code point can need several code units. This tool reports code points, the logical characters.
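
To make the distinction concrete, here is a minimal sketch that measures one string three ways. The function name measure is illustrative only and is not part of the tool; it assumes a runtime that provides TextEncoder (modern browsers and Node.js do).

```javascript
// Count the same text in UTF-16 code units, UTF-8 code units (bytes), and code points.
function measure(text) {
  return {
    utf16CodeUnits: text.length,                          // String.length counts UTF-16 code units
    utf8CodeUnits: new TextEncoder().encode(text).length, // one UTF-8 code unit is one byte
    codePoints: Array.from(text).length,                  // string iteration walks code points
  };
}

// Letter A followed by U+1F30D (the earth emoji).
console.log(measure("A\u{1F30D}"));
// utf16CodeUnits: 3, utf8CodeUnits: 5, codePoints: 2
```

The emoji alone accounts for two UTF-16 code units and four UTF-8 bytes, yet it is a single code point.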

Which notation should I use?

Use U+XXXX for documentation and Unicode references, hexadecimal 0x form for many programming contexts, and decimal when an API or data format expects raw integers.

Why do some characters show as combined?

Some visible glyphs are made of a base character plus combining marks, each its own code point. The tool lists each code point separately, which is the accurate low-level view even if they render as one symbol.

What is the UTF-8 to Unicode Code Points tool?

UTF-8 to Unicode Code Points converts any UTF-8 text into a list of Unicode code points in U+XXXX, hexadecimal, or decimal notation. It handles emoji and astral characters correctly by iterating real code points rather than UTF-16 code units. The tool is free and runs instantly in your browser on Gera Tools, with nothing uploaded.

UTF-8 to Unicode Code Points

Name: UTF-8 to Unicode Code Points
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

From text to code points

This tool breaks a string into the individual Unicode code points behind it. Where a byte-counter cares about storage and an editor cares about glyphs, this view shows the logical characters Unicode actually assigns numbers to — useful when debugging encoding issues, building escape sequences, or inspecting mysterious whitespace.

How it works

JavaScript strings are stored as UTF-16, so a naive character loop would split astral characters (anything above U+FFFF, like most emoji) into two surrogate halves. To avoid that, the tool iterates the string with a code-point-aware loop and reads each character’s codePointAt(0). Each value is then formatted in the notation you choose:

U+XXXX  -> "U+" + hex, upper-case, zero-padded to at least 4 digits
0x form -> "0x" + hex
decimal -> the plain integer

For example, the earth emoji is the single code point U+1F30D, not the surrogate pair U+D83C U+DF0D that UTF-16 would store it as.
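
The approach described above can be sketched roughly as follows. This is not the tool's actual source: the function name toCodePoints, the notation parameter, and the choice of upper-case digits in the 0x form are all assumptions made for illustration.

```javascript
// Hypothetical helper: convert a string to a list of formatted code points.
function toCodePoints(text, notation = "u+") {
  const out = [];
  // for...of iterates by code point, so a surrogate pair arrives as one item.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const hex = cp.toString(16).toUpperCase();
    if (notation === "u+") {
      out.push("U+" + hex.padStart(4, "0")); // pad to at least 4 digits
    } else if (notation === "0x") {
      out.push("0x" + hex);                  // upper-case digits are an assumption
    } else {
      out.push(String(cp));                  // plain decimal integer
    }
  }
  return out;
}

console.log(toCodePoints("A\u{1F30D}"));            // U+0041, U+1F30D
console.log(toCodePoints("A\u{1F30D}", "0x"));      // 0x41, 0x1F30D
console.log(toCodePoints("A\u{1F30D}", "decimal")); // 65, 127757
```

By contrast, indexing the same string with text[1] and text[2] would return the two surrogate halves, which is the behaviour the code-point-aware loop avoids.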

Practical uses

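Debugging escape sequences in code. A Python or Java \uXXXX escape and a CSS content: "\XXXX" value all take the hexadecimal digits of the code point, which is exactly what the U+XXXX notation shows. Paste the character here to get the exact code-point number rather than guessing from a Unicode chart. For code points above U+FFFF, Python uses the eight-digit \UXXXXXXXX form, and Java needs a surrogate pair of two \uXXXX escapes.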
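
As a rough illustration of turning a code-point number into source-code escapes, the sketch below builds Python and CSS forms. The helper names pythonEscape and cssEscape are hypothetical and are not features of the tool.

```javascript
// Hypothetical helpers: build source-code escapes from a numeric code point.
function pythonEscape(cp) {
  const hex = cp.toString(16).toUpperCase();
  // Python string literals use \uXXXX up to U+FFFF and \UXXXXXXXX above it.
  return cp <= 0xffff ? "\\u" + hex.padStart(4, "0") : "\\U" + hex.padStart(8, "0");
}

function cssEscape(cp) {
  // CSS accepts 1 to 6 hex digits after the backslash; padding to the full
  // six means a following hex-looking character cannot extend the escape.
  return "\\" + cp.toString(16).toUpperCase().padStart(6, "0");
}

console.log(pythonEscape(0x00e9));  // \u00E9
console.log(pythonEscape(0x1f30d)); // \U0001F30D
console.log(cssEscape(0x1f30d));    // \01F30D
```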

Identifying invisible characters. Zero-width spaces (U+200B), zero-width non-joiners (U+200C), left-to-right marks (U+200E), and various other invisible code points can cause mysterious string-equality failures in code. Pasting suspect text here lists every code point, making invisible characters visible.
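
The same check can be scripted. The sketch below is a simplified illustration, not the tool's code: the lookup table INVISIBLE_NAMES and the function findInvisible are hypothetical, and the table covers only the three code points named above, whereas a real checker would need many more.

```javascript
// Small illustrative table; extend it for real use.
const INVISIBLE_NAMES = new Map([
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200c, "ZERO WIDTH NON-JOINER"],
  [0x200e, "LEFT-TO-RIGHT MARK"],
]);

// Report every listed invisible code point with its position in the text.
function findInvisible(text) {
  const hits = [];
  let index = 0; // position counted in code points, not UTF-16 code units
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (INVISIBLE_NAMES.has(cp)) {
      hits.push({
        index,
        codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
        name: INVISIBLE_NAMES.get(cp),
      });
    }
    index += 1;
  }
  return hits;
}

const clean = "admin";
const suspect = "ad\u200Bmin"; // looks identical on screen
console.log(clean === suspect);      // false
console.log(findInvisible(suspect)); // one hit: index 2, U+200B, ZERO WIDTH SPACE
```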

Understanding emoji composition. Many emoji that appear as a single glyph are actually sequences of multiple code points: a base emoji followed by a skin-tone modifier (U+1F3FB through U+1F3FF), or a family emoji built from several person emoji joined by zero-width joiners (U+200D). This tool shows each code point separately, which is the accurate representation even when the renderer collapses them into one symbol.
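
A short sketch of what such sequences look like when listed. The helper listCodePoints is a hypothetical name, and the two sample strings are examples chosen for illustration.

```javascript
// List each code point of a string in U+XXXX form.
function listCodePoints(text) {
  // Array.from uses the string iterator, so it maps over code points.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Waving hand followed by the medium skin-tone modifier.
const wave = "\u{1F44B}\u{1F3FD}";
// Man, zero-width joiner, woman, zero-width joiner, girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(listCodePoints(wave));   // U+1F44B, U+1F3FD
console.log(listCodePoints(family)); // U+1F468, U+200D, U+1F469, U+200D, U+1F467
```

Each sequence usually renders as one glyph where the font supports it, but the list shows two and five code points respectively.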

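Generating JSON or XML escape sequences. JSON permits \uXXXX escapes for any character, and they are needed wherever the output must stay ASCII-only. For code points up to U+FFFF, the U+XXXX output from this tool maps directly to the escape; characters above U+FFFF need two \u surrogate escapes in JSON. XML numeric character references (&#xXXXX;) take the code point itself, whatever its size.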
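
The surrogate arithmetic for JSON can be sketched as below. The function names toJsonEscape and toXmlReference are hypothetical helpers for illustration and are not part of the tool.

```javascript
// Hypothetical helper: JSON \u escape(s) for one code point.
function toJsonEscape(codePoint) {
  const hex4 = (n) => "\\u" + n.toString(16).toUpperCase().padStart(4, "0");
  if (codePoint <= 0xffff) return hex4(codePoint);
  // Above U+FFFF: split into a UTF-16 surrogate pair.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return hex4(high) + hex4(low);
}

// Hypothetical helper: XML numeric character reference, no surrogates needed.
function toXmlReference(codePoint) {
  return "&#x" + codePoint.toString(16).toUpperCase() + ";";
}

console.log(toJsonEscape(0x00e9));    // \u00E9
console.log(toJsonEscape(0x1f30d));   // \uD83C\uDF0D
console.log(toXmlReference(0x1f30d)); // &#x1F30D;
```

Parsing the JSON escape back with JSON.parse yields the original single character, which is a quick way to confirm the pair is correct.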

When visible character count differs from code-point count

A Thai vowel mark, a Hebrew dagesh, a combining accent, or a diacritic in Latin script are all separate code points that appear to modify the preceding base character. Paste é composed with a combining acute accent (as opposed to the precomposed U+00E9) and the tool lists two code points: e (U+0065) followed by the combining acute accent (U+0301). This is the NFD (Normalization Form D) decomposed representation. The NFC precomposed form packs them into a single U+00E9. String comparison bugs often arise from mixing NFD and NFC forms — seeing the code points directly makes it obvious why two strings that look identical are not equal.
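
The NFC and NFD mismatch is easy to reproduce with the built-in String.prototype.normalize. The helper equalNormalized below is a hypothetical name, shown as one possible way to compare strings safely.

```javascript
const decomposed = "e\u0301"; // e + combining acute accent (two code points)
const precomposed = "\u00E9"; // precomposed é (one code point)

console.log(decomposed === precomposed);                  // false
console.log(Array.from(decomposed).length);               // 2
console.log(Array.from(precomposed).length);              // 1
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true

// Compare two strings after bringing both to the same normalization form.
function equalNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(equalNormalized(decomposed, precomposed));    // true
```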

Tips

If a single visible symbol reports as several code points, it is almost certainly a base letter plus combining diacritical marks, or an emoji built from a zero-width-joiner sequence — both are legitimate multi-code-point clusters. Use the U+XXXX notation when writing documentation or looking characters up in the Unicode charts, and decimal when feeding an API that expects integers. To go the other way, use the companion code-points-to-UTF-8 tool.
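
For readers who want to script the reverse direction, here is a minimal sketch built on String.fromCodePoint. It is not the companion tool's implementation; the function name fromCodePoints and the accepted token formats are assumptions.

```javascript
// Hypothetical helper: rebuild a string from U+XXXX, 0x, or decimal tokens.
function fromCodePoints(tokens) {
  return tokens
    .map((token) => {
      const t = token.trim();
      let cp;
      if (/^U\+[0-9a-f]+$/i.test(t)) cp = parseInt(t.slice(2), 16);
      else if (/^0x[0-9a-f]+$/i.test(t)) cp = parseInt(t.slice(2), 16);
      else if (/^[0-9]+$/.test(t)) cp = parseInt(t, 10);
      else throw new Error("Unrecognized code point: " + token);
      // String.fromCodePoint throws a RangeError for values above 0x10FFFF.
      return String.fromCodePoint(cp);
    })
    .join("");
}

// Mixed notations: A, the earth emoji, and é (decimal 233).
console.log(fromCodePoints(["U+0041", "0x1F30D", "233"]));
```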