How is UTF-32 different from UTF-16 and UTF-8?

UTF-32 uses a fixed four bytes for every code point, so indexing is trivial but it wastes space for ASCII. UTF-8 and UTF-16 are variable width and far more compact for typical text, which is why UTF-32 is rare outside internal processing.

Why does each character take exactly four bytes?

Unicode code points range up to U+10FFFF, which fits in 21 bits. UTF-32 rounds that up to a fixed 32-bit (four-byte) unit per code point, so every character is the same width regardless of value.
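As a quick arithmetic check of the 21-bit figure, here is a minimal Python sketch. Python is used purely for illustration and is not the tool's own code; the variable names are made up for the example.

```python
# Highest valid Unicode code point.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent that value.
bits_needed = MAX_CODE_POINT.bit_length()

# UTF-32 stores each code point in a 32-bit unit, so these high bits are always zero.
unused_bits = 32 - bits_needed

print(bits_needed, unused_bits)
```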

How are emoji handled?

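Most emoji are code points above U+FFFF, although some sit in the Basic Multilingual Plane and others (flags, skin-tone variants, ZWJ sequences) are built from several code points. The tool reads the text code point by code point using the string iterator (which combines surrogate pairs) and encodes each code point as one four-byte UTF-32 value, so a single emoji code point such as U+1F680 yields one value, not two.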
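The surrogate-combining step can be sketched as follows. This is an illustrative Python version of the standard UTF-16 surrogate arithmetic, not the tool's source; the function name combine_surrogates is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The rocket emoji U+1F680 is the pair D83D DE80 in UTF-16.
cp = combine_surrogates(0xD83D, 0xDE80)

# One code point becomes one four-byte UTF-32 value (little-endian here).
utf32_le = cp.to_bytes(4, "little")
print(f"U+{cp:04X}", " ".join(f"{b:02X}" for b in utf32_le))
```

Python strings are already sequences of code points, so this explicit step is only needed when the input arrives as UTF-16 code units.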

What is the difference between LE and BE here?

Little-endian writes the four bytes least-significant first; big-endian writes most-significant first. U+0041 is 41 00 00 00 in LE and 00 00 00 41 in BE. The code point value is identical, only byte order differs.
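A small Python sketch of the two byte orders, using the built-in int.to_bytes. The helper name utf32_bytes is an assumption made for this example.

```python
def utf32_bytes(code_point: int, byteorder: str) -> str:
    """Return the four UTF-32 bytes of one code point as spaced hex."""
    raw = code_point.to_bytes(4, byteorder)  # byteorder is "little" or "big"
    return " ".join(f"{b:02X}" for b in raw)

le = utf32_bytes(0x41, "little")
be = utf32_bytes(0x41, "big")
print(le)
print(be)

# Reading the bytes back with the matching order recovers the same value.
same_value = (
    int.from_bytes(bytes.fromhex(le), "little")
    == int.from_bytes(bytes.fromhex(be), "big")
)
```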

Does the output include a BOM?

No. The viewer shows only the encoded text. A UTF-32 file may begin with FF FE 00 00 (LE) or 00 00 FE FF (BE) as a byte order mark, which you would add separately.
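If you need a BOM, one way to add it is sketched below in Python, using the BOM constants from the standard codecs module. The function name with_bom is hypothetical and this is not part of the viewer.

```python
import codecs

def with_bom(text: str, little_endian: bool = True) -> bytes:
    """Encode text as UTF-32 and prepend the matching byte order mark."""
    if little_endian:
        return codecs.BOM_UTF32_LE + text.encode("utf-32-le")
    return codecs.BOM_UTF32_BE + text.encode("utf-32-be")

data = with_bom("A")

# Python's generic "utf-32" decoder reads the BOM to pick the byte order and strips it.
decoded = data.decode("utf-32")
```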

What is the UTF-32 Hex Viewer?

The UTF-32 Hex Viewer expands any text into its UTF-32 encoded byte sequence, with each Unicode code point shown as a fixed four-byte hexadecimal value in little-endian or big-endian order. Surrogate pairs are correctly combined. It runs free in your browser on Gera Tools, with nothing uploaded.

UTF-32 Hex Viewer — Gera Tools

Name: UTF-32 Hex Viewer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

UTF-32 is the simplest Unicode encoding: every code point is stored as a fixed four-byte value. This viewer expands any text into that fixed-width byte stream in hexadecimal, in either little-endian or big-endian order.

How it works

The tool iterates over the string by Unicode code point (using the string iterator, which correctly joins surrogate pairs into a single value above U+FFFF). Each code point cp is split into four bytes:

b0 =  cp        & 0xFF   (least significant)
b1 = (cp >> 8)  & 0xFF
b2 = (cp >> 16) & 0xFF
b3 = (cp >> 24) & 0xFF
LE = b0 b1 b2 b3
BE = b3 b2 b1 b0

Because the value never exceeds U+10FFFF, the most significant byte is always 00, but UTF-32 still reserves all four bytes for alignment and fixed indexing.
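The byte-splitting steps above can be sketched in Python as follows. This is an illustrative reimplementation under the stated formulas, not the tool's actual browser code; the function name utf32_hex is an assumption.

```python
def utf32_hex(text: str, little_endian: bool = True) -> str:
    """Expand text into spaced UTF-32 hex bytes, four per code point."""
    out = []
    for ch in text:              # Python iterates a str by code point
        cp = ord(ch)
        b0 = cp & 0xFF           # least significant byte
        b1 = (cp >> 8) & 0xFF
        b2 = (cp >> 16) & 0xFF
        b3 = (cp >> 24) & 0xFF   # always 0 for valid code points
        group = (b0, b1, b2, b3) if little_endian else (b3, b2, b1, b0)
        out.extend(f"{b:02X}" for b in group)
    return " ".join(out)

print(utf32_hex("A"))
print(utf32_hex("A", little_endian=False))
```

The result can be cross-checked against Python's built-in codecs: bytes.fromhex(utf32_hex(s)) should equal s.encode("utf-32-le").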

Example

The letter A (U+0041) becomes 41 00 00 00 in little-endian or 00 00 00 41 in big-endian. A rocket emoji (U+1F680) becomes the single value 80 F6 01 00 in little-endian (00 01 F6 80 in big-endian): one code point, four bytes, whereas in UTF-16 the same character would need a two-unit surrogate pair. UTF-32 is convenient for indexing because the code point count always equals the byte count divided by four.

The key advantage of UTF-32: fixed width

UTF-8 uses 1 to 4 bytes per code point; UTF-16 uses 2 or 4 bytes. Both require code to account for variable-width sequences to navigate a string correctly. UTF-32 removes that complexity: every code point is exactly four bytes, so finding the Nth code point is a multiply by four, not a scan. Indexing by code point is therefore O(1) rather than O(n), and substring boundaries can be located without decoding the preceding text. For this reason, some software uses UTF-32 (or the equivalent UCS-4) for in-memory strings even when the external format is UTF-8. CPython 3, for example, stores a string with four bytes per code point when it contains any code point above U+FFFF and uses narrower one- or two-byte storage otherwise, and some regular expression engines can likewise work on 32-bit code units.
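A minimal Python sketch of fixed-width lookup, assuming a UTF-32 buffer without a BOM. The function name nth_code_point is hypothetical.

```python
def nth_code_point(buf: bytes, n: int, byteorder: str = "little") -> int:
    """Return the n-th code point (0-based) of a BOM-less UTF-32 buffer."""
    offset = 4 * n  # the position is computed directly, with no scan
    if n < 0 or offset + 4 > len(buf):
        raise IndexError("code point index out of range")
    return int.from_bytes(buf[offset:offset + 4], byteorder)

buf = "a\U0001F680z".encode("utf-32-le")

count = len(buf) // 4            # code point count is byte count / 4
second = nth_code_point(buf, 1)  # the rocket emoji, found by offset alone
print(count, f"U+{second:04X}")
```

The same lookup on UTF-8 or UTF-16 bytes would have to walk the buffer from the start, because earlier code points may have different widths.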

When you actually encounter UTF-32 in practice

UTF-32 is rare in files and transmission because it wastes space: an ASCII file that takes 1 KB in UTF-8 takes 4 KB in UTF-32. However, you will see it in:

C/C++ wchar_t on Linux/macOS: on these platforms wchar_t is typically 32 bits, so wide-character strings are effectively UTF-32 (on Windows it is 16 bits). Functions like wcslen and wcscmp operate on these fixed-width units.
Python’s internal representation for strings: CPython picks the narrowest fixed width that fits the string, so only strings containing code points above U+FFFF use a UCS-4 (UTF-32-like) internal buffer.
Some binary data formats and custom databases: fixed-width fields for character data occasionally use UTF-32 to avoid variable-length complexity.
Comparing encodings: understanding what UTF-32 looks like clarifies why UTF-8 and UTF-16 made the trade-offs they did.
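One way to check the wchar_t width on the machine at hand is through Python's standard ctypes module, as in this short sketch. The result depends on the platform and C library, so treat the printed message as an observation about one system rather than a guarantee.

```python
import ctypes

# Width of the platform's C wchar_t in bytes, as reported by ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

if wchar_bytes == 4:
    print("wchar_t is 32 bits here: one UTF-32 unit per code point")
else:
    print(f"wchar_t is {wchar_bytes * 8} bits here")
```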

UTF-32 vs UTF-16 vs UTF-8: a concrete comparison

The word “café”, with é as the single code point U+00E9:

Encoding	Bytes for “café”	Notes
UTF-8	5 bytes	ASCII chars are 1 byte; é is 2 bytes
UTF-16 LE	8 bytes	every char is 2 bytes for this range
UTF-32 LE	16 bytes	every char is 4 bytes, always

An emoji like a thumbs-up (U+1F44D) shows where the fixed-width cost disappears:

Encoding	Bytes for single emoji	Notes
UTF-8	4 bytes	encoded as a 4-byte sequence
UTF-16 LE	4 bytes	surrogate pair, two 2-byte units
UTF-32 LE	4 bytes	single 4-byte value

For emoji and rare CJK characters in the supplementary (astral) planes, all three encodings use four bytes per code point. UTF-32’s overhead falls primarily on ASCII and common Latin text.
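The byte counts in both tables can be reproduced with Python's built-in codecs. This is a small sketch; the function name encoded_sizes is made up for the example, and the endian-specific codec names are used so that no BOM is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in each encoding, without a BOM."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16 LE": len(text.encode("utf-16-le")),
        "UTF-32 LE": len(text.encode("utf-32-le")),
    }

cafe = encoded_sizes("caf\u00e9")        # é as the single code point U+00E9
thumbs_up = encoded_sizes("\U0001F44D")  # one supplementary-plane code point

print(cafe)
print(thumbs_up)
```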