Why is the byte count larger than the character count?

UTF-8 uses one byte only for ASCII characters. Accented Latin letters take two bytes, most CJK characters take three, and emoji take four, so any non-ASCII text has more bytes than characters.

What is the difference between code points and UTF-16 code units?

A code point is one logical character. UTF-16 code units are the 16-bit pieces JavaScript stores; astral characters like emoji use two units, which is why the JS .length can exceed the real character count.

How many bytes does an emoji take?

Most emoji are a single code point above U+FFFF and use four UTF-8 bytes. Compound emoji built from zero-width-joiner sequences are several code points and can take a dozen or more bytes.

Why does the byte count matter?

Many database column and index limits, HTTP header sizes, cookie limits and SMS segment payloads are measured in bytes, not characters. A field that allows 255 bytes holds 255 ASCII characters, but only 127 two-byte accented letters or 63 four-byte emoji.

Is the byte count accurate?

Yes. The tool encodes your text with the browser's native TextEncoder, which produces the precise UTF-8 byte sequence defined by the Unicode standard, then reports its length.

What is the UTF-8 Byte Counter?

The UTF-8 Byte Counter counts the exact UTF-8 byte length of any string, plus its code points, UTF-16 code units, and an ASCII versus multi-byte character breakdown. It uses real TextEncoder byte counting, so it is accurate for emoji and accented text. It runs free and instantly in your browser on Gera Tools, with nothing uploaded.

UTF-8 Byte Counter — Gera Tools

Name: UTF-8 Byte Counter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Counting real UTF-8 bytes

Characters and bytes are not the same thing. A database column, an HTTP header, a file name, or an SMS segment is often limited by bytes, while what you see on screen is characters. This counter reports the exact UTF-8 byte length of your text so you can tell whether it will actually fit.

How it works

The tool encodes your string with the browser’s built-in TextEncoder, which implements the official UTF-8 rules, and reports the length of the resulting byte array. UTF-8 is variable-width:

U+0000 – U+007F      1 byte   (ASCII)
U+0080 – U+07FF      2 bytes  (accented Latin, Greek, Cyrillic, Hebrew, Arabic)
U+0800 – U+FFFF      3 bytes  (most CJK, symbols)
U+10000 – U+10FFFF   4 bytes  (emoji, rare scripts)
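The ranges above can be turned into a small function and checked against TextEncoder. This is a minimal sketch, not the tool's actual source; the names utf8Width and utf8LengthByRanges are made up for illustration.

```javascript
// Bytes needed to encode one code point in UTF-8, following the ranges above.
function utf8Width(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Sum the widths over code points.
// for...of iterates a string by code point, not by UTF-16 code unit.
function utf8LengthByRanges(text) {
  let total = 0;
  for (const ch of text) {
    total += utf8Width(ch.codePointAt(0));
  }
  return total;
}

// Cross-check: both numbers should agree for any input.
const sample = "café こんにちは 👋";
console.log(utf8LengthByRanges(sample), new TextEncoder().encode(sample).length);
```

In practice TextEncoder alone is enough; the hand-written version only makes the variable-width rule visible.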

Alongside the byte total it counts code points (logical characters), UTF-16 code units (the JavaScript .length value), and splits characters into ASCII versus multi-byte so the difference between counts is obvious.
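A rough sketch of how those four figures could be computed together. The function name countText and the shape of the returned object are assumptions for illustration, not the tool's real code.

```javascript
// Report bytes, code points, UTF-16 code units, and the ASCII / multi-byte split.
function countText(text) {
  const codePoints = [...text]; // spreading a string yields one entry per code point
  const ascii = codePoints.filter((ch) => ch.codePointAt(0) <= 0x7f).length;
  return {
    bytes: new TextEncoder().encode(text).length, // exact UTF-8 length
    codePoints: codePoints.length,
    codeUnits: text.length, // what JavaScript's .length reports
    ascii,
    multiByte: codePoints.length - ascii,
  };
}

console.log(countText("café 👋"));
```

For "café 👋" the three length counts all differ, which is the situation the breakdown is meant to expose.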

Where byte limits actually appear in practice

Knowing character count is rarely what you need for a technical constraint. Here are the byte-limited contexts where this counter is most useful:

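Database columns. Whether a length limit counts bytes or characters depends on the database and the column definition. In MySQL and PostgreSQL, VARCHAR(n) counts characters, so a VARCHAR(255) column in utf8mb4 holds 255 characters and may occupy up to 1,020 bytes. Byte limits still apply underneath: MySQL caps a row at 65,535 bytes, InnoDB caps index keys at 767 or 3,072 bytes depending on the row format, and Oracle’s VARCHAR2(n) counts bytes by default. MySQL’s legacy utf8 (utf8mb3) character set cannot store 4-byte characters at all, so an emoji inserted there is rejected or truncated.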

HTTP headers. The HTTP/1.1 specification does not define a hard limit, but servers enforce their own: Apache and Nginx, for example, default to roughly 8 KB per header line, and other servers use different values. Cookie values, Authorization tokens, and custom headers all count toward that limit in bytes.
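As an illustration, a header line can be measured before it is sent. The sketch below assumes an 8 KB budget (a typical default, not a universal one) and percent-encodes the cookie value, which is a common but not mandatory convention; headerLineBytes and fitsHeaderLimit are hypothetical helper names.

```javascript
// Assumed budget; check the actual limit of the server you are targeting.
const ASSUMED_LIMIT_BYTES = 8 * 1024;

// Size in bytes of "Name: value" as it would appear on the wire.
function headerLineBytes(name, value) {
  return new TextEncoder().encode(`${name}: ${value}`).length;
}

function fitsHeaderLimit(name, value, limit = ASSUMED_LIMIT_BYTES) {
  return headerLineBytes(name, value) <= limit;
}

// Percent-encoding turns every non-ASCII UTF-8 byte into three ASCII bytes.
const cookie = `note=${encodeURIComponent("café 👋")}`;
console.log(cookie, headerLineBytes("Cookie", cookie), fitsHeaderLimit("Cookie", cookie));
```

Note how the six visible characters of the value grow to 24 bytes once encoded, which is why non-ASCII cookie data reaches the limit sooner than expected.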

SMS text messages. A standard SMS segment carries 140 bytes of payload. That is 160 characters in the GSM 7-bit encoding (which covers basic Latin, digits, some symbols, and a few accented letters such as é and ü), but only 70 characters when any character falls outside that set and forces the message into UCS-2 (the 16-bit predecessor of UTF-16). Text containing emoji, Arabic, Chinese, or accented letters missing from the GSM set dramatically reduces how much fits in one segment, and each emoji above U+FFFF uses two of the 70 units.
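A hedged sketch of the single-segment check described above. The character tables are transcribed from the GSM 03.38 default alphabet and its extension table (extension characters cost two septets); national shift tables and multi-part messages are ignored, and the function name smsSingleSegmentFit is made up for this example.

```javascript
// GSM 7-bit default alphabet (printable characters plus LF and CR).
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡" +
  "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
// Extension table: each of these needs an escape septet, so it counts twice.
const GSM7_EXTENDED = "^{}\\[~]|€";

function smsSingleSegmentFit(text) {
  let septets = 0;
  for (const ch of text) {
    if (GSM7_BASIC.includes(ch)) {
      septets += 1;
    } else if (GSM7_EXTENDED.includes(ch)) {
      septets += 2;
    } else {
      // Any other character forces UCS-2, counted in 16-bit units.
      return { encoding: "UCS-2", units: text.length, limit: 70, fits: text.length <= 70 };
    }
  }
  return { encoding: "GSM-7", units: septets, limit: 160, fits: septets <= 160 };
}

console.log(smsSingleSegmentFit("Hello"));
console.log(smsSingleSegmentFit("Hi 👋"));
```

A single emoji is enough to switch the whole message to UCS-2, which is why the limit drops from 160 to 70 rather than by a few characters.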

File name limits. On most Linux/macOS filesystems, file names are limited to 255 bytes, not 255 characters. A filename in Chinese using 3-byte UTF-8 characters would be capped at 85 characters, not 255.
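When text has to fit a byte budget such as the 255-byte file name limit, it needs to be cut on a character boundary rather than in the middle of a multi-byte sequence. The following is one possible sketch; truncateToBytes is a hypothetical helper, and it keeps code points whole but may still split a multi-code-point emoji sequence.

```javascript
// Keep as many whole code points as fit within maxBytes of UTF-8.
function truncateToBytes(text, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let out = "";
  for (const ch of text) { // iterates by code point
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

// 100 three-byte characters cut to a 255-byte budget leaves 85 of them.
console.log([...truncateToBytes("中".repeat(100), 255)].length);
```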

API limits. Some APIs that advertise a “500 character” limit actually measure bytes. Others count code points or UTF-16 code units; Twitter, for example, has counted normalized code points rather than bytes. When in doubt, measure bytes, since that is the largest of the three counts.

Worked examples

Text	Characters (code points)	JS .length	UTF-8 bytes
`Hello`	5	5	5
`café`	4	4	5 (é = 2 bytes)
`こんにちは`	5	5	15 (each = 3 bytes)
`👋🌍`	2	4 (surrogates)	8 (each emoji = 4 bytes)
`👨‍👩‍👧`	5 (1 visible glyph)	8	18 (ZWJ sequence)

The last row is especially striking: a family emoji that looks like one character is actually five code points (three emoji joined by two zero-width joiners), consuming 18 UTF-8 bytes.
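The gap between what is visible and what is stored can be checked directly. This sketch assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js); graphemeCount is a name chosen for this example, and the exact grouping depends on the runtime's Unicode data.

```javascript
// Man + ZWJ + woman + ZWJ + girl, written with escapes so the joiners are visible.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// Count user-perceived characters (grapheme clusters).
function graphemeCount(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

console.log({
  graphemes: graphemeCount(family),
  codePoints: [...family].length,
  codeUnits: family.length,
  bytes: new TextEncoder().encode(family).length,
});
```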

The JavaScript .length trap

In JavaScript, string.length returns the number of UTF-16 code units, not characters or bytes. For ASCII text these are the same. For emoji and other characters above U+FFFF:

"😀".length                            // → 2  (surrogate pair)
[..."😀"].length                       // → 1  (code points)
"😀".codePointAt(0)                    // → 128512  (U+1F600, the real code point)
new TextEncoder().encode("😀").length  // → 4  (UTF-8 bytes)

This is why this tool reports all three counts. When you see “max 140 characters” in a spec and are unsure which kind of characters they mean, test with an emoji: if the remaining allowance drops by 2 for each emoji, they are counting UTF-16 code units; if it drops by 1, they are counting code points; if it drops by 4, they are counting bytes.
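The emoji test can be written down as a lookup. This is only a sketch: guessCountingModel is a hypothetical helper, and systems that weight certain characters can produce a drop that mimics one of these models.

```javascript
// Map the observed drop per probe character to the counting model it implies.
function guessCountingModel(observedDrop, probe = "😀") {
  const costs = {
    "code points": [...probe].length,
    "UTF-16 code units": probe.length,
    "UTF-8 bytes": new TextEncoder().encode(probe).length,
  };
  const match = Object.entries(costs).find(([, cost]) => cost === observedDrop);
  return match ? match[0] : "unknown";
}

console.log(guessCountingModel(2));
```

The probe must be a character above U+FFFF; with an ASCII letter all three costs are 1 and the test cannot tell the models apart.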

Notes

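The byte count is exact: it comes from the native TextEncoder, which encodes UTF-8 as the Unicode standard defines it. One edge case is an unpaired surrogate in a JavaScript string, which is not valid Unicode; TextEncoder replaces it with U+FFFD, which takes 3 bytes. All processing happens locally in your browser; the text you enter is never uploaded.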