What is the difference between a canonical name and a preferred MIME name?

The canonical name is how IANA officially registers the charset. The preferred MIME name is the label you should actually emit in a Content-Type header or meta charset tag. They are often identical, but not always — always use the preferred MIME name in protocols.

What charset should I use for new content?

UTF-8. It encodes every Unicode character, is backward-compatible with ASCII, and is the default the web platform and JSON assume. There is rarely a good reason to choose anything else for new work.

Is latin1 the same as ISO-8859-1?

Yes — latin1 is an alias of ISO-8859-1. Note that HTML5 decoders treat content labelled iso-8859-1 as windows-1252, which is a superset, so smart quotes and the euro sign still render.
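The difference is easiest to see on bytes in the 0x80-0x9F range. The following is a minimal sketch using Python's built-in codecs, where "cp1252" is Python's label for windows-1252 and "latin-1" is its strict ISO-8859-1 codec; the sample bytes are made up for illustration.

```python
# Hypothetical sample: curly quotes (0x93, 0x94) and a euro sign (0x80).
raw = b"\x93hi\x94 \x805"

# Decoded as windows-1252, the 0x80-0x9F bytes become printable characters.
as_cp1252 = raw.decode("cp1252")

# Decoded as strict ISO-8859-1, the same bytes become invisible C1 controls.
as_latin1 = raw.decode("latin-1")

# Outside that range the two encodings agree.
accented = b"caf\xe9"
same_elsewhere = accented.decode("cp1252") == accented.decode("latin-1")

print(repr(as_cp1252))
print(repr(as_latin1))
print(same_elsewhere)
```

A strict ISO-8859-1 decoder would lose the quotes and the euro sign, which is the practical reason browsers map the label to windows-1252.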

Why does my Japanese or Chinese text show mojibake?

The bytes were encoded with one charset (e.g. Shift_JIS or GBK) but decoded as another (often UTF-8 or windows-1252). Identify the original encoding, decode with it, then re-save as UTF-8 to fix it permanently.
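As a rough illustration of that repair, the sketch below assumes the original encoding is already known to be Shift_JIS. The sample text and variable names are hypothetical, and detecting an unknown encoding is a separate problem not shown here.

```python
original = "日本語のテキスト"

# The bytes as they were actually stored: Shift_JIS, not UTF-8.
stored = original.encode("shift_jis")

# Decoding with the wrong charset produces mojibake.
garbled = stored.decode("cp1252", errors="replace")

# Strict UTF-8 decoding rejects these bytes outright.
try:
    stored.decode("utf-8")
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False

# The fix: decode with the original charset, then re-encode as UTF-8.
repaired = stored.decode("shift_jis")
utf8_bytes = repaired.encode("utf-8")

print(garbled)
print(repaired)
```

Note that the byte sequence itself changes between `stored` and `utf8_bytes`; only the decoded text stays the same.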

Does this tool transmit anything I type?

No. Searching, filtering and copying all happen locally in your browser. It is a static reference of the IANA registry with no network calls.

What is the Character Set (Charset) Reference?

A searchable reference for the IANA character sets registry (UTF-8, UTF-16, ISO-8859-1 (latin1), windows-1252, Shift_JIS, GB18030, Big5 and more) with each charset's canonical name, preferred MIME name, aliases and defining spec. It maps legacy aliases to canonical names and runs free in your browser on Gera Tools, with nothing uploaded.

Character Set (Charset) Reference

Name: Character Set (Charset) Reference
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Look up character encodings and their aliases

A character set, or charset, is the mapping between the bytes stored or transmitted and the characters a person reads. Get it wrong and text turns into mojibake — garbled symbols where accented letters or CJK characters should be. This reference covers the charsets in the IANA registry that you actually meet on the web and in files, with each one’s canonical name, the preferred MIME name to put in headers, its aliases, and the spec that defines it.

How it works

Every charset has a canonical name registered with IANA and a preferred MIME name that protocols should emit. Many also carry historical aliases — for instance latin1, l1 and cp819 all resolve to ISO-8859-1, and cp1252 resolves to windows-1252. This tool searches all of those, so typing a legacy alias takes you to the canonical encoding. To declare an encoding you place it in an HTTP header or an HTML tag:

Content-Type: text/html; charset=UTF-8

<meta charset="UTF-8">
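Reading a declared charset and resolving an alias can be sketched with Python's standard library. This is only an illustration: Python's codec registry has its own alias table and its own internal names, which are neither the IANA canonical names nor the preferred MIME names, and the helper names below are hypothetical.

```python
import codecs
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the label lower-cased, or None if absent.
    return msg.get_content_charset()


def same_encoding(label_a: str, label_b: str) -> bool:
    """True if Python resolves both labels to the same codec."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name


declared = charset_from_content_type("text/html; charset=UTF-8")
print(declared)

# Legacy aliases resolve to the same codec as the name they stand for.
print(same_encoding("latin1", "ISO-8859-1"))
print(same_encoding("cp819", "ISO-8859-1"))
print(same_encoding("cp1252", "windows-1252"))
```

`codecs.lookup` raises `LookupError` for a label Python does not know, so real code would need to handle that case.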

The byte width differs by family: UTF-8 is variable (1-4 bytes) and ASCII-compatible; UTF-16 uses 16-bit code units, with surrogate pairs for characters outside the Basic Multilingual Plane; the ISO-8859 and Windows code pages are single-byte; and the East-Asian encodings (Shift_JIS, GBK, GB18030, Big5, EUC-KR) are multi-byte, using one or two bytes per character, with GB18030 adding four-byte sequences.
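These widths can be checked directly. The sketch below is a small illustration with hand-picked sample characters; `width` is a hypothetical helper, and "utf-16-le" is used so that no byte order mark is counted.

```python
def width(ch: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


samples = ["A", "é", "€", "日", "😀"]

# UTF-8 grows from 1 to 4 bytes as code points get larger.
utf8_widths = [width(ch, "utf-8") for ch in samples]

# UTF-16 uses one 16-bit unit (2 bytes), or a surrogate pair (4 bytes).
utf16_widths = [width(ch, "utf-16-le") for ch in samples]

# Single-byte code pages and a double-byte East-Asian encoding.
latin1_width = width("é", "latin-1")
cp1252_width = width("€", "cp1252")
sjis_width = width("日", "shift_jis")

# ASCII text is byte-for-byte identical in UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")

print(utf8_widths, utf16_widths)
```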

Tips and notes

Always prefer UTF-8 for new content. It is the web default and covers all of Unicode. The other entries here are mainly for reading legacy data.
Use the preferred MIME name in headers, not an alias, so every client decodes consistently.
Content labelled iso-8859-1 is decoded as windows-1252 by HTML5 decoders. windows-1252 is a superset that assigns printable characters to the 0x80-0x9F range, which is why smart quotes and the euro sign survive.
GB18030 covers all of Unicode and is mandated for software sold in China, so it is the safe legacy choice for Simplified Chinese.
Fixing mojibake means decoding the bytes with their original charset and re-saving as UTF-8 — do not just re-label the file, or the bytes stay wrong.
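The GB18030 note above can be illustrated by comparing it with GBK on a character outside GBK's repertoire. This is a minimal sketch using Python's built-in codecs; `can_encode` is a hypothetical helper and the sample strings are arbitrary.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character in text is representable in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


plain = "汉字"
with_emoji = "汉字 😀"

# Both encodings handle common Simplified Chinese characters.
print(can_encode(plain, "gbk"), can_encode(plain, "gb18030"))

# Only GB18030 reaches characters outside GBK's repertoire.
print(can_encode(with_emoji, "gbk"), can_encode(with_emoji, "gb18030"))

# GB18030 round-trips the text, using a four-byte sequence for the emoji.
round_trip = with_emoji.encode("gb18030").decode("gb18030")
emoji_width = len("😀".encode("gb18030"))
```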