Look up character encodings and their aliases
A character set, or charset, is the mapping between the bytes stored or transmitted and the characters a person reads. Get it wrong and text turns into mojibake — garbled symbols where accented letters or CJK characters should be. This reference covers the charsets in the IANA registry that you actually meet on the web and in files, with each one’s canonical name, the preferred MIME name to put in headers, its aliases, and the spec that defines it.
How it works
Every charset has a canonical name registered with IANA and a preferred MIME name that protocols should emit. Many also carry historical aliases — for instance latin1, l1 and cp819 all resolve to ISO-8859-1, and cp1252 resolves to windows-1252. This tool searches all of those, so typing a legacy alias takes you to the canonical encoding. To declare an encoding you place it in an HTTP header or an HTML tag:
Content-Type: text/html; charset=UTF-8
<meta charset="UTF-8">
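As a rough sketch of both ideas, the Python below pulls the charset parameter out of a Content-Type value and checks whether two labels resolve to the same codec. It uses Python's own codec registry as a stand-in for the IANA table, which is an assumption: Python's internal names (for example iso8859-1) are not the IANA canonical names, and the helper names charset_from_content_type and same_encoding are made up for this illustration.

```python
import codecs
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def same_encoding(label_a: str, label_b: str) -> bool:
    """True if both labels resolve to the same codec in Python's registry."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name


print(charset_from_content_type("text/html; charset=UTF-8"))

# Legacy aliases named in the text, checked against their canonical labels.
for alias in ("latin1", "l1", "cp819"):
    print(alias, same_encoding(alias, "ISO-8859-1"))
print("cp1252", same_encoding("cp1252", "windows-1252"))
```

codecs.lookup raises LookupError for a label Python does not know, so real input would need that case handled.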
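The byte width differs by family: UTF-8 is variable (1-4 bytes per character) and ASCII-compatible; UTF-16 uses 16-bit code units, with a surrogate pair (two units, 4 bytes) for characters outside the Basic Multilingual Plane; the ISO-8859 and Windows code pages are single-byte; and the East-Asian encodings (Shift_JIS, GBK, GB18030, Big5, EUC-KR) are multi-byte, typically 1 byte for ASCII and 2 for most other characters, with GB18030 extending to 4.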
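These widths are easy to confirm by encoding a few characters and counting bytes. The snippet below is a minimal sketch using Python's built-in codecs; the helper name encoded_length and the sample characters are chosen only for illustration.

```python
def encoded_length(char: str, charset: str) -> int:
    """Number of bytes one character occupies in the given charset."""
    return len(char.encode(charset))


# (character, charset) pairs covering each family mentioned above.
samples = [
    ("A", "utf-8"),
    ("\u00e9", "utf-8"),          # e with acute accent
    ("\u20ac", "utf-8"),          # euro sign
    ("\U0001F600", "utf-8"),      # emoji outside the BMP
    ("A", "utf-16-le"),
    ("\U0001F600", "utf-16-le"),  # needs a surrogate pair
    ("\u00e9", "iso-8859-1"),
    ("\u20ac", "windows-1252"),
    ("\u3042", "shift_jis"),      # hiragana a
    ("\u4e2d", "gbk"),            # CJK ideograph
    ("\u4e2d", "big5"),
    ("\ud55c", "euc-kr"),         # hangul syllable
    ("\U0001F600", "gb18030"),
]

for char, charset in samples:
    print(f"U+{ord(char):04X} in {charset}: {encoded_length(char, charset)} byte(s)")
```

The "-le" suffix on utf-16-le keeps Python from prepending a byte order mark, which would otherwise add two bytes to every count.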
Tips and notes
- Always prefer UTF-8 for new content. It is the web default and covers all of Unicode. The other entries here are mainly for reading legacy data.
- Use the preferred MIME name in headers, not an alias, so every client decodes consistently.
- Content labelled iso-8859-1 is decoded as windows-1252 by HTML5 decoders. windows-1252 matches ISO-8859-1 everywhere except 0x80-0x9F, where it replaces the C1 control codes with printable characters, which is why smart quotes and the euro sign survive.
- GB18030 can encode every Unicode code point, and support for it is mandated for software sold in China, so it is the safe legacy choice for Simplified Chinese.
- Fixing mojibake means decoding the bytes with their original charset and re-saving as UTF-8 — do not just re-label the file, or the bytes stay wrong.
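
The last tip can be sketched in a few lines of Python. This assumes you already know the original charset (windows-1252 here); the function name transcode_to_utf8 and the sample text are hypothetical, chosen only to show the difference between transcoding and re-labelling.

```python
def transcode_to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes with their original charset, then re-encode as UTF-8."""
    text = raw.decode(source_charset)
    return text.encode("utf-8")


# Text with smart quotes and a euro sign, saved by a legacy Windows program.
original_text = "\u201cPrice: \u20ac5\u201d"
legacy_bytes = original_text.encode("windows-1252")

# Re-labelling only: the same bytes are not valid UTF-8, so decoding fails.
try:
    legacy_bytes.decode("utf-8")
    relabel_worked = True
except UnicodeDecodeError:
    relabel_worked = False

# Transcoding: the bytes change, the text stays the same.
fixed_bytes = transcode_to_utf8(legacy_bytes, "windows-1252")

print("re-label worked:", relabel_worked)
print("round trip intact:", fixed_bytes.decode("utf-8") == original_text)
```

For files, the same two steps apply: read the bytes, decode them with the source charset, and write the result back out with UTF-8 as the encoding.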