What is a byte-order mark?

A byte-order mark (BOM) is the Unicode character U+FEFF placed at the very start of a text stream. Its encoded bytes reveal the encoding and, for UTF-16 and UTF-32, whether the data is big-endian or little-endian.

Should I add a BOM to UTF-8 files?

Usually no. UTF-8 has no byte order to mark, and a leading EF BB BF can break shell scripts, strict JSON parsers, PHP output and CSV imports. Most tools expect UTF-8 without a BOM, so omit it unless a specific program (typically on Windows) requires one.

How do UTF-16 LE and BE differ?

UTF-16 little-endian starts with the bytes FF FE and stores the low byte of each unit first; big-endian starts with FE FF and stores the high byte first. The BOM lets a reader detect the order so the rest of the file is decoded correctly.
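
As a small illustration of the two byte orders, the following Python sketch encodes the letter A both ways and then lets the generic "utf-16" codec pick the order from the BOM. The variable names are only for this example.

```python
text = "A"  # U+0041

le = text.encode("utf-16-le")  # low byte first, no BOM added
be = text.encode("utf-16-be")  # high byte first, no BOM added
print(le.hex(" "))  # 41 00
print(be.hex(" "))  # 00 41

# The generic "utf-16" codec reads the leading BOM, uses it to choose
# the byte order, and drops it from the decoded text.
from_le = (b"\xff\xfe" + le).decode("utf-16")
from_be = (b"\xfe\xff" + be).decode("utf-16")
print(from_le, from_be)  # A A
```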

Why might a file show a wrong character at the start?

A stray BOM that the application did not strip shows up either as the visible sequence ï»¿ (a UTF-8 BOM misread as Latin-1) or, when the text is decoded correctly but the BOM is left in place, as an invisible or blank character. Detecting and removing the BOM, or telling the parser to expect one, fixes it.

Can UTF-8 and UTF-16 BOMs be confused?

Not if you check enough bytes. A UTF-8 BOM is three bytes EF BB BF, while UTF-16 BOMs are two bytes (FF FE or FE FF) and the UTF-32 LE BOM is FF FE 00 00. Always test the longer sequences first so UTF-32 LE is not mistaken for UTF-16 LE.

What is the BOM Byte-Order Mark Reference?

It is a reference for text-encoding byte-order marks, with hex byte sequences and encoding identification rules. Paste the leading bytes of a file and detect which BOM, if any, it starts with. It runs free in your browser on Gera Tools, with nothing uploaded.

BOM Byte-Order Mark Reference

Name: BOM Byte-Order Mark Reference
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Identify a file by its first bytes

A byte-order mark (BOM) is the Unicode code point U+FEFF written at the start of a file. Because that single code point encodes to different byte patterns in different encodings, the leading bytes act as a signature: they tell a reader whether the text is UTF-8, UTF-16 or UTF-32, and for UTF-16 and UTF-32, which endianness. This reference lists the common BOMs with their exact hex bytes and lets you paste a file’s opening bytes to detect which one it is.
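
To see how one code point yields several signatures, this minimal Python sketch encodes U+FEFF with each codec and prints the bytes. The helper name bom_bytes is made up for the example; the codec names are Python's standard ones.

```python
def bom_bytes(codec: str) -> bytes:
    """Encode U+FEFF with the given codec and return the raw bytes."""
    return "\ufeff".encode(codec)

# The explicit -le / -be codecs add no BOM of their own, so the output
# is exactly the encoded form of U+FEFF.
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, bom_bytes(codec).hex(" ").upper())
```

The printed bytes should match the table in the next section.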

Complete BOM reference

Encoding	Hex bytes	Notes
UTF-8	EF BB BF	Optional in UTF-8; widely discouraged
UTF-16 LE	FF FE	Preferred UTF-16 form on Windows
UTF-16 BE	FE FF	Network byte order
UTF-32 LE	FF FE 00 00	Test before UTF-16 LE to avoid confusion
UTF-32 BE	00 00 FE FF	Rarely used
UTF-7	2B 2F 76	Followed by 38, 39, 2B, or 2F

How the detector works

The detector reads your hex bytes and tests them against the known signatures, longest first, so it cannot mistake a four-byte UTF-32 LE mark (FF FE 00 00) for a two-byte UTF-16 LE mark (FF FE). If the bytes match a signature, the encoding and endianness are reported; if they do not, the input has no recognised BOM, which for UTF-8 is the normal, recommended state. Endianness matters because UTF-16 and UTF-32 store each code unit across multiple bytes, and the BOM indicates whether the most or least significant byte comes first.
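
The longest-first matching described above could look roughly like the Python sketch below. It is not the tool's actual source: the names SIGNATURES and detect_bom are hypothetical, and UTF-7 is left out to keep the example short.

```python
from typing import Optional

# (signature bytes, label), ordered longest first so that FF FE 00 00
# is tested before the shorter FF FE prefix.
SIGNATURES = [
    (b"\xff\xfe\x00\x00", "UTF-32 LE"),
    (b"\x00\x00\xfe\xff", "UTF-32 BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16 LE"),
    (b"\xfe\xff", "UTF-16 BE"),
]

def detect_bom(hex_input: str) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none.

    hex_input is a string of hex byte pairs such as "EF BB BF 7B";
    whitespace between pairs is ignored by bytes.fromhex.
    """
    data = bytes.fromhex(hex_input)
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None

print(detect_bom("FF FE 00 00"))  # UTF-32 LE
print(detect_bom("FF FE 41 00"))  # UTF-16 LE
print(detect_bom("7B 22"))        # None
```

If the two UTF-16 entries were moved to the top of the list, the first call would report UTF-16 LE instead, which is the confusion that the ordering prevents.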

Why “weird first character” bugs happen

A BOM that the reading application does not strip appears as visible garbage at the start of the content. A UTF-8 BOM (EF BB BF) read as Latin-1 or Windows-1252 displays as the three-character sequence ï»¿. A UTF-16 LE BOM (FF FE) read as Windows-1252 shows up as ÿþ. These are among the most common character-encoding bugs in file interchange.
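
These garbled sequences are easy to reproduce. The Python sketch below decodes the BOM bytes with the wrong codec, and then with the right codec but without stripping the mark. The variable names are illustrative.

```python
utf8_bom = b"\xef\xbb\xbf"
utf16le_bom = b"\xff\xfe"

# Wrong codec: each BOM byte becomes one visible character.
as_latin1 = utf8_bom.decode("latin-1")
as_cp1252 = utf8_bom.decode("cp1252")
utf16_as_cp1252 = utf16le_bom.decode("cp1252")
print(as_latin1, as_cp1252, utf16_as_cp1252)  # ï»¿ ï»¿ ÿþ

# Right codec, BOM not stripped: Python's plain "utf-8" codec keeps
# U+FEFF as an invisible first character of the decoded text.
kept = (utf8_bom + b"hi").decode("utf-8")
print(len(kept))  # 3
```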

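RFC 8259 says implementations must not add a BOM to the start of JSON text sent over a network, and it allows, but does not require, parsers to ignore one, so strict parsers reject a leading BOM as a syntax error. In PHP, a BOM before the opening <?php tag is emitted as output, which can trigger "headers already sent" errors when the script later sets HTTP headers and output buffering is off. Shebang lines (#!/usr/bin/env python3) on Unix are broken by a leading BOM because the kernel expects the file to begin with the two bytes #! and does not recognise EF BB BF #! as a valid interpreter directive.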
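
Python's standard json module is one example of a strict parser: when given a str that starts with U+FEFF it raises an error. The sketch below shows that behaviour and one way around it, decoding with the "utf-8-sig" codec, which drops a leading UTF-8 BOM if present. The helper name parses is made up for the example.

```python
import json

raw = b'\xef\xbb\xbf{"ok": true}'  # UTF-8 BOM followed by JSON

def parses(text: str) -> bool:
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

with_bom = raw.decode("utf-8")      # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM removed during decoding

print(parses(with_bom))     # False
print(parses(without_bom))  # True
```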

Practical guidance

Prefer UTF-8 without a BOM for source code, JSON, CSV, shell scripts, and anything consumed by Unix tools.
When targeting Windows tools that rely on it (some versions of Excel, older versions of Notepad), adding a UTF-8 BOM can help them detect the encoding correctly. Do it deliberately and document it.
For UTF-16 and UTF-32, always include a BOM or agree the byte order out of band, since there is no other reliable runtime way to detect it.
When debugging encoding issues, paste the first few bytes into this tool before trying anything else.
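
As a rough Python sketch of the guidance above: write UTF-8 without a BOM by default, opt in with the "utf-8-sig" codec when a consumer needs one, and strip a stray BOM from bytes you receive. The helper name strip_utf8_bom and the sample CSV text are assumptions for the example.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

sample = "id,name\n"

plain = sample.encode("utf-8")         # default: no BOM
for_excel = sample.encode("utf-8-sig") # deliberate BOM for tools that need it
utf16 = sample.encode("utf-16")        # generic codec prepends a BOM

print(plain[:3].hex(" "))      # 69 64 2c
print(for_excel[:3].hex(" "))  # ef bb bf
print(strip_utf8_bom(for_excel) == plain)  # True
```

The generic "utf-16" codec writes its BOM in the platform's native byte order, so the first two bytes of utf16 depend on the machine; use "utf-16-le" or "utf-16-be" and add the BOM yourself when the order must be fixed.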