The Robots Exclusion Protocol is how a site tells crawlers what they may fetch
and what may appear in a search index. It splits across three surfaces — the
robots.txt file, the HTML <meta name="robots"> tag, and the X-Robots-Tag
HTTP response header — and not every directive is supported everywhere. This
reference lists each directive, what it does, and which engines honour it.
How it works
Crawling and indexing are two separate stages. robots.txt is consulted
before a URL is fetched, so it governs crawling only. Directives like
noindex and nofollow are read after the page is fetched, from the meta tag
or the X-Robots-Tag header, so they govern what happens to the content once a
bot already has it.
A critical consequence: a page that is disallowed in robots.txt is never
fetched, so the bot never sees a noindex tag inside it. Such a page can still
appear in results as a bare URL, typically because other pages link to it. To
reliably remove a page from an index, leave it crawlable and serve a noindex
directive.
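The interaction between the two stages can be sketched in Python. This is a minimal illustration under stated assumptions, not how any particular engine is implemented: the helper indexing_outcome, the class RobotsMetaParser, the bot name ExampleBot, and the example.com URL are all hypothetical, and only the standard library's urllib.robotparser and html.parser modules are used.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def indexing_outcome(robots_txt, url, html, headers=None, agent="ExampleBot"):
    """Hypothetical helper: returns 'blocked', 'noindex', or 'indexable'."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    # Stage 1 (crawling): a disallowed URL is never fetched, so nothing
    # inside the page, including a noindex tag, is ever read.
    if not rules.can_fetch(agent, url):
        return "blocked"

    # Stage 2 (indexing): directives come from the header or the meta tag.
    # Simplification: the header key is matched case-sensitively and
    # per-bot prefixes in the header value are not handled.
    directives = set()
    header = (headers or {}).get("X-Robots-Tag", "")
    directives.update(d.strip().lower() for d in header.split(",") if d.strip())

    meta = RobotsMetaParser()
    meta.feed(html)
    directives |= meta.directives

    return "noindex" if "noindex" in directives else "indexable"
```

With a robots.txt that disallows /private/, a page under that path comes back as "blocked" even when its HTML carries a noindex tag, which is the trap described above. The same page under a permissive robots.txt comes back as "noindex".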
Major engines extend the original 1994 standard with pattern matching: * for
any character sequence and $ to anchor the URL end. Google and Bing honour
both, and RFC 9309 (2022) specifies them as special characters.
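The two wildcards can be illustrated with a small matcher. This is a sketch only: pattern_to_regex and path_matches are hypothetical names, the translation to a regular expression is one possible approach rather than any engine's actual implementation, and it ignores details such as percent-encoding and rule precedence.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match at the end of the URL. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """Hypothetical helper: True if the rule pattern applies to the path."""
    # re.match anchors at the start, mirroring the prefix matching of rules.
    return pattern_to_regex(pattern).match(path) is not None
```

Under these assumptions, path_matches("/*.pdf$", "/docs/report.pdf") is true, while the same pattern does not match "/docs/report.pdf?download=1" because $ requires the URL to end right after ".pdf".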
Tips and examples
- To keep a page out of search, use noindex (meta or header), not Disallow.
- Disallow: with an empty value allows everything; Disallow: / blocks the whole
  site for that user-agent.
- Combine directives in one tag: <meta name="robots" content="noindex, nofollow">.
- Use X-Robots-Tag for non-HTML files (PDFs, images) where you cannot add a meta
  tag.
- Always test changes in Google Search Console’s robots.txt tester before
  deploying: a stray Disallow: / can deindex an entire site.
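The difference between an empty Disallow and Disallow: / can be checked locally with Python's standard urllib.robotparser before a file is deployed. This is a rough sketch: the helper allowed and the bot names ExampleBot and BadBot are hypothetical, and the standard-library parser is only an approximation of how a given search engine reads the file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Hypothetical helper: parse robots.txt text and check one path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow value allows everything.
ALLOW_ALL = "User-agent: *\nDisallow:"
# A single slash blocks the whole site for the matching user-agent.
BLOCK_ALL = "User-agent: *\nDisallow: /"
# Rules apply per user-agent group: only BadBot is blocked here.
BLOCK_ONE = "User-agent: BadBot\nDisallow: /\n\nUser-agent: *\nDisallow:"

print(allowed(ALLOW_ALL, "/any/page.html"))            # True
print(allowed(BLOCK_ALL, "/any/page.html"))            # False
print(allowed(BLOCK_ONE, "/any/page.html", "BadBot"))  # False
print(allowed(BLOCK_ONE, "/any/page.html"))            # True
```

A check like this catches a stray Disallow: / early, but it does not replace testing against the engine's own tooling.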