What is robots.txt?
A robots.txt file is a plain text file at the root of a website that tells compliant crawlers which URL paths they may request. It manages crawling, not privacy, security or removal. Blocking a path does not delete a page or guarantee it stays out of search results, because a blocked URL can still be listed if other pages link to it. The rules were standardised as RFC 9309 in September 2022. For genuinely private content, use authentication or removal, not robots.txt.
What this means
Think of robots.txt as a sign at the reception desk that tells visiting robots which corridors to avoid. It is a request, not a locked door. Well-behaved crawlers from the major search engines read the file before they fetch anything else and respect what it says. Crawlers that choose to misbehave can ignore it entirely.
The file lives at one fixed location, the root of each host, for example https://www.example.com/robots.txt. It controls requests, not possession. It can ask a crawler not to fetch a page, but it cannot stop a page being known about, and it cannot remove a page that has already been indexed. If another site links to a blocked URL, that URL can still appear in results, usually with no description because the crawler was never allowed to read the contents.
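Because the location is fixed, the robots.txt URL that governs any page can be derived from the page URL alone. The following is a minimal Python sketch of that derivation; the function name robots_url and the sample URL are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and the host (with any port); replace the path and
    # drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/shop/shoes?colour=red"))
# prints https://www.example.com/robots.txt
```

Note that the scheme, host and port all carry through unchanged, so a page on a different subdomain or port resolves to a different robots.txt file.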
Why it matters
For most small sites, robots.txt is something you set once and rarely touch. It becomes important as a site grows, because content management systems generate large numbers of near-duplicate and utility URLs, such as filtered category views, internal search results and printable versions. Left unmanaged, these can absorb a crawler's time on pages that bring no value.
The bigger commercial point is that robots.txt is frequently misunderstood, and the misunderstandings cause real harm. Teams use it to try to hide staging sites, to remove pages from results, or to keep content private, and none of those things works reliably. A single careless line can also block an entire site from being crawled, which can quietly remove it from search. Knowing what the file does, and what it does not do, prevents both accidental self-sabotage and false confidence.
How it works
A standard, not just a convention
The robots exclusion method was first proposed by Martijn Koster in 1994 and was used informally for decades. It was formally standardised as RFC 9309, the Robots Exclusion Protocol, published by the Internet Engineering Task Force in September 2022. The standard sets out how the file should be parsed, how errors should be handled and how the file may be cached. Crawlers should not rely on a cached copy for more than 24 hours, and they must be able to parse at least 500 kibibytes of the file. The standard is explicit that the protocol is not a substitute for valid content security measures, and that listing paths in the file exposes them publicly, which makes them easier to discover.
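The two numeric limits above can be expressed as simple checks. This is a rough Python sketch of how a crawler might apply them, not an implementation taken from the standard; the names cache_is_fresh and parseable_prefix are hypothetical, and a real crawler may choose a larger parse limit than the 500 kibibyte minimum.

```python
import time
from typing import Optional

MAX_CACHE_AGE_SECONDS = 24 * 60 * 60   # 24 hours
MIN_PARSE_BYTES = 500 * 1024           # 500 kibibytes = 512,000 bytes


def cache_is_fresh(fetched_at: float, now: Optional[float] = None) -> bool:
    """True while a cached robots.txt copy is no more than 24 hours old."""
    if now is None:
        now = time.time()
    return (now - fetched_at) <= MAX_CACHE_AGE_SECONDS


def parseable_prefix(body: bytes, limit: int = MIN_PARSE_BYTES) -> bytes:
    """Return the part of the file this sketch's crawler would parse.

    Assumption: the crawler stops at exactly the minimum the standard requires.
    """
    return body[:limit]
```

For example, cache_is_fresh(0, now=3600) is true for a copy fetched an hour ago, while a copy fetched 25 hours ago should be refetched.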
Scope, user agents and rules
The rules in a robots.txt file apply only to the host, protocol and port where the file is served. A file at https://www.example.com/robots.txt does not govern a subdomain such as shop.example.com, nor the http version if your canonical site is https. Inside the file, you group rules by user agent, the name a crawler announces itself by. Each group contains Allow and Disallow rules that name URL paths. When more than one rule matches a URL, the most specific, longest matching rule wins.
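The longest-match rule is easier to see in code. The sketch below is a simplified Python illustration that handles plain path prefixes only and ignores wildcards; the function name is_allowed and the sample rules are assumptions made for the example. It breaks ties in favour of Allow, which is our reading of RFC 9309.

```python
def is_allowed(rules, path):
    """Decide whether path may be crawled under one user agent group.

    rules is a list of (directive, pattern) pairs such as
    ("Disallow", "/shop/"). Only plain prefixes are handled here.
    """
    best_length = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # an empty pattern matches nothing
        is_allow = directive.lower() == "allow"
        longer = len(pattern) > best_length
        tie_won_by_allow = len(pattern) == best_length and is_allow
        if longer or tie_won_by_allow:
            best_length = len(pattern)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/shop/"), ("Allow", "/shop/sale/")]
print(is_allowed(rules, "/shop/sale/boots"))  # True: the longer Allow rule wins
print(is_allowed(rules, "/shop/cart"))        # False: only the Disallow matches
print(is_allowed(rules, "/about"))            # True: no rule matches
```

The order of the lines in the file does not affect the outcome in this model; only the length of the matching pattern does.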
The sitemap line
You can include one or more Sitemap lines pointing to the full URL of an XML sitemap. This is supported by the major search engines and is a convenient way to advertise where your sitemap lives, although it does not force anything to be crawled or indexed.
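As a small illustration, the Python standard library can read Sitemap lines from a robots.txt body. The sketch below assumes Python 3.8 or later, where RobotFileParser.site_maps() is available; the file contents and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body with two Sitemap lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap lines.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```

The Sitemap lines are read independently of the user agent groups, so their position in the file does not matter to this parser.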
How it interacts with other controls
Crawling and indexing are separate steps, and robots.txt only touches the first. This creates a well known sequencing trap. If you want a page removed from results, you apply a noindex instruction, but the crawler has to be allowed to fetch the page to see that instruction. If you block the same URL in robots.txt, the crawler never reads the page, never sees the noindex, and the URL can linger in results. The correct order is to allow crawling, let the noindex be seen and the page drop out, and only then consider blocking the path if you want to save crawl effort.
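The sequencing can be written down as a small decision helper. This is an illustrative Python sketch using the standard library's robots.txt parser; the function removal_step, the advice strings and the sample URLs are all hypothetical, and the parser is a simplified stand-in for how a search engine evaluates the file.

```python
from urllib.robotparser import RobotFileParser

UNBLOCK = "remove the robots.txt block so the noindex can be read"
ADD_NOINDEX = "add noindex to the page"
WAIT = "wait for the page to drop out, then consider blocking the path"


def removal_step(robots_txt, user_agent, url, has_noindex):
    """Suggest the next step for getting url out of search results."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # A blocked crawler never fetches the page, so noindex is invisible.
        return UNBLOCK
    if not has_noindex:
        return ADD_NOINDEX
    return WAIT


blocked = "User-agent: *\nDisallow: /old/\n"
page = "https://www.example.com/old/page.html"
print(removal_step(blocked, "Googlebot", page, has_noindex=True))
```

With the block in place, the helper reports that the block must be lifted first, which is the trap described above.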
Managing AI crawlers
Since 2023, the major AI companies have published their own user agent names, so you can make decisions about them separately. Google introduced the Google-Extended control token in September 2023. Blocking Google-Extended asks Google not to use your content to train its generative models, but it does not remove you from Google Search and does not stop your content appearing in AI Overviews, because those draw on the normal search index controlled by Googlebot. OpenAI runs GPTBot for training and a separate agent for its search feature, so blocking one does not affect the other. Anthropic runs ClaudeBot and related agents, and Common Crawl runs CCBot. As with all robots.txt rules, these are honoured voluntarily. They are a clear signal to reputable operators, not an enforcement mechanism.
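To make the separation concrete, here is a hedged Python sketch that embeds a hypothetical robots.txt and checks it with the standard library parser. The file is only an example of one possible policy: it asks GPTBot, CCBot and the Google-Extended token to stay out while leaving everything else open. The standard library parser is a simplified model, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training crawlers, leave search open.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://www.example.com/blog/post"
for agent in ("GPTBot", "Google-Extended", "CCBot", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "asked to stay out"
    print(f"{agent}: {verdict}")
```

In this sketch Googlebot remains allowed, which matches the point that opting out of training does not remove a site from Search. The check says nothing about whether a given crawler will actually comply.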
Examples
A retailer with faceted navigation finds that filtered combinations such as colour, size and sort order generate tens of thousands of near-identical URLs. They use Disallow rules to keep crawlers away from the parameter-heavy paths, so that crawl effort concentrates on real product and category pages.
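Rules for parameter-heavy URLs usually rely on the two special characters that RFC 9309 and the major search engines support: * for any run of characters and $ for the end of the URL. The Python sketch below shows one way such a pattern could be matched; the function pattern_matches and the patterns /*?colour= and /*.pdf$ are illustrative assumptions, not the retailer's actual rules.

```python
import re


def pattern_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern that may use '*' and a trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with 'match anything'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without '$' the pattern is a prefix match; with it, the whole path
    # (including any query string) must match.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print(pattern_matches("/*?colour=", "/shoes?colour=red"))  # True
print(pattern_matches("/*?colour=", "/shoes"))             # False
print(pattern_matches("/*.pdf$", "/docs/guide.pdf"))       # True
print(pattern_matches("/*.pdf$", "/docs/guide.pdf?x=1"))   # False
```

A pattern like this should be tested against real URLs from the site before it goes live, since a wildcard that is too broad can block product and category pages as well.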
A business launches a new section and wants old, thin placeholder pages out of search. They first confirm those URLs are not blocked in robots.txt, apply noindex, wait for the pages to drop out, then add a Disallow rule to keep crawlers off the path in future.
A marketing team wants its blog to be quotable by AI search tools but does not want the wider site used for model training. They allow the search and retrieval agents, opt out of training through Google-Extended and explicit Disallow rules on training crawlers, and accept that a firewall is needed for any crawler that ignores the file.
Common misunderstandings
The most common mistake is believing robots.txt hides content. It does not. A blocked URL can still be listed, without a useful description, if it is linked from elsewhere.
A second is assuming that Disallow means de-index. It does not. Blocking crawling can actually prevent a page being removed, because the crawler cannot see the noindex instruction.
A third is thinking one file governs everything. It governs a single host, protocol and port, so subdomains and protocol variants need their own files.
A fourth is the idea that more blocking is always safer. Over-blocking can hide important pages, break rendering when CSS or JavaScript files are blocked, and cut a site off from search.
Risks and boundaries
The headline risk is accidental self-sabotage. A stray Disallow: / can remove a site from search, and because robots.txt is rarely reviewed, the error can go unnoticed.
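A simple automated check can catch this before a release. The following Python sketch is a rough lint, not a full parser: it flags a bare Disallow: / in any group that applies to all crawlers and does not consider Allow rules that might partly offset it. The function name blocks_everything is hypothetical.

```python
def blocks_everything(robots_txt: str) -> bool:
    """True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents = []       # user agents named by the current group
    in_rules = False  # becomes True once the group's rules have started
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents = []
                in_rules = False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


print(blocks_everything("User-agent: *\nDisallow: /\n"))        # True
print(blocks_everything("User-agent: *\nDisallow: /search\n"))  # False
```

Running a check like this in a deployment pipeline is one way to make sure a staging configuration does not reach production unnoticed.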
The second is confusing crawl management with access control. Listing a path in robots.txt advertises that the path exists, so it is the wrong tool for anything sensitive. Standards bodies, including the US National Institute of Standards and Technology, advise against relying on secrecy of this kind, since system security should not depend on hiding the existence of a resource.
The third is the sequencing risk between noindex and a robots.txt block, described above.
The fourth is false confidence about AI bots. Reputable crawlers respect the file, but it cannot force compliance. In August 2025, Cloudflare reported that Perplexity was using stealth, undeclared crawlers to evade no-crawl directives, including a generic browser user agent that impersonated Google Chrome on macOS when its declared crawler was blocked, after which Cloudflare removed Perplexity from its list of verified bots. Treat robots.txt as a polite, public request and use server-side controls where enforcement matters.
What to do next
Start by inventorying which sections of your site you want crawled and which you want crawlers to stay out of, and be honest that keeping a section out of crawling is not the same as keeping it private or keeping it out of results.
Review your crawl controls together rather than in isolation, because robots.txt, sitemaps, canonical tags and noindex interact. Give one named owner accountability for the file, keep it short and commented, and retest it before any major launch or migration. For anything genuinely confidential, use authentication or a restricted environment, never robots.txt alone.
FAQs
Should I block my internal search results pages in robots.txt?
You can, and many large sites do, because internal search pages create low-value, near-duplicate URLs. If those pages are already indexed, allow crawling and apply noindex first so they drop out, then block the path. Blocking alone can leave them stuck in results.
Can robots.txt stop AI tools using my content?
Only the ones that choose to comply. The major AI companies publish user agent names you can allow or block, and Google-Extended lets you opt out of Google's generative model training while staying in Search. None of this is enforceable, so use server-side controls for anything you must prevent.
If I block a page in robots.txt, will it disappear from Google?
Not reliably. Blocking controls crawling, not indexing. A blocked URL can still appear if other pages link to it, usually with no description. To remove a page, allow crawling and use noindex, or remove the page and return a 404 or 410.
Is robots.txt a security measure?
No. The file is public and listing a path simply advertises that it exists. Use authentication, password protection or a restricted environment for sensitive content.
Does one robots.txt file cover my whole site?
It covers one host, protocol and port. Subdomains such as shop.example.com and protocol variants need their own file at their own root.
Can I stack robots.txt with noindex?
You can, but the order matters. The crawler must be allowed to fetch the page to see the noindex. Allow crawling until the page has dropped out of results, then add a block if you want to save crawl effort.
How big can a robots.txt file be?
Crawlers following the standard must parse at least 500 kibibytes. Keeping the file short and well commented is good practice and reduces the chance of errors.
Do I still need robots.txt if my site is small?
Often not in any active sense. A small, clean site can be crawled perfectly well without elaborate rules. A simple file, or none at all, is usually adequate.
Sources
RFC 9309: Robots Exclusion Protocol (Internet Engineering Task Force (IETF)). The formal standard, its September 2022 publication, parsing and caching rules, and the statement that the protocol is not a content security measure.
About /robots.txt (The Web Robots Pages (robotstxt.org)). The original de facto standard, file location, and the point that robots can ignore the file and that it is publicly visible.
Sitemaps XML format and protocol (sitemaps.org). How the Sitemap directive in robots.txt advertises a sitemap location to search engines.
How Google Interprets the robots.txt Specification (Google Search Central). How a major crawler applies RFC 9309, longest match precedence, and the separation of crawling from indexing.
