robots.txt for AEO
robots.txt is the first file AI crawlers check when visiting a domain. A misconfigured robots.txt can make an entire site invisible to AI answer engines. AEO requires the opposite of traditional restrictive configurations — broad allowance for all known AI crawlers with specific denials only where absolutely necessary. Getting this file right is the single most important technical prerequisite for answer engine visibility, and getting it wrong means every other optimization effort is wasted.
Complete robots.txt Template
The following template is a production-ready robots.txt configuration designed specifically for AEO. It explicitly allows every major AI crawler currently operating on the web. Copy this file to the root of your domain and replace yourdomain.com with your actual domain name. The file must be accessible at https://yourdomain.com/robots.txt with no redirects and no authentication requirements.
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

The template begins with a permissive wildcard rule that allows all crawlers by default. Each AI crawler is then listed individually with an explicit Allow: / directive. While the wildcard technically covers them, explicit per-crawler entries eliminate ambiguity and ensure that no crawler-specific deny rule accidentally overrides the default. The Sitemap directive at the bottom points crawlers to your XML sitemap for efficient content discovery.
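Before deploying, you can sanity-check the template locally with Python's standard-library robots.txt parser. The sketch below feeds a trimmed copy of the template (the domain and paths are placeholders) to `urllib.robotparser` and confirms that both an explicitly listed crawler and an unlisted one are permitted:

```python
from urllib import robotparser

# Trimmed copy of the AEO template above, inlined for local testing.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its explicit entry; an unlisted bot falls back to the wildcard.
print(rp.can_fetch("GPTBot", "https://example.com/articles/aeo-guide"))
print(rp.can_fetch("SomeOtherBot", "https://example.com/anything"))
```

Both calls should return True; if either returns False after you edit the file, a deny rule is shadowing your intended configuration.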
What Each Crawler Does
Understanding the purpose of each crawler helps you make informed decisions about which ones to allow. Every crawler in the template above serves a different AI ecosystem, and blocking any single one removes your content from that ecosystem entirely.
- GPTBot — OpenAI's web crawler for ChatGPT search features and model training. Determines whether your content appears in ChatGPT responses.
- ChatGPT-User — The browsing agent activated when a ChatGPT user clicks "Browse." It fetches pages in real-time on behalf of the user, meaning a block here prevents ChatGPT from reading your content live.
- ClaudeBot — Anthropic's crawler for Claude's web access capabilities. Retrieves and reads pages when Claude users ask questions requiring current web information.
- anthropic-ai — Anthropic's training data crawler, separate from ClaudeBot. Collects content used to improve Claude's base knowledge across your topic area.
- PerplexityBot — Perplexity's real-time search crawler and the most aggressive retrieval-path crawler among AI systems. Perplexity directly links to sources, making visibility here especially valuable for referral traffic.
- Google-Extended — Google's AI-specific crawler for Gemini training, entirely separate from Googlebot. Blocking it does not affect search rankings but removes your content from Gemini's training data.
- Googlebot — Google's traditional search crawler, but also the source that feeds Google AI Overviews. Blocking Googlebot eliminates you from both traditional search and AI Overviews.
- Bingbot — Microsoft's search crawler. Content indexed by Bingbot feeds directly into Microsoft Copilot's AI responses across Bing, Edge, and Microsoft 365 products.
- Applebot-Extended — Apple's crawler for Apple Intelligence features including enhanced Siri responses and Spotlight search. Its importance grows as Apple Intelligence expands.
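One practical way to see which of these crawlers actually visit your site is to scan your server access logs for their user-agent strings. The following sketch counts hits per crawler; the log lines are hypothetical samples, and real logs come from your web server or CDN:

```python
from collections import Counter

# Hypothetical access-log lines for illustration only.
LOG_LINES = [
    '203.0.113.7 - - [10/May/2025:12:00:01] "GET /guide HTTP/1.1" 200 "-" "GPTBot/1.2"',
    '198.51.100.4 - - [10/May/2025:12:03:18] "GET / HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '203.0.113.9 - - [10/May/2025:12:07:45] "GET /faq HTTP/1.1" 200 "-" "GPTBot/1.2"',
]

AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai", "PerplexityBot",
    "Google-Extended", "Googlebot", "Bingbot", "Applebot-Extended",
]

# Tally log lines whose user-agent string mentions a known AI crawler.
hits = Counter()
for line in LOG_LINES:
    for bot in AI_CRAWLERS:
        if bot in line:
            hits[bot] += 1

print(dict(hits))
```

A crawler that never appears in your logs despite being allowed in robots.txt is a signal that something upstream (a WAF or CDN rule) is blocking it before it reaches your server.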
Verification Steps
Deploying a robots.txt file is not enough. You must verify that it is actually being served correctly and that no infrastructure layer is overriding it. Follow these four steps after every robots.txt change.
- Check your live robots.txt directly. Open https://yourdomain.com/robots.txt in a browser and confirm the content matches your intended configuration.
- Verify no hosting-level blocks exist. Some CDNs and hosting providers add default robots.txt rules or override your file. Check your provider's documentation to ensure your custom file takes precedence.
- Check WAF and firewall rules. Cloudflare WAF, AWS WAF, and similar services may block bot user-agents at the network level even if robots.txt allows them. A firewall returning 403 to GPTBot renders your robots.txt irrelevant — the crawler never gets past the network layer to read it.
- Test with curl. Simulate an AI crawler request from your terminal to verify you get a 200 response, not a 403 or CAPTCHA challenge.
  ```
  curl -A "GPTBot" https://yourdomain.com/
  ```
  If curl returns a 200 status code and the HTML of your page, the crawler can access your content. If it returns 403, 429, or a Cloudflare challenge page, your firewall or rate limiter is blocking the request regardless of what robots.txt says.
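The status-code triage described in the curl step can be captured as a small helper for use in automated checks. The function name and messages are illustrative, not part of any standard tooling:

```python
def triage_crawler_response(status: int) -> str:
    """Map an HTTP status from a crawler-simulation request to its likely cause."""
    if status == 200:
        return "ok: the crawler can read the page"
    if status in (401, 403):
        return "blocked: a WAF, firewall, or auth layer is rejecting the user-agent"
    if status == 429:
        return "rate-limited: crawler requests are being throttled"
    if 300 <= status < 400:
        return "redirect: robots.txt must be served with no redirects"
    return "unexpected: inspect server and CDN logs"

print(triage_crawler_response(200))
print(triage_crawler_response(403))
```

Wiring this into a scheduled job that fetches your homepage with each AI crawler's user-agent string turns the one-off curl test into continuous monitoring.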
Common Mistakes That Block AI Crawlers
- Wildcard Disallow from old SEO configs. A legacy User-agent: * with Disallow: / blocks every crawler that does not have its own explicit allow rule. Many sites inherited this configuration from years-old SEO setups and never updated it.
- Hosting provider defaults that block non-standard user agents. Some shared hosting environments and security plugins treat any user-agent string they do not recognize as suspicious. AI crawlers like GPTBot and PerplexityBot are still relatively new and may not be on default allow lists.
- WAF rules that return 403 to unknown bots. This is the most insidious mistake because robots.txt means nothing if the firewall blocks the request first. The crawler never reaches your server, so it never reads your robots.txt at all. Always check firewall logs alongside your robots.txt configuration.
- Forgetting the Sitemap directive. Without a Sitemap line in robots.txt, crawlers must discover your pages through link following alone. This is slower, less complete, and means newly published pages may not be discovered for days or weeks.
- Using Crawl-delay with AI bots. Most AI crawlers ignore the Crawl-delay directive, but some interpret its presence as a signal that the site does not welcome automated access. Avoid using Crawl-delay in AI crawler user-agent blocks entirely.
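The mistakes above can be caught with a simple lint pass over your robots.txt before deployment. The sketch below is a heuristic check, not a full robots.txt parser, and the function name is an assumption for illustration:

```python
def lint_robots_txt(text: str) -> list[str]:
    """Flag the common AEO-hostile patterns: blanket Disallow,
    Crawl-delay directives, and a missing Sitemap line."""
    warnings = []
    agent = None
    lines = [line.strip() for line in text.splitlines()]
    for line in lines:
        lower = line.lower()
        if lower.startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip()
        elif lower.startswith("disallow:") and line.split(":", 1)[1].strip() == "/":
            warnings.append(f"blanket 'Disallow: /' under User-agent: {agent}")
        elif lower.startswith("crawl-delay:"):
            warnings.append(f"Crawl-delay under User-agent: {agent} (ignored or read as unwelcoming)")
    if not any(line.lower().startswith("sitemap:") for line in lines):
        warnings.append("no Sitemap directive")
    return warnings

# A legacy SEO-era file trips all three checks.
for warning in lint_robots_txt("User-agent: *\nDisallow: /\nCrawl-delay: 10"):
    print("-", warning)
```

Running a check like this in CI keeps a stale or accidentally reverted robots.txt from silently removing your site from AI answer engines.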
A properly configured robots.txt is the foundation of AI crawler access, but it is only one layer of a complete AEO technical implementation. For a deeper understanding of why AI crawler access matters and how it fits into the broader AEO framework, see Principle 3: AI Crawler Access. Once your robots.txt is verified and serving correctly, the next step is to create an llms.txt file that gives AI systems a structured overview of your site's content and purpose.