- **Parametric memory:** What AI learned during training. If crawlers couldn't access your site when models were built, you're not in the weights.
- **Retrieval-augmented memory:** What AI fetches live when answering questions. Block crawlers and you lose real-time context too.
AI has two memory layers — parametric knowledge from training data, and retrieval-augmented knowledge from live crawling. If your robots.txt blocks AI crawlers, you lose both. Your brand becomes invisible to the models that shape buyer perception.
This isn't theoretical. When we launched Optimly's Brand Directory, OpenAI's crawlers indexed 150+ pages on day one. That content now shapes how ChatGPT describes every brand in our directory. The front door was open — and it mattered.
**150+** pages crawled by OpenAI on our Brand Directory launch day
Copy this, customize the Disallow paths for your site, replace the domain, and upload to your root. That's it.
```
# =============================================================
# AI-Friendly robots.txt Template
# Generated by Optimly — https://optimly.ai
# Last updated: March 2026
# =============================================================

# Default rules for all crawlers
User-agent: *
Allow: /

# Block admin and internal paths
Disallow: /admin/
Disallow: /api/internal/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /login/

# Block URL parameters that create duplicate content
# (robots.txt rules are prefix matches, so the * wildcard is
# needed to catch these parameters on any path, not just /)
Disallow: /search?
Disallow: /*?utm_
Disallow: /*?ref=
Disallow: /*?session=

# =============================================================
# Explicitly allow AI crawlers
# Why: Some crawlers check bot-specific rules first.
# An explicit Allow signals intent — you WANT to be indexed.
# =============================================================

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# You.com
User-agent: YouBot
Allow: /

# Cohere
User-agent: cohere-ai
Allow: /

# Apple
User-agent: Applebot-Extended
Allow: /

# Microsoft / Bing
User-agent: bingbot
Allow: /

# Meta
User-agent: FacebookBot
Allow: /

# =============================================================
# Sitemap — replace with your actual sitemap URL
# =============================================================
Sitemap: https://YOUR-DOMAIN.com/sitemap.xml

# =============================================================
# Companion files for AI discoverability
# llms.txt               → Token-efficient index of your best content
# llms-full.txt          → Extended version with full page context
# ai-agent-manifest.json → Machine-readable brand positioning
# sitemap.xml            → Full crawlable URL structure
# =============================================================
```
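Before uploading, you can sanity-check the rules locally. A minimal sketch using Python's standard-library `urllib.robotparser`; the domain, paths, and the `SomeOtherBot` name are placeholders:

```python
# Quick local check that AI-bot groups behave as intended.
# Note: urllib.robotparser resolves rules first-match, so within a
# group put Disallow lines before any blanket Allow when testing.
from urllib.robotparser import RobotFileParser

# Condensed excerpt of the template: wildcard group plus one AI-bot group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group, so its explicit Allow wins everywhere.
print(parser.can_fetch("GPTBot", "https://example.com/admin/settings"))  # True
# Other bots fall back to the wildcard group and stay out of /admin/.
print(parser.can_fetch("SomeOtherBot", "https://example.com/admin/settings"))  # False
```

Swap in your real file contents (or point `RobotFileParser` at your live URL with `set_url()` and `read()`) and test the paths you care about.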
Four files. One system. Together they control how AI models discover, read, and represent your brand. See the full series →
Known AI crawler user-agents as of March 2026. Bookmark this — we keep it updated.
| User-Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training & inference |
| ChatGPT-User | OpenAI | Live browsing (ChatGPT) |
| OAI-SearchBot | OpenAI | SearchGPT results |
| ClaudeBot | Anthropic | Training & retrieval |
| anthropic-ai | Anthropic | Research crawling |
| Google-Extended | Google | Gemini training |
| GoogleOther | Google | AI features & research |
| PerplexityBot | Perplexity | Answer engine retrieval |
| YouBot | You.com | AI search results |
| cohere-ai | Cohere | Enterprise AI training |
| Applebot-Extended | Apple | Apple Intelligence features |
| bingbot | Microsoft | Search + Copilot retrieval |
| FacebookBot | Meta | Meta AI features |
| CCBot | Common Crawl | Open training datasets |
- **Blocking your blog, docs, or other key content paths.** Removes your highest-value content from AI categorization. These pages contain the signals that shape how models describe your brand.
- **Using `Disallow: /` across the board.** Makes you invisible to every model. You lose both parametric memory and live retrieval — the two ways AI forms opinions about brands.
- **Leaving tracking parameters crawlable.** Creates thousands of duplicate pages. Crawlers waste budget on `?utm_`, `?ref=`, and `?session=` variants instead of your real content.
- **Omitting the Sitemap directive.** Crawlers miss important pages. Without an explicit Sitemap line, bots rely on link discovery alone — and skip orphaned pages entirely.
- **Contradicting your llms.txt.** Conflicting signals confuse crawlers. If robots.txt blocks a path that llms.txt links to, neither file is trusted.
- **Relying on `Crawl-delay`.** Throttles how much AI indexes per visit. Handle rate limiting at the CDN layer instead — it's more precise and doesn't penalize legitimate crawlers.
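The robots.txt/llms.txt conflict is easy to catch automatically. A minimal sketch, assuming llms.txt lists its URLs as markdown-style links; every URL and path below is illustrative:

```python
# Flag any llms.txt URL that robots.txt would block for crawlers.
# urllib.robotparser resolves rules first-match, so the Disallow line
# is placed before the blanket Allow for this local check.
import re
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Allow: /
"""

LLMS_TXT = """\
# Example Corp
- [Pricing](https://example.com/pricing)
- [Draft post](https://example.com/drafts/unfinished)
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pull markdown-link targets out of llms.txt, keep the blocked ones.
links = re.findall(r"\((https?://[^)]+)\)", LLMS_TXT)
conflicts = [url for url in links if not parser.can_fetch("*", url)]
print(conflicts)  # ['https://example.com/drafts/unfinished']
```

Run this against your real files whenever either one changes; an empty `conflicts` list means the two files agree.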
1. Download the template above
2. Customize the Disallow paths for your site structure
3. Replace `YOUR-DOMAIN.com` with your actual domain
4. Upload to your site root (must be at `/robots.txt`)
5. Verify in Google Search Console
6. Set up llms.txt and BrandVault companion files
7. Monitor crawl activity in your server logs
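The last step, monitoring crawl activity, can be sketched as a simple log tally. The crawler list mirrors the table earlier; the sample access-log lines are fabricated placeholders:

```python
# Count AI-crawler requests in a combined-format access log by
# matching known user-agent substrings (case-insensitive).
from collections import Counter

AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "GoogleOther", "PerplexityBot", "YouBot", "cohere-ai",
    "Applebot-Extended", "bingbot", "FacebookBot", "CCBot",
]

SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2026:10:01:44 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '203.0.113.9 - - [12/Mar/2026:10:02:01 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '198.51.100.4 - - [12/Mar/2026:10:02:30 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/122.0"',
]

hits = Counter()
for line in SAMPLE_LOG:
    for bot in AI_CRAWLERS:
        if bot.lower() in line.lower():
            hits[bot] += 1
            break  # attribute each request line to one crawler

print(hits.most_common())  # [('GPTBot', 1), ('PerplexityBot', 1)]
```

In practice you would read lines from your real log file instead of `SAMPLE_LOG`; substring matching is a rough heuristic, so pair it with reverse-DNS or published IP-range checks if you need to rule out spoofed user-agents.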
Your robots.txt opens the door. But what are AI models actually saying once they walk in? Search our directory to find out.