Technical SEO
How to Allow ChatGPT, Claude, and Perplexity to Crawl Your Website
A copy-paste guide to letting every major AI crawler into your site: robots.txt snippets, CDN allow rules, and a verification checklist. Covers GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended.
Why this is the first thing you should do
Every other AI visibility tactic — structured data, llms.txt, content rewrites — is wasted effort if AI crawlers cannot reach your pages. Allowing the major AI bots is the baseline that everything else builds on. It is also the change that delivers the largest short-term lift, because once a previously blocked site becomes accessible, retrieval-based systems like ChatGPT Browse and Perplexity start citing it within days.
The goal of this guide is simple: give you the exact rules and configuration steps to open your site to every major AI crawler without opening it to the entire internet. By the end, a curl request using any of the six AI bot user agents should return a 200 response with your actual HTML — not a challenge page, not a 403, not a redirect to your homepage.
This is not a 'should I allow AI crawlers?' article. It assumes you have decided the answer is yes. If you're still weighing the trade-offs between visibility and content licensing, read our post on blocking AI bots first, then come back here.
The six user agents that matter in 2026
AI bot identification has stabilized over the last year. Six crawlers now account for almost all AI visibility traffic, and allowing them covers every major AI assistant.
- GPTBot — OpenAI's training crawler. It collects content used to improve future versions of GPT. Allowing GPTBot is what gets your site into ChatGPT's baseline knowledge. It is distinct from the live browsing agent.
- ChatGPT-User — The live browsing agent that fetches pages in real time when a ChatGPT user clicks a citation or asks a question that triggers browsing. This is the bot that determines whether your site can be cited in the current conversation.
- OAI-SearchBot — OpenAI's crawler for SearchGPT, the search product integrated into ChatGPT. It works similarly to Googlebot: it builds a search index that SearchGPT queries to retrieve pages for users.
- ClaudeBot — Anthropic's crawler, covering both training and real-time retrieval for Claude. Claude is the second most common AI assistant for professional and coding use cases, so missing it means missing a highly commercial audience.
- PerplexityBot — Perplexity AI's crawler. Perplexity has the highest click-through rate of any AI platform because its interface is explicitly designed around citation links, so being indexed here delivers direct trackable traffic in addition to training value.
- Google-Extended — A pseudo-bot that signals whether your content can be used by Google's generative AI products (Gemini, AI Overviews). Crucially, it is separate from Googlebot — allowing it does NOT affect Google Search rankings, and blocking it does NOT remove you from Google Search.
The robots.txt rules to add
This is the minimum viable allow configuration. Add it near the top of your robots.txt file, above any global 'User-agent: *' section. Each AI bot gets its own named section with a bare 'Allow: /' directive so the rule applies to every path on your site. Each section is two lines — a 'User-agent:' line followed by an 'Allow: /' line — with a blank line separating sections.
Write one section for each of the six bots below. The exact user agent tokens matter: type them with the capitalization and hyphens shown.
- User-agent: GPTBot — followed by Allow: /
- User-agent: ChatGPT-User — followed by Allow: /
- User-agent: OAI-SearchBot — followed by Allow: /
- User-agent: ClaudeBot — followed by Allow: /
- User-agent: PerplexityBot — followed by Allow: /
- User-agent: Google-Extended — followed by Allow: /
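Put together, the six sections above look like this in robots.txt:

```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```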
Why named sections beat a wildcard allow
If you already have a global 'User-agent: *' block with 'Disallow: /', leave it in place. Named sections take precedence over wildcard sections for the specific bot they target. GPTBot will follow the GPTBot section and ignore the wildcard; other unnamed bots will still hit the disallow. This is the cleanest way to selectively open your site to the AI crawlers you want while keeping generic scrapers out.
Avoid the shortcut of changing your wildcard to 'User-agent: * / Allow: /'. It solves the AI visibility problem but opens your site to every bot on the internet, including aggressive content harvesters and data resellers you may not want. Six named sections is three extra minutes of work and a much safer posture.
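Concretely, a robots.txt that stays default-deny while welcoming the AI crawlers is laid out like this (two of the six named sections shown for brevity):

```
# Named sections first: these bots get full access.
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# ...repeat for ChatGPT-User, OAI-SearchBot, PerplexityBot, Google-Extended...

# Wildcard last: every other bot stays blocked.
User-agent: *
Disallow: /
```

Section order is for human readability; crawlers match on the most specific User-agent group regardless of where it sits in the file.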
The CDN rules almost everyone forgets
robots.txt is a suggestion. A firewall rule is a blockade. Almost every site we audit that has 'correct' robots.txt still has AI crawlers being blocked at the network layer by the CDN or WAF. The crawler never reaches your origin, never sees your rules, and never sees your content.
Cloudflare is the biggest culprit. Bot Fight Mode and Super Bot Fight Mode both classify unfamiliar crawlers as potential threats. In Cloudflare's dashboard, go to Security > Bots. On the free plan, Bot Fight Mode cannot be bypassed with custom rules, so the only reliable fix is to turn it off. On paid plans, set Super Bot Fight Mode's 'Verified bots' option to Allow, and add a WAF custom rule with the Skip action for requests where the User-Agent contains GPTBot, ClaudeBot, PerplexityBot, ChatGPT-User, OAI-SearchBot, or Google-Extended. Cloudflare maintains a verified bots list that covers the major AI crawlers — confirm in your analytics that they are not being categorized as 'likely automated'.
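As a sketch, the expression for such a Skip rule could look like the following in Cloudflare's rule language (`http.user_agent` is Cloudflare's request-header field; verify the exact syntax against your dashboard's expression editor):

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "OAI-SearchBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Google-Extended")
```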
AWS CloudFront with WAF and Akamai Bot Manager have equivalent settings. Vercel's firewall allows per-UA allow rules through its project settings. Fastly requires a VCL change to whitelist specific user agents. Whatever you use, check its bot management documentation for how to whitelist verified AI crawlers — the keyword to search for is typically 'verified bots' or 'good bots'.
The fastest way to confirm whether your CDN is the problem is to look at your access logs. Filter for the six user agents above. If you see no requests at all, your CDN is silently dropping them. If you see requests returning 403, 429, or 503, your CDN is actively rejecting them. If you see 200s, you're in good shape.
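A quick way to run that filter is a shell loop over the six user agents, tallying status codes per bot. This sketch builds a tiny sample log in combined log format so it runs as-is; in practice, point LOG at your real access log (the path and sample entries here are illustrative assumptions):

```shell
# Sample access log so the filter below is runnable as-is.
# In practice: LOG=/var/log/nginx/access.log (or your server's path).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [10/Jan/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot"
5.6.7.8 - - [10/Jan/2026:10:01:00 +0000] "GET /docs HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
9.9.9.9 - - [10/Jan/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"
EOF

# For each AI user agent, count requests per HTTP status code.
# Field 9 of the combined log format is the status code.
for ua in GPTBot ChatGPT-User OAI-SearchBot ClaudeBot PerplexityBot Google-Extended; do
  echo "== $ua =="
  grep "$ua" "$LOG" | awk '{print $9}' | sort | uniq -c
done
```

An empty section means the crawler never reached your origin at all; a column of 403s or 429s means it arrived and was rejected.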
Verification: prove each bot can actually reach your site
Don't deploy and assume. Run a verification curl for each of the six user agents against a real content page (not just your homepage). The command is: curl -A 'GPTBot/1.0 (+https://openai.com/gptbot)' https://yourdomain.com/important-page — then repeat with each of the other five user agent strings.
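The repetition can be scripted. A minimal sketch, assuming bash: it prints the six curl commands rather than running them, so you can eyeball them or pipe the output to `sh`. The UA strings are shortened test tokens (enough to trigger UA-based firewall rules, not the bots' full headers), and the comment URLs and target domain are placeholder assumptions to replace:

```shell
# Replace with a real content page on your site.
DOMAIN="https://yourdomain.com/important-page"

# Shortened UA tokens; comment URLs are approximate, not the full headers.
AGENTS=(
  "GPTBot/1.0 (+https://openai.com/gptbot)"
  "ChatGPT-User/1.0 (+https://openai.com/bot)"
  "OAI-SearchBot/1.0 (+https://openai.com/searchbot)"
  "ClaudeBot/1.0 (+claudebot@anthropic.com)"
  "PerplexityBot/1.0 (+https://perplexity.ai/perplexitybot)"
  "Google-Extended"
)

# Emit one curl per agent:
# -s silences progress, -o discards the body, -w prints only the status code.
for ua in "${AGENTS[@]}"; do
  printf "curl -s -o /dev/null -w '%%{http_code}' -A '%s' %s\n" "$ua" "$DOMAIN"
done
```

Remember that a bare status code is not the whole story: also fetch the body for at least one agent to confirm you get real HTML rather than a challenge page.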
A passing result is a 200 response with your actual HTML content in the body. Failing results to watch for: a 403 Forbidden (firewall rejection), a 200 with a Cloudflare challenge page in the body (JS challenge), a 429 (rate limit), or a redirect to a bot-detection page. Any of these means the bot is being blocked even though your robots.txt says it's allowed.
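Interpreting the codes can be scripted too. A small sketch mirroring the failure modes above (note that a 200 still needs a body check, since a challenge page can return 200):

```shell
# Map an HTTP status code from a bot-UA request to a pass/fail verdict.
classify() {
  case "$1" in
    200)             echo "pass: crawler reached content (still verify the body is real HTML)" ;;
    403)             echo "fail: firewall or WAF rejection" ;;
    429)             echo "fail: rate limited" ;;
    301|302|307|308) echo "fail: redirected, possibly to a bot-detection page" ;;
    *)               echo "fail: unexpected status $1" ;;
  esac
}

classify 200
classify 403
```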
Once all six bots return clean 200s, test a page with your most important structured data — a product page for ecommerce, a docs page for SaaS, a service page for local business. This confirms that the crawler can reach not just your homepage but the pages that actually matter for AI visibility.
What happens next, and when to expect results
Perplexity reindexes quickly. Within a week of being allowed, most sites see their pages start appearing as Perplexity citation sources for queries they're relevant to. This is the fastest feedback loop and the best way to confirm your allow configuration is working.
ChatGPT Browse picks up changes within days as well, though it only crawls a page when a user query triggers browsing. You can nudge this by searching for specific facts from your site in ChatGPT and watching whether it fetches your page in response. If ChatGPT browses and the result includes your citation, the bot is reaching you.
Training-based knowledge (the version of GPT-4, Claude, or Gemini that runs when browsing is off) updates on a slower schedule — weeks to months, depending on when the model is next retrained. Don't expect to appear in 'cold' model answers immediately, but every day your site is open to GPTBot and ClaudeBot increases the odds that your content ends up in the next training run.
Execution Checklist
- Open yourdomain.com/robots.txt and add allow sections for GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended.
- Place AI bot allow sections ABOVE any global 'User-agent: * / Disallow: /' block.
- Disable Bot Fight Mode or create a skip rule for AI crawler user agents in Cloudflare (or your equivalent CDN).
- Check server access logs for the six user agents — confirm 200 responses, not 403/429/block pages.
- Run curl verification with each user agent against a real content page, not just the homepage.
- Test a page with structured data to confirm the crawler can reach templated pages, not just static ones.
- Watch Perplexity over the next week for your site appearing as a citation source for relevant queries.
FAQ
Is one 'User-agent: * / Allow: /' enough, or do I need to name every bot?
Technically a wildcard allow works, but it opens your site to every bot including aggressive scrapers and content harvesters you may not want. Named allow sections are safer because they let you keep a default-deny posture for unknown bots while explicitly welcoming the AI crawlers you care about. The extra three minutes of config is worth it.
Does allowing Google-Extended hurt my Google Search rankings?
No. Google-Extended is separate from Googlebot. Googlebot handles search indexing; Google-Extended signals consent for generative AI use. Allowing Google-Extended only affects your presence in Gemini, AI Overviews, and similar Google AI products. Blocking it does not remove you from Google Search, and allowing it does not change your rankings.
Should I also allow ByteDance's Bytespider or Common Crawl's CCBot?
That depends on your goals. Bytespider feeds ByteDance's AI products (including Doubao, which is large in China). CCBot (Common Crawl) is used by many researchers and smaller AI projects to build datasets. Allowing them increases your reach in long-tail AI applications. Blocking them is reasonable if you're concerned about bandwidth costs or content licensing — neither is critical for the top AI products covered in this guide.
How often should I re-check my robots.txt and CDN rules?
At least monthly, and after every CMS update, plugin install, or infrastructure change. WordPress plugins, security tools, and hosting providers frequently regenerate robots.txt or modify bot-management rules without notifying you. We've seen sites go from fully accessible to completely blocked overnight because a WAF update changed its default bot category. Set a recurring reminder, or use a monitoring tool that alerts you when any of the six AI bots stops seeing 200 responses.