
How to Fix robots.txt Rules That Block ChatGPT and Other AI Crawlers

A technical step-by-step guide to finding and fixing robots.txt directives that silently block GPTBot, ClaudeBot, and PerplexityBot from accessing your content. Includes common patterns, CDN pitfalls, and a testing workflow.

Feb 21, 2026 · 11 min read · Technical SEO teams, developers, and site operators

Why your robots.txt is probably blocking AI crawlers right now

Most robots.txt files were written before AI crawlers existed. They were designed to control Googlebot, Bingbot, and a handful of other search engine crawlers. When OpenAI launched GPTBot in 2023 and Anthropic launched ClaudeBot shortly after, these new bots inherited whatever default rules your robots.txt already had — and in many cases, those defaults silently block them.

Three patterns cause most AI crawler blocks: a blanket 'User-agent: * / Disallow: /' that shuts out everything, security plugins that auto-generate restrictive rules, and overly cautious wildcard patterns that were meant to block scrapers but also match legitimate AI bot user agent strings.

Unlike Google Search, where being blocked means you drop in rankings gradually, AI crawler blocks are binary. If GPTBot can't access your site, your content doesn't exist in ChatGPT's retrievable knowledge. There is no partial visibility — you're either accessible or you're not.

Step 1: Check your current robots.txt

Open yourdomain.com/robots.txt in your browser right now. Look for these specific patterns.

Pattern A — Global block: 'User-agent: *' followed by 'Disallow: /' on the next line. This blocks every crawler that isn't specifically allowed elsewhere in the file. If you have this pattern and no specific allow rules for AI bots, you are invisible to every AI platform.

Pattern B — Bot-specific block: Look for lines mentioning GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Bytespider, or CCBot with Disallow rules. Some WordPress security plugins and Cloudflare configurations auto-generate these blocks.

Pattern C — Wildcard traps: Rules like 'Disallow: /*?' or 'Disallow: /*.json' can accidentally block AI crawlers from accessing query-based pages or structured data files. Even 'Disallow: /api/' blocks WebMCP-style endpoints that AI agents need.
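In a robots.txt file, the three patterns typically look like this (the comments are annotations; the wildcard rules are examples of over-broad matching):

```
# Pattern A — global block: denies every crawler not named elsewhere
User-agent: *
Disallow: /

# Pattern B — bot-specific block, often added by a security plugin
User-agent: GPTBot
Disallow: /

# Pattern C — wildcard traps that over-match
User-agent: *
Disallow: /*?
Disallow: /*.json
Disallow: /api/
```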

Step 2: Understand which bots to allow

Not every bot needs access, but blocking the wrong ones has measurable business impact. Here's how to think about it.

GPTBot, ChatGPT-User, and OAI-SearchBot are three separate OpenAI bots. GPTBot crawls for training data. ChatGPT-User is the live browsing agent. OAI-SearchBot powers SearchGPT. Blocking GPTBot alone still allows live browsing; blocking ChatGPT-User prevents users from viewing your site through ChatGPT. Most businesses should allow all three.

ClaudeBot (Anthropic) and PerplexityBot (Perplexity AI) are similarly important. Claude is one of the most widely used AI assistants, and Perplexity is among the fastest-growing AI search engines. Blocking either one removes your site from a significant AI distribution channel.

Google-Extended is specifically for Gemini and Google's AI features, separate from Googlebot. Blocking Google-Extended does NOT affect your Google Search rankings. But it does remove your content from Gemini's knowledge and from grounding in other Google AI products. (Per Google's crawler documentation, AI Overviews in Search are governed by Googlebot, not Google-Extended.)
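In practice the split looks like this (a sketch; Google-Extended is a product token you reference in robots.txt, while the fetching itself is still done by Google's regular crawlers):

```
# Normal Search crawling — rankings unaffected
User-agent: Googlebot
Allow: /

# Opt out of Gemini and Google AI grounding
User-agent: Google-Extended
Disallow: /
```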

Step 3: Write the fix

The safest approach is explicit allow rules for each AI crawler you want to let in. Add these blocks to your robots.txt, each with its own User-agent line. For each bot, specify 'User-agent: GPTBot' (or the relevant name) followed by 'Allow: /'.
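A minimal set of allow sections, assuming you want all the major AI crawlers in (bot names as published by each vendor; blank lines separate the groups, and the '/private/' path is a hypothetical example):

```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Your existing rules for all other bots stay in their own group:
User-agent: *
Disallow: /private/
```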

If you have a global 'Disallow: /' under 'User-agent: *', add a separate named section for each AI crawler. Under the Robots Exclusion Protocol (RFC 9309), each crawler follows only the group whose User-agent line matches it most specifically and ignores the rest, regardless of where the groups appear in the file. A 'User-agent: GPTBot / Allow: /' section therefore allows GPTBot while 'User-agent: * / Disallow: /' continues to block unnamed bots; placing the named sections before the global block is a readability convention, not a requirement.

Avoid using 'Allow: /' under 'User-agent: *' as a fix — this opens your site to every bot including aggressive scrapers. Instead, keep the global block and add explicit allow sections for each AI bot you want to let through.
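You can sanity-check this precedence locally with Python's standard-library robots.txt parser before deploying. A minimal sketch, feeding the rules in as strings:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The named group governs GPTBot; the wildcard group governs everyone else.
print(parser.can_fetch("GPTBot", "/some-page"))        # True: named group wins
print(parser.can_fetch("RandomScraper", "/some-page")) # False: falls through to the global block
```

The same two calls against your live file (via `parser.set_url(...)` and `parser.read()`) make a quick pre-deploy regression check.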

Step 4: Check for CDN and WAF overrides

This is the step most guides skip, and it's where many teams waste hours debugging. Your robots.txt can be perfectly configured while your CDN or WAF silently blocks AI crawlers at the network level.

Cloudflare Bot Fight Mode is the most common culprit. It classifies many AI crawlers as 'automated threats' and serves them challenge pages or 403 errors. The bot never sees your robots.txt or your content — it gets blocked before reaching your server. To fix this, go to Cloudflare Security > Bots and create custom rules that skip bot fight mode for specific AI crawler user agents.
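As a sketch, the exception can be expressed in Cloudflare's Rules language with a filter like the one below, attached to a custom rule whose action is 'Skip' for your bot-protection features (the exact skip options available vary by plan, so treat this as a starting point):

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "PerplexityBot")
```

Matching on user agent alone is spoofable; where your plan supports it, matching on Cloudflare's verified-bot signal is stricter.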

AWS CloudFront with AWS WAF, Akamai Bot Manager, and Vercel's built-in DDoS protection can all have similar effects. Check your CDN's bot management or firewall rules for any AI crawler user agent strings that might be blocked or challenged.

The definitive test: check your server access logs (or CDN analytics) for requests from these user agents. If you see 200 response codes, they're getting through. If you see 403, 429, or no entries at all, you have a network-level block.
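That log check can be scripted against a standard combined-format access log (a sketch; adjust the regexes to your log format, and note the sample lines below are invented):

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")
STATUS_RE = re.compile(r'" (\d{3}) ')   # status code follows the quoted request line
UA_RE = re.compile(r'"([^"]*)"\s*$')    # user agent is the last quoted field

def ai_bot_hits(log_lines):
    """Count (bot, status) pairs for AI crawlers found in access-log lines."""
    hits = Counter()
    for line in log_lines:
        status = STATUS_RE.search(line)
        ua = UA_RE.search(line)
        if not status or not ua:
            continue
        for bot in AI_BOTS:
            if bot in ua.group(1):
                hits[(bot, status.group(1))] += 1
    return hits

# Invented sample lines for illustration:
sample = [
    '203.0.113.5 - - [21/Feb/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '203.0.113.9 - - [21/Feb/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 403 0 "-" "ClaudeBot/1.0"',
]
print(ai_bot_hits(sample))  # GPTBot got a 200 (fine); ClaudeBot got a 403 (blocked upstream)
```

Feed it your real log with `ai_bot_hits(open("/var/log/nginx/access.log"))` and look for any bot whose counts are all 4xx, or missing entirely.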

Step 5: Verify and monitor

After deploying your robots.txt changes and CDN rule updates, verify that each bot can actually access your pages.

For a quick check, use curl with a custom user agent: request a page with the User-Agent header set to 'GPTBot/1.0' and check that you receive a 200 response with your actual page content (not a challenge page or redirect). Repeat for ClaudeBot, PerplexityBot, and other bots you've allowed.
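The same check scripted with Python's standard library (a sketch; the user agent strings here are abbreviated stand-ins, since the real crawler UA strings are longer, but blocks that match on the bot-name substring will treat them the same):

```python
import urllib.error
import urllib.request

# Abbreviated stand-ins for the published crawler user agent strings.
AI_USER_AGENTS = {
    "GPTBot": "Mozilla/5.0; compatible; GPTBot/1.0",
    "ClaudeBot": "Mozilla/5.0; compatible; ClaudeBot/1.0",
    "PerplexityBot": "Mozilla/5.0; compatible; PerplexityBot/1.0",
}

def status_for(url, user_agent):
    """Return the HTTP status code the given user agent receives for url."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 403, 429, etc. indicate a network-level block

# Usage against your own site (expect 200 for each allowed bot):
#   for name, ua in AI_USER_AGENTS.items():
#       print(name, status_for("https://yourdomain.com/", ua))
```

A 200 with your real page body passes; a 200 that returns a challenge page does not, so spot-check the content too.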

For ongoing monitoring, set up a weekly automated check. robots.txt changes can revert after CMS updates, plugin reinstalls, or CDN configuration syncs. What worked last week might be broken today because a WordPress plugin update regenerated the file.

An AI visibility audit tool can automate this entire verification — testing all AI crawler user agents against your robots.txt rules, CDN behavior, and actual HTTP responses in a single scan.

WordPress-specific pitfalls

WordPress sites are disproportionately affected because plugins frequently modify robots.txt without user awareness. Yoast SEO, All in One SEO, Wordfence, iThemes Security, and Sucuri all have settings that can add bot blocks.

Wordfence's 'Block Fake Crawlers' feature can misidentify AI crawlers. iThemes Security's 'Bot Fight' mode blocks unknown user agents. Some caching plugins add restrictive robots.txt rules when serving cached versions.

After fixing your robots.txt, check it again after the next plugin or core WordPress update. When no physical robots.txt file exists, WordPress generates the file dynamically on each request and plugins filter that output, so your changes might not survive an update unless they're made through the plugin's settings rather than by editing the file directly.

Execution Checklist

  • Open yourdomain.com/robots.txt and check for global disallow, bot-specific blocks, and wildcard traps.
  • Add explicit 'User-agent: GPTBot / Allow: /' sections for each AI crawler you want to allow.
  • Add a separate named section per AI bot; a 'User-agent: GPTBot' group overrides 'User-agent: *' for that bot, wherever it appears in the file.
  • Check CDN/WAF settings (Cloudflare Bot Fight Mode, AWS WAF) for silent bot blocking.
  • Test with curl using AI crawler user agent strings to verify 200 responses.
  • Review WordPress plugin settings for auto-generated bot blocks (Wordfence, Yoast, security plugins).
  • Set up weekly monitoring — robots.txt can revert after CMS or plugin updates.

FAQ

Should I allow every AI crawler in robots.txt?

For most businesses, yes. The visibility benefits of being recommended by ChatGPT, Claude, Gemini, and Perplexity outweigh the costs for sites with public content. Consider blocking specific crawlers only if you have paywall content, licensing restrictions, or compete directly with the AI platform's own products. Even then, consider allowing retrieval bots (which cite you in real-time answers) while blocking training bots.

Can a CDN or firewall override my robots.txt settings?

Yes, and this is extremely common. CDN bot management (Cloudflare Bot Fight Mode, AWS WAF rules, etc.) operates at the network level before the request reaches your server. A crawler can be allowed by robots.txt but blocked by a firewall rule that returns a 403 or CAPTCHA challenge. Always verify actual HTTP responses, not just robots.txt rules.

I fixed robots.txt but ChatGPT still doesn't mention my site. Why?

robots.txt is necessary but not sufficient. After allowing crawlers, it takes time for them to revisit your site and update their indexes. For retrieval-based systems (Perplexity, ChatGPT Browse), this can happen within days. For training-based knowledge (model memory), it may take weeks to months depending on retraining schedules. Also check that your site has structured data, llms.txt, and content specific enough for AI models to cite confidently.
