AI Crawler User Agents in 2026: GPTBot, ClaudeBot, and Every Bot You Need to Know
A complete reference of every AI crawler user agent string, what each one does, how they respect robots.txt, and how to configure access for maximum AI visibility.
Why AI crawlers matter more than ever
Search engine crawlers have indexed the web for decades, but a new generation of AI-specific crawlers now determines whether your content appears in ChatGPT answers, Claude responses, Gemini summaries, and Perplexity search results. These crawlers operate differently from Googlebot: they fetch pages to build training datasets, populate retrieval-augmented generation (RAG) indexes, or answer user queries in real time.
If your robots.txt blocks one of these bots, your site silently disappears from that AI platform's knowledge base. Unlike traditional search where you drop in rankings, AI exclusion is binary — you're either in the model's context window or you're not.
The complete user agent reference
Below is every known AI crawler user agent as of early 2026, grouped by company. Each entry includes the bot name, what it does, whether it respects robots.txt, and how frequently it typically crawls.
- GPTBot (OpenAI) — Fetches pages for ChatGPT's web browsing and training data. Respects robots.txt. User agent: 'GPTBot/1.0'. Typical crawl: daily to weekly depending on site authority.
- OAI-SearchBot (OpenAI) — Dedicated to OpenAI's SearchGPT product for real-time search results. Respects robots.txt. Separate from GPTBot, so blocking one doesn't block the other.
- ChatGPT-User (OpenAI) — The live browsing agent used when a ChatGPT user clicks 'Browse the web'. Respects robots.txt. Fetches single pages on demand, not bulk crawling.
- ClaudeBot (Anthropic) — Crawls pages for Claude's training and retrieval. Respects robots.txt. User agent: 'ClaudeBot/1.0'. Moderate crawl frequency.
- Google-Extended (Google) — A robots.txt control token (you won't see it as a user agent in your logs) that governs whether your content is used for Gemini and other Google AI products, separate from regular Google Search indexing. Honored by Google's crawlers. Blocking this does NOT affect your Google Search rankings.
- GoogleOther (Google) — General-purpose Google crawler for non-search products including AI training. Respects robots.txt.
- PerplexityBot (Perplexity AI) — Fetches pages for Perplexity's real-time AI search engine. Respects robots.txt. High crawl frequency on news and frequently updated sites.
- Bytespider (ByteDance) — TikTok's parent company crawler, used for AI training. Respects robots.txt. Known for aggressive crawl rates.
- CCBot (Common Crawl) — Open web crawl dataset used to train many open-source and commercial AI models. Respects robots.txt. Quarterly full crawls.
- Amazonbot (Amazon) — Crawls for Alexa AI answers and Amazon product intelligence. Respects robots.txt.
- FacebookBot (Meta) — Used for Meta AI features and training data. Respects robots.txt.
- Applebot-Extended (Apple) — Separate from standard Applebot (Siri/Safari). Used for Apple Intelligence features. Respects robots.txt.
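To make the reference concrete, here is a minimal sketch of classifying a raw `User-Agent` header against the tokens above. Real UA strings wrap the token in a longer Mozilla-style string, so substring matching is used; note that Google-Extended and Applebot-Extended are robots.txt control tokens and won't normally appear in log UAs.

```python
# Sketch: map a raw User-Agent header to the AI crawler tokens listed above.
AI_CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "GoogleOther": "Google",
    "PerplexityBot": "Perplexity AI",
    "Bytespider": "ByteDance",
    "CCBot": "Common Crawl",
    "Amazonbot": "Amazon",
    "FacebookBot": "Meta",
}

def identify_ai_crawler(user_agent: str):
    """Return (token, company) if the UA matches a known AI crawler, else None."""
    for token, company in AI_CRAWLER_TOKENS.items():
        if token.lower() in user_agent.lower():
            return token, company
    return None

ua = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(identify_ai_crawler(ua))  # -> ('GPTBot', 'OpenAI')
```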
How to configure robots.txt for AI crawlers
The safest approach is explicit allow rules for the crawlers you want, rather than relying on the absence of deny rules. This makes your intent clear and protects against future changes when new bots appear.
A common mistake is a wildcard rule ('User-agent: *' with 'Disallow: /') that blocks every unknown bot by default, including crawlers from platforms where your audience actively searches. Instead, whitelist the bots you want to allow and block only the specific bots you have reason to exclude.
Another frequent issue is blocking GPTBot but forgetting ChatGPT-User and OAI-SearchBot. These are three separate user agents from OpenAI. Allowing GPTBot but blocking ChatGPT-User means your pages can be in ChatGPT's training data but users can't browse your site through ChatGPT's live web access.
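Putting both points together, a robots.txt along these lines (a sketch, not a drop-in file; adjust the bot list to your own policy) makes intent explicit and covers all three OpenAI agents:

```
# Explicitly allow the AI crawlers you want — all three OpenAI agents,
# plus Anthropic, Perplexity, and Google's AI control token:
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /

# Block only specific bots you have a concrete reason to exclude:
User-agent: Bytespider
Disallow: /

# Default for everything else:
User-agent: *
Allow: /
```

Grouping several `User-agent` lines above one rule block is valid under the Robots Exclusion Protocol (RFC 9309) and keeps the file readable as new bots are added.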
Verifying crawler access actually works
Adding robots.txt rules is necessary but not sufficient. CDN-level bot protection (Cloudflare Bot Management, AWS WAF, Akamai) can silently block AI crawlers even when robots.txt allows them. The bot receives a 403 or challenge page instead of your content.
To verify access: check your server access logs for each bot's user agent string and confirm they receive 200 responses. If you use Cloudflare, check the Firewall Events tab for any AI crawler blocks. For sites behind a WAF, create explicit allow rules for the user agent strings you want to let through.
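The log check above can be sketched in a few lines of Python. The log path and combined log format are assumptions; adjust the regex to your server's actual format.

```python
# Sketch: scan combined-format (Apache/Nginx) access log lines and tally
# HTTP status codes per AI crawler user agent.
import re
from collections import Counter

AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
           "PerplexityBot", "Bytespider", "CCBot", "Amazonbot"]

# Combined log: ... "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
LINE_RE = re.compile(r'"\w+ \S+ HTTP/[\d.]+" (\d{3}) .*"([^"]*)"$')

def tally(log_lines):
    """Count (bot, status) pairs across the given log lines."""
    counts = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        status, ua = m.group(1), m.group(2)
        for bot in AI_BOTS:
            if bot in ua:
                counts[(bot, status)] += 1
    return counts

sample = [
    '1.2.3.4 - - [10/Jan/2026:12:00:00 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [10/Jan/2026:12:01:00 +0000] "GET /blog HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(tally(sample))
```

A 403 showing up for a bot you allow in robots.txt is the signature of a CDN or WAF block.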
The most reliable test is to run an AI visibility audit that checks both robots.txt rules and actual HTTP response codes for each crawler.
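The robots.txt half of such an audit can be done with Python's standard-library `urllib.robotparser`. The inline robots.txt and URL below are illustrative assumptions; in practice you would point the parser at your live file with `set_url()` and `read()`.

```python
# Sketch: check which AI crawler tokens a robots.txt permits for a URL.
# This verifies only the robots.txt layer; a CDN/WAF can still return 403,
# so pair it with the access-log check described above.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot",
           "ClaudeBot", "PerplexityBot", "Bytespider"]

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, "https://example.com/blog/post")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Bots without their own group (ClaudeBot, PerplexityBot, and the other OpenAI agents here) fall through to the `User-agent: *` rule, which is exactly the behavior the wildcard-rule warning above is about.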
Crawler behavior differences that affect your strategy
Not all AI crawlers work the same way. Understanding the differences helps you make better access decisions.
Training crawlers like GPTBot, ClaudeBot, and CCBot fetch your content to include in model training datasets. Once your content is in the training data, the model 'knows' it without fetching again. Blocking these bots after your content is already in the training set has no retroactive effect.
Retrieval crawlers like PerplexityBot and OAI-SearchBot fetch pages in real time or near-real-time to answer specific user queries. Blocking these has immediate impact — your pages stop appearing in answers within hours.
Live browsing agents like ChatGPT-User fetch single pages on demand when a user asks the model to visit a URL. Blocking this prevents users from having the assistant open and analyze your URLs mid-conversation.
A practical decision framework
For most businesses, the right answer is to allow all AI crawlers. The visibility benefits outweigh the costs. Your content is already public, and AI recommendations drive measurable referral traffic.
Consider blocking specific bots only if you have a concrete business reason: you sell content behind a paywall, you have licensing restrictions, or you compete directly with the AI platform's own products. Even then, block the specific training crawlers while keeping retrieval and browsing bots allowed, so your site can still be referenced in real-time answers.
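For that selective case, the robots.txt might look like the following sketch (the split between training and retrieval bots follows the behavior descriptions above; adapt it to your own licensing situation):

```
# Block training crawlers (no retroactive effect on data already trained on):
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

# Keep retrieval and live-browsing bots so real-time answers can still cite you:
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
```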
Execution Checklist
- Audit your robots.txt for blanket wildcard rules that may block unknown bots.
- Add explicit allow rules for GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended.
- Check CDN and WAF firewall rules — they override robots.txt.
- Verify actual 200 responses for each bot in server access logs.
- Review and update rules quarterly as new AI crawlers launch.
FAQ
Does blocking GPTBot affect my Google Search rankings?
No. GPTBot is OpenAI's crawler and has no effect on Google Search. Similarly, blocking Google-Extended only affects Google's AI products like Gemini — your regular Google Search indexing by Googlebot is unaffected.
Can I allow AI crawlers to read my content but prevent them from using it for training?
Partially. Some crawlers (like Google-Extended) are specifically for AI training versus search indexing. You can block the training-specific bot while allowing others. However, there is no universal opt-out mechanism across all AI platforms. The Robots Exclusion Protocol controls access, not usage rights — for that, you need TDM (Text and Data Mining) declarations in your terms of service.
What happens if I block all AI crawlers?
Your site will gradually disappear from AI-generated answers across ChatGPT, Claude, Gemini, and Perplexity. For retrieval-based bots, the effect is near-immediate. For training bots, existing knowledge persists until the model is retrained without your data. Net effect: you lose a growing traffic channel with no benefit unless you have specific licensing or paywall concerns.