How to Block AI Bots with robots.txt (and When You Actually Should)
The legitimate reasons to block AI crawlers, the robots.txt syntax that actually works, and the trade-offs you're accepting in exchange. Covers GPTBot, ClaudeBot, PerplexityBot, Bytespider, and CCBot.
Most businesses should not block AI bots. If you should, this guide is for you.
For the majority of sites, blocking AI bots is a self-inflicted visibility wound. The tangible upside — preventing your content from being used in training data — is small compared to the cost of being invisible in ChatGPT, Claude, Perplexity, and other AI surfaces that now drive meaningful discovery traffic. Before you block, be honest about whether you have a real reason or just a reflex from reading a 2023 publisher manifesto.
That said, there are legitimate cases. Licensed content publishers (news, research, media libraries) have contractual obligations that prohibit unrestricted reuse. Businesses whose core product IS their content (training courses, paywalled databases, original reporting) have a rational reason to guard it. Sites operating under specific regulatory constraints may be prohibited from consenting to AI use. If you're in one of those categories, this guide gives you the exact syntax and trade-offs.
What this post won't do is pretend blocking is free. Blocking means losing AI-driven referral traffic from Perplexity, losing citation surfaces in ChatGPT Browse, and losing any training-data inclusion that might later drive organic mentions. Every block is a trade, and you should make the trade with both sides of it in view.
The legitimate reasons to block (and the bad ones)
Good reasons to block AI bots: your content is licensed from third parties whose terms prohibit AI training use; you run a paywall or subscription product where content access is the product; you publish original research or investigative journalism where unauthorized reproduction undercuts your business; you have a legal or regulatory requirement to control downstream use.
Weaker reasons that people cite but usually don't survive analysis: 'I don't want AI to steal my ideas' (AI models rarely reproduce content verbatim, and the legal status of training use is still being decided), 'I'll lose traffic to AI summaries' (you lose more traffic by being invisible than by being summarized), 'I'm making a principled stand' (a reasonable personal choice, but not a business strategy).
The clearest test: if your business model requires readers to come to your site and pay to access content, blocking is rational. If your business model requires readers to discover your brand and buy a product or service that lives elsewhere, blocking usually hurts you more than it helps.
The minimal robots.txt block that actually works
If you decide to block, block explicitly by name. Do not rely on a 'User-agent: * / Disallow: /' block to catch AI crawlers — some of them honor it, some don't, and the ambiguity means you can't tell whether your block is working without per-bot testing.
For each bot you want to block, write a dedicated section: 'User-agent: GPTBot' followed by 'Disallow: /'. Repeat for each of ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, Bytespider (ByteDance), CCBot (Common Crawl), Applebot-Extended (Apple Intelligence), and Google-Extended (Gemini and Google AI).
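Assembled into a file, those named sections look like the following. This is a sketch of a full-block configuration covering the bots listed above; add or drop sections to match your own decision:

```txt
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Each bot matches its own named section, so you can later flip any single bot to allowed without touching the others.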
Place these sections above any wildcard rules in your file. Named sections take precedence over wildcard sections for their specific bot, so putting them first is the conventional (and safer) ordering.
Bots you probably haven't heard of but should decide about
The 'big five' AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider) get most of the attention, but there are several smaller bots that can add up. A partial block is often worse than either a full allow or a full block — it leaves you invisible in some AI products while still being crawled by others with no strategic reason behind the distinction.
- Applebot-Extended — Apple's signal for opting out of Apple Intelligence training. Separate from the main Applebot (which does Siri and Spotlight). If you're blocking OpenAI but allowing Apple Intelligence, you should know that's the choice you're making.
- Bytespider — ByteDance's crawler, feeds Doubao and other ByteDance AI products. Very active, especially in Asia-Pacific markets. Some sites block it specifically because of crawl volume rather than content licensing concerns.
- CCBot — Common Crawl's crawler, whose output feeds dozens of downstream AI training datasets. Blocking CCBot affects not just one product but many research and open-source AI projects. Often overlooked in block lists.
- Diffbot, FacebookBot, Amazonbot — Crawlers from companies that may be building internal AI products. Their visibility impact is lower than the big five but they're commonly included in comprehensive block lists.
- Omgilibot, Meltwater, DataForSeoBot — SEO and market intelligence crawlers. Usually not what people mean when they say 'AI bot', but they show up in bot block lists and are worth a deliberate decision.
Partial blocks: blocking training while allowing retrieval
A useful middle ground is to block training crawlers (the ones that collect content for future model training) while allowing retrieval crawlers (the ones that fetch pages in real time during user queries). This preserves your ability to be cited in AI answers without committing your content to training data.
The mapping: block GPTBot but allow ChatGPT-User and OAI-SearchBot. Block ClaudeBot's training operation but allow its retrieval operation (currently they share a user agent, so this distinction is limited for Claude). Block CCBot but allow PerplexityBot (Perplexity is almost entirely retrieval-based). Block Google-Extended but allow Googlebot (which handles search indexing separately).
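As a robots.txt sketch, that hybrid mapping looks like this. The retrieval agents are given explicit `Allow` lines for readability, though with no matching `Disallow` they would be allowed by default anyway:

```txt
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```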
This hybrid approach requires more vigilance. As crawlers evolve, the line between training and retrieval can blur — a bot that was retrieval-only can start contributing to training, or vice versa. If you go this route, re-audit your block list every few months against the current published behavior of each crawler.
What a block actually accomplishes (and doesn't)
A block prevents the specific bot from crawling your site going forward. It does NOT retroactively remove your content from training data if the bot has already crawled you. It does NOT remove mentions of your content that appear because other, already-crawled sites link to or quote you. It does NOT prevent AI models from answering questions about your brand from knowledge learned before the block was added.
For retroactive removal, you need to file opt-out requests with each AI provider directly. OpenAI, Anthropic, and Google all have opt-out forms (sometimes requiring proof of ownership). These are separate from robots.txt and function as requests rather than guaranteed removals — the provider decides how and when to comply.
The honest framing: a block is a forward-looking consent signal, not a deletion tool. If your goal is 'nothing I publish from this point forward is used for AI training', a block works. If your goal is 'scrub every trace of my content from every AI model', robots.txt cannot deliver that — you need legal and opt-out channels as well.
Execution Checklist
- Write down your reason for blocking and pressure-test it against the visibility trade-off.
- Block by named user agent, not by wildcard — named rules are the only way to verify per-bot compliance.
- Decide explicitly about Applebot-Extended, Bytespider, CCBot, and other less-famous crawlers.
- Consider a hybrid: block training crawlers but allow retrieval crawlers for AI citation presence.
- Verify your robots.txt rules with a parser, and test any server-level blocks with curl plus the bot's user agent.
- File opt-out requests with OpenAI, Anthropic, and Google if you want retroactive removal.
- Re-audit your block list every quarter — bot behavior and crawler identities shift frequently.
FAQ
Will blocking GPTBot remove me from ChatGPT's answers?
Not immediately, and not completely. Blocking GPTBot prevents future training crawls, but content already in the training set stays until the model is retrained. To affect the live browsing experience, you also need to block ChatGPT-User. Even then, ChatGPT can still mention your brand from training data that predates your block. If you need faster removal, file an opt-out request with OpenAI directly in addition to the robots.txt block.
Is there a legal standard for blocking AI training?
Not yet. The robots.txt standard is a voluntary convention — bots choose whether to honor it. Major AI providers (OpenAI, Anthropic, Google) have publicly committed to respecting robots.txt for their named bots, but enforcement is based on their commitments rather than law. The legal status of using web content for AI training is being tested in courts in several jurisdictions in 2025 and 2026, and the answer is not settled yet.
Can I block AI bots but keep allowing Googlebot for search rankings?
Yes, and this is the correct approach for most 'block AI but not Google' scenarios. Googlebot (for search ranking) is a separate user agent from Google-Extended, which controls whether your content is used for Gemini training and grounding. Block Google-Extended to opt out of Gemini, but leave Googlebot allowed to maintain search presence. One caveat: AI Overviews is a Search feature served by Googlebot, so blocking Google-Extended does not remove you from it. The same pattern works for Applebot vs Applebot-Extended.
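In robots.txt terms, the split is just two named sections — the `-Extended` agents blocked, while Googlebot and Applebot are left unmatched (and therefore allowed by default):

```txt
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# No section for Googlebot or Applebot: both remain allowed by default.
```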
What's the fastest way to tell if my blocks are actually working?
There are two different checks, because robots.txt does not change what your server returns. A compliant bot reads /robots.txt and simply never requests disallowed pages, so curling a page with the bot's user agent will still return 200 even when your rules are correct. To verify the rules themselves, fetch /robots.txt and run your target URLs and user agents through a robots.txt parser, or watch server logs to confirm the bot's requests stop. Curl with the bot's user agent is the right test for server-level blocks — firewall or user-agent rules that return 403 — which are a separate layer. And remember the limit of both checks: robots.txt only stops bots that honor it; low-quality scrapers ignore it entirely and can't be stopped this way.
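Beyond curl, the robots.txt rules themselves can be checked programmatically. A minimal sketch using Python's standard-library urllib.robotparser — the rules and URLs here are placeholders, not a real site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: GPTBot blocked, ChatGPT-User allowed.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = RobotFileParser()
# parse() accepts the file's lines directly, so rules can be tested
# from a string before (or instead of) fetching a live /robots.txt.
rp.parse(rules.splitlines())

url = "https://example.com/articles/some-post"
print("GPTBot allowed:", rp.can_fetch("GPTBot", url))              # False
print("ChatGPT-User allowed:", rp.can_fetch("ChatGPT-User", url))  # True
```

To audit a live site, call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of `parse()`. Some crawlers interpret edge cases (wildcards, rule precedence) differently than Python's parser, so treat this as a sanity check rather than a guarantee.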