
The Correct robots.txt Rules for GPTBot, ChatGPT-User, and OAI-SearchBot

A precise reference for writing robots.txt rules that OpenAI's crawlers will actually honor. Covers exact user agent strings, precedence rules, path-specific allows, and a working test harness.

Apr 11, 2026 · 12 min read · For developers and technical SEOs writing robots.txt by hand

Why 'correct' matters for AI bots specifically

robots.txt is an old standard with a lot of inconsistent interpretation. Googlebot has spent 20 years forgiving mistakes, normalizing whitespace, ignoring comments in unexpected places, and making reasonable guesses when rules conflict. OpenAI's crawlers are newer and stricter. A rule that Googlebot would charitably interpret as 'allow' can be read by GPTBot as 'block', and vice versa. Getting the syntax exactly right matters more here than it does for search engines.

The official documentation for each OpenAI crawler lives at platform.openai.com/docs/bots. It lists the user agent strings, the IP ranges the bots crawl from, and the parts of the robots.txt spec the crawlers actually respect. Anything beyond what's documented is undefined behavior — your rule might work today and stop working after a crawler update.

This post is a precise reference, not a tutorial. If you want the high-level 'how to allow AI crawlers' walkthrough, start with our dedicated guide. What follows is the exact rule syntax, the precedence logic, and the test commands to confirm each bot sees what you intend.

The exact user agent strings

OpenAI operates three separate crawlers, each with a distinct user agent and a distinct purpose. Getting the string exactly right is critical — a typo or missing hyphen means the rule silently applies to nothing.

  • GPTBot — Full UA: 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)'. Token for robots.txt matching: GPTBot. This bot collects content for training future GPT models.
  • ChatGPT-User — Full UA: 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)'. Token for robots.txt: ChatGPT-User (note the hyphen and exact capitalization). This is the live fetcher used when a user asks ChatGPT to browse or click a citation.
  • OAI-SearchBot — Full UA: 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)'. Token for robots.txt: OAI-SearchBot (again, hyphenated). Powers OpenAI's search features inside ChatGPT and SearchGPT.

Section precedence: which rule wins

The robots.txt specification says: for each crawler, only the most specific matching user agent group applies. A single crawler never inherits rules from multiple groups. If GPTBot matches a 'User-agent: GPTBot' section, it ignores any 'User-agent: *' section in the same file.

This is the rule that confuses most people. If you have 'User-agent: *' with 'Disallow: /admin' and then later 'User-agent: GPTBot' with 'Allow: /', GPTBot will not honor the admin disallow — it only follows its own named section. If you want GPTBot to respect both the admin block and a general allow, you need to put 'Disallow: /admin' inside the GPTBot section explicitly.
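As a concrete sketch of the safe pattern (the /admin path is illustrative), the wildcard rule is duplicated inside the named section:

```
User-agent: *
Disallow: /admin

# GPTBot ignores the section above entirely once this one matches,
# so the admin block must be repeated here.
User-agent: GPTBot
Allow: /
Disallow: /admin
```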

Section order within the file does not matter for selection — the crawler finds its most specific matching group regardless of where that group appears. Order does matter for the rules within a group: when multiple Allow and Disallow directives apply to the same URL, the most specific (longest) match wins. This is different from the 'first match wins' behavior some older crawlers use.

Path-specific rules: allowing some pages, blocking others

A common mistake is writing 'Allow: /blog' under 'User-agent: GPTBot' and assuming it blocks everything else. It doesn't. An Allow directive is permissive-only — adding it does not imply a blanket disallow on unlisted paths. To block everything except /blog you need both an 'Allow: /blog' and a 'Disallow: /' in the same section.

Another subtlety: longest-match wins. If you have 'Allow: /blog' and 'Disallow: /blog/drafts/', GPTBot will crawl /blog/public-post but honor the disallow on /blog/drafts/private. The more specific Disallow takes precedence. You can chain allows and disallows to carve out exactly the permission surface you want, as long as you keep track of specificity.
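Putting the two paragraphs above together, a GPTBot section that exposes only the blog while keeping drafts private looks like this (paths are illustrative):

```
User-agent: GPTBot
Allow: /blog
Disallow: /blog/drafts/
Disallow: /
```

Under longest-match precedence, /blog/drafts/ (13 characters) beats /blog (5 characters), which beats / (1 character) — so /blog/public-post is crawlable, /blog/drafts/private is blocked, and /about is blocked.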

Wildcards ('*') and end-of-path markers ('$') work in GPTBot's robots.txt parser, but test them before shipping. 'Disallow: /*.json$' blocks all JSON files. 'Disallow: /private/*' blocks everything under /private. Misplaced wildcards are a common cause of accidental blocks — a rule like 'Disallow: /*?' intended to block tracking parameters will also block legitimate query-based pages like /search?q=something.

A verified template

Here's a robots.txt structure that passes every correctness check and is the safest starting point for most sites. It allows all three OpenAI crawlers to access everything except two common private paths, and it places the OpenAI sections above any global wildcard block so they take precedence cleanly.

For each of GPTBot, ChatGPT-User, and OAI-SearchBot, write a four-line section in this order: a User-agent line with the token, an 'Allow: /' line, a 'Disallow: /admin/' line, and a 'Disallow: /account/' line. Then insert a blank line and repeat for the next bot. No inline comments (some parsers are inconsistent with them), no trailing whitespace, and no Windows line endings.
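Written out, the template described above looks like this. The /admin/ and /account/ paths stand in for the two example private paths — substitute your own — and the trailing wildcard section is optional, shown here to illustrate placing the OpenAI sections above it:

```
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /account/

User-agent: ChatGPT-User
Allow: /
Disallow: /admin/
Disallow: /account/

User-agent: OAI-SearchBot
Allow: /
Disallow: /admin/
Disallow: /account/

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /account/
```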

The biggest correctness improvements compared to typical robots.txt files: (1) named sections for each bot rather than relying on a wildcard, (2) both Allow and Disallow rules inside each section so permissions are explicit, and (3) the blank line between sections, which some older parsers require for section boundaries to be detected. A file that works for Googlebot can still fail these criteria; GPTBot is stricter.

Testing: prove your rules work

OpenAI does not currently offer a robots.txt tester like Google Search Console's. The best substitute is a two-step test: first, a syntactic check that your file parses correctly, and second, a live request test that the bot sees what you expect.

For the syntactic check, paste your robots.txt into a strict parser (Python's standard-library urllib.robotparser or Google's open-source robotstxt parser both work, and are closer to OpenAI's behavior than lenient browser-based testers). Run a test URL through the parser and assert the expected allow/disallow result. This catches typos and misplaced wildcards in seconds.
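A minimal sketch of the syntactic check using Python's standard-library urllib.robotparser — the rules and URLs here are illustrative, not your real file:

```python
from urllib.robotparser import RobotFileParser

# An illustrative policy: GPTBot may crawl /blog and nothing else.
robots_txt = """\
User-agent: GPTBot
Allow: /blog
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Assert the expected allow/disallow result for representative URLs.
assert rp.can_fetch("GPTBot", "https://example.com/blog/some-post")
assert not rp.can_fetch("GPTBot", "https://example.com/admin/")
print("robots.txt rules behave as expected")
```

Swap in your own file and the URLs you care about; a failing assertion pinpoints the rule that does not do what you intended.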

For the live test, curl the page with each of the three OpenAI user agent strings and check the response. A correctly allowed page returns 200 with your HTML. A correctly blocked page should return 200 with no content, or 404, or 403 — but NOT a challenge page from your CDN (that means network-layer blocking is overriding your robots.txt). The cleanest end-to-end test is to fetch a page that you know should be allowed and a page you know should be blocked, and confirm both responses match expectations.
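The live test can be wrapped in a small shell helper — the domain and paths below are placeholders for your own site, and the expected statuses are the ones described above:

```shell
#!/bin/sh
# Print the HTTP status code a given bot user agent receives for a URL.
check_ua() {
  # $1 = user agent string, $2 = URL
  curl -s -o /dev/null -w '%{http_code}' -A "$1" "$2"
}

# Usage against your own site (placeholders shown):
#   check_ua "GPTBot"        "https://www.example.com/blog/some-post"  # expect 200
#   check_ua "ChatGPT-User"  "https://www.example.com/blog/some-post"  # expect 200
#   check_ua "OAI-SearchBot" "https://www.example.com/admin/"          # expect 403 or 404
# A CDN challenge page on a path your robots.txt allows means a
# network-layer block is overriding your robots.txt rules.
```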

Execution Checklist

  • Use the exact user agent tokens: GPTBot, ChatGPT-User, OAI-SearchBot — case and hyphens must match.
  • Give each bot its own named section; do not rely on 'User-agent: *' to cover them.
  • Remember that named sections override wildcard sections — duplicate any wildcard Disallow rules inside the bot section if you want them to apply.
  • Use longest-match specificity to carve out sub-path exceptions within allowed directories.
  • Separate each user agent group with a blank line; avoid inline comments inside groups.
  • Validate the file with a strict robots.txt parser before deploying.
  • Run live curl tests with each user agent against both allowed and blocked paths to confirm end-to-end behavior.

FAQ

Is the user agent token case-sensitive?

robots.txt is generally case-insensitive for user agent matching, and OpenAI documents its tokens in the forms GPTBot, ChatGPT-User, and OAI-SearchBot. For maximum safety, match the documented capitalization exactly. Some third-party parsers and older tools do treat the token as case-sensitive, so writing 'gptbot' in lowercase can cause intermittent issues.

Do I need to include the full user agent string or just the token?

Just the token. A line like 'User-agent: GPTBot' is correct — GPTBot will match this section when it crawls, based on the 'GPTBot' substring in its full user agent. You never need to paste the entire UA string into robots.txt.

Can I use 'Crawl-delay' to rate-limit GPTBot?

OpenAI does not officially honor Crawl-delay. If you're concerned about GPTBot's crawl rate, the supported approach is to block specific high-cost paths (search results, faceted navigation) via Disallow directives, or rate-limit at the CDN layer. Crawl-delay will be silently ignored.

What happens if my robots.txt has a syntax error?

OpenAI's crawlers will try to parse as much of the file as they can and ignore lines they don't understand. This sounds safe but it's dangerous: a malformed user agent line can cause a whole group to be dropped, falling back to a less-restrictive wildcard section. Always validate with a strict parser rather than trusting lenient behavior.

