AI Crawlers: Block or Allow Is a Choice, Not a Default

Should you block GPTBot? The question is slightly off. AI crawlers aren't something to block by reflex — they're something to choose deliberately, based on your business model. Check your current setup first, understand what you gain and give up, then decide per bot. This is about judgment, not fear marketing.

Why "just block everything" isn't the answer.

robots.txt is a voluntary standard that requests compliance, not a technical barrier. The standard (RFC 9309) says crawlers "are requested to honor" the rules and that "these rules are not a form of access authorization" (IETF RFC 9309, 2022). Google's own docs say robots.txt instructions cannot enforce crawler behavior (Google Search Central, 2026). Well-behaved crawlers obey; non-compliant ones can ignore it.

And bots differ by purpose. There are separate agents for training, for search visibility, and for fetching a page live when a user asks a question. Block them as one lump and you also close the path to being cited in AI answers. Blocking doesn't even guarantee full removal from an index: Perplexity states that even when blocked via robots.txt, it "may still index the domain, headline, and a brief factual summary" (Perplexity, 2026).

Is your site blocking AI bots right now?

The fastest check is to open yourdomain/robots.txt in your browser. Read the User-agent: GPTBot lines and the Disallow rules beneath them, and you'll see your current state. If no group names an AI bot token, that bot falls back to the User-agent: * group when one exists, so you are blocking it only if that group disallows the path.

Reading it is simple. User-agent: names the bot the rule speaks to, Disallow: is the path to keep out, Allow: is the path to permit. Disallow: / means everything; Disallow: /admin/ means only that path. To check structural and technical signals in one pass, run zupzup across its four axes (SEO, GEO, AEO, accessibility).
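The reading rules above can be sketched in Python with the standard-library urllib.robotparser module. This is a minimal sketch rather than a definitive checker: the sample file and the blocked_tokens helper are hypothetical, and the token list is simply the set of names this post mentions. One caveat: Python's parser applies the first matching rule in file order, whereas RFC 9309 specifies the most specific (longest) match, so results can differ on files that mix Allow and Disallow for overlapping paths.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in this post; extend the tuple as needed.
AI_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
)


def blocked_tokens(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Map each AI token to True if robots_txt disallows `path` for it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: not parser.can_fetch(token, path) for token in AI_TOKENS}


# Hypothetical file: one training bot blocked, /admin/ closed to everyone.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

status = blocked_tokens(SAMPLE)
print(status["GPTBot"], status["OAI-SearchBot"])  # prints: True False
```

Passing a different path, for example blocked_tokens(SAMPLE, "/admin/"), shows how the wildcard group applies to bots that have no group of their own.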

Block or allow — let your business model decide.

There's no single right answer. Allowing earns you the exposure path of AI-answer citations; blocking earns you content protection and lower server load. Decide by what your real asset is.

| Choice | You gain | You give up | Usually fits |
| --- | --- | --- | --- |
| Allow all | AI-answer citation, discoverability | Content may be used for training | Commerce, local, SaaS |
| Block training only (keep search) | Training opt-out while keeping the citation path | More per-bot setup | Media, publishers |
| Block all | Content protection, less load | Discoverability (still not total removal) | Sensitive, internal content |

Media and publishers, whose content is the asset, may opt out of training while keeping search bots open to preserve citation traffic. Commerce and local businesses, for whom AI-answer citations are an exposure path, lean toward allowing. SaaS and B2B, where docs and blogs are a lead path, usually allow and block only sensitive paths. For a portfolio or personal site it's a matter of taste. These are axes for judgment, not a prescription.

robots.txt examples, per bot.

The key point is that bots are independent. Training bots and search bots are separate tokens, so blocking one doesn't block another. You set each on purpose. OpenAI notes that "each setting is independent": a webmaster "can allow OAI-SearchBot to appear in search results while disallowing GPTBot" for training (OpenAI, 2026).

| Company | Tokens | Purpose | robots.txt |
| --- | --- | --- | --- |
| OpenAI | GPTBot / OAI-SearchBot / ChatGPT-User | training / search / user fetch | controlled independently |
| Anthropic | ClaudeBot / Claude-User / Claude-SearchBot | training / user / search | blocked separately |
| Perplexity | PerplexityBot / Perplexity-User | search index / user fetch | PerplexityBot honors it; Perplexity-User generally ignores it |
| Google | Google-Extended | Gemini/Vertex training opt-out | no effect on Search or ranking (separate from Googlebot) |

Bot definitions and blocking methods follow each vendor's official docs: Anthropic's ClaudeBot, Claude-User, and Claude-SearchBot (Anthropic, 2026), Perplexity's PerplexityBot and Perplexity-User (Perplexity, 2026), and Google-Extended (Google Search Central, 2026). To block training bots while keeping search visibility:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

To block everything, add a Disallow: / for the search tokens too (OAI-SearchBot, PerplexityBot, and so on). Anthropic also supports the non-standard Crawl-delay extension, so you can slow a bot instead of blocking it (User-agent: ClaudeBot / Crawl-delay: 1). OpenAI frames disallowing GPTBot as something that "indicates" your content should not be used for training — a signal of intent more than a hard block.
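If you manage these rules for several tokens, generating the file from a small policy can reduce copy-paste slips. The sketch below is one hypothetical way to do that: render_robots and the example policy are illustrative names, not part of any vendor's tooling. It writes one group per token and then reads the result back with the standard-library parser as a sanity check. Crawl-delay is a non-standard extension, so whether a given bot honors it depends on that vendor.

```python
from urllib.robotparser import RobotFileParser


def render_robots(block: list[str], delay: dict[str, int]) -> str:
    """Build robots.txt text: one group per token, blank line between groups."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in block]
    groups += [
        f"User-agent: {token}\nCrawl-delay: {value}"
        for token, value in delay.items()
    ]
    return "\n\n".join(groups) + "\n"


# Hypothetical policy: block OpenAI's training and search tokens, slow ClaudeBot.
text = render_robots(["GPTBot", "OAI-SearchBot"], {"ClaudeBot": 1})
print(text)

# Round-trip through the parser to confirm the file says what was intended.
parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("OAI-SearchBot", "/"), parser.crawl_delay("ClaudeBot"))
```

The round-trip only confirms what the file states; it says nothing about whether a crawler actually obeys it.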

After you set it, verify it.

Reopen robots.txt and confirm the lines match your intent, bot by bot. But because robots.txt is a request, it applies to compliant bots and can be ignored by non-compliant ones. "I set it" and "it's blocked" are not the same statement.

The verify routine is three steps: re-check → compare against intent per bot → run zupzup's four axes. Remember that tokens like Perplexity-User, driven by a user's request, generally ignore robots.txt, so robots alone won't stop them. If you want to see whether your site actually shows up in AI answers, read the companion piece, Is your site showing up in AI search? A 5-minute check routine.
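The "compare against intent per bot" step can be sketched as a small diff between what you meant and what the file says. This is an illustration under stated assumptions: the INTENT mapping, the LIVE file, and the mismatches helper are all hypothetical, and the check reads only the robots.txt text, so it cannot tell you whether a given crawler actually complies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical intent: True means "this token should be blocked from /".
INTENT = {
    "GPTBot": True,
    "ClaudeBot": True,
    "Google-Extended": True,
    "OAI-SearchBot": False,
    "PerplexityBot": False,
}


def mismatches(robots_txt: str, intent: dict[str, bool]) -> list[str]:
    """Return tokens whose parsed state differs from the intended state."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        token
        for token, want_blocked in intent.items()
        if (not parser.can_fetch(token, "/")) != want_blocked
    ]


# Hypothetical live file with one slip: ClaudeBot was never added.
LIVE = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

print(mismatches(LIVE, INTENT))  # prints: ['ClaudeBot']
```

An empty list means the file matches your intent for every token you listed, which is the most a robots.txt check can establish.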

Block or allow — make it a choice.

The order is four steps: check → decide → set → verify. robots.txt is the start of control, not the end of it (its enforcement has limits). Instead of "block everything" fear, choose what fits your business model. Blocking or allowing — let it be a choice you made knowingly.

References

  1. OpenAI, 2026 — Bots (GPTBot / OAI-SearchBot / ChatGPT-User)
  2. Anthropic, 2026 — Block Anthropic crawlers (ClaudeBot / Claude-User / Claude-SearchBot)
  3. Perplexity, 2026 — Perplexity Crawlers (PerplexityBot / Perplexity-User)
  4. Perplexity, 2026 — How does Perplexity follow robots.txt
  5. Google Search Central, 2026 — Google crawlers (Google-Extended)
  6. Google Search Central, 2026 — robots.txt introduction (cannot enforce)
  7. IETF RFC 9309, 2022 — Robots Exclusion Protocol
  8. Vercel × MERJ, 2025 — The rise of the AI crawler
