Block AI Crawlers With robots.txt: GPTBot, ChatGPT, Others
AI companies are training large language models on web content, and their crawlers visit millions of websites to collect that training data. As a website owner, you have the right to opt out. The robots.txt file is the standard mechanism for blocking AI training crawlers, and most major AI companies have publicly documented their crawler user-agent strings. This guide lists every major AI crawler, explains what each one does, and gives you the exact robots.txt rules to block any or all of them.
Why Block AI Crawlers?
AI companies scrape publicly accessible websites to build training datasets for their language models. This raises several legitimate concerns for website owners:
- Content ownership: If you are a writer, journalist, artist, or business, your original content is your intellectual property. When an AI model is trained on it, that model can generate text that closely mimics your style, reproduce your factual research, or synthesize information from your content without attribution or compensation. Many content creators want to opt out of contributing involuntarily to commercial AI training.
- Competition: AI-powered search features (like Google AI Overviews and ChatGPT web browsing) increasingly answer questions directly rather than sending users to source websites. If an AI crawler harvests your content to answer questions, users who would have visited your site may instead get their answer from the AI without ever clicking through. Some publishers see this as an existential traffic threat.
- Server load: Large-scale AI crawls can consume significant bandwidth and increase hosting costs, especially when a site is crawled aggressively.
- Data privacy: Some sites publish content that, while publicly visible, the operator prefers to keep out of AI training sets, such as internal community content, user-generated content, or content intended for a specific audience.
Blocking AI crawlers does not affect your site's visibility in traditional search results. Google's standard search crawler (Googlebot) is controlled separately from its AI training opt-out token (Google-Extended). You can block AI training crawlers while keeping your site fully indexed and visible in Google Search.
The Complete List of AI Crawler User-Agents
Here are the documented user-agent strings for major AI training and AI feature crawlers as of 2026.
OpenAI:
- User-agent: GPTBot. Used for training ChatGPT and future OpenAI models. Documented at openai.com/gptbot.
- User-agent: ChatGPT-User. Used for the ChatGPT web browsing feature (real-time fetching on behalf of a user, not training data collection).
- User-agent: OAI-SearchBot. OpenAI's search indexing bot for ChatGPT search.
Google:
- User-agent: Google-Extended. Google's AI training control, separate from standard Googlebot. It is a robots.txt token rather than a distinct crawler: disallowing it tells Google not to use content fetched by its existing crawlers to train Gemini models, without affecting Google Search indexing.
- User-agent: Googlebot. Standard Google search crawler. Do NOT block this unless you want to remove your site from Google Search entirely.
Meta:
- User-agent: FacebookBot. Used by Meta to collect web data for its AI language models.
Common Crawl:
- User-agent: CCBot. Common Crawl is a non-profit that publishes open web crawls widely used as AI training data by many organizations, including OpenAI, Google, and others.
Anthropic:
- User-agent: ClaudeBot. Anthropic's web crawler that collects content for AI training.
- User-agent: anthropic-ai. An older token associated with Anthropic's crawling that block lists still commonly include.
Other AI crawlers:
- User-agent: Applebot-Extended. Apple's AI training control (separate from standard Applebot for Siri/Spotlight).
- User-agent: Diffbot. Used by Diffbot's knowledge graph and AI data extraction.
- User-agent: ImagesiftBot, PerplexityBot, YouBot, cohere-ai. Various other AI platforms with web crawlers.
robots.txt Rules to Block AI Crawlers
Here are ready-to-use robots.txt configurations for different levels of AI crawler blocking.

Option 1: Block all AI training crawlers, keep standard search crawlers. This is the most common approach: block training bots while maintaining full Google Search and Bing Search visibility.

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: Diffbot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: *
    Disallow: /admin/

    Sitemap: https://yourdomain.com/sitemap.xml

Option 2: Block all bots except Google and Bing (maximum restriction).

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Bingbot
    Allow: /

    Sitemap: https://yourdomain.com/sitemap.xml

Option 3: Block only OpenAI crawlers (minimal AI restriction).

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: *
    Disallow: /admin/

    Sitemap: https://yourdomain.com/sitemap.xml
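One way to sanity-check rules like the configurations above before deploying them is to parse them locally. The following is a minimal sketch using Python's standard urllib.robotparser module against a shortened version of Option 1. The RULES string, the helper names, and the example.com URLs are illustrative assumptions, and Python's parser may differ in edge cases from the matching logic each crawler actually implements.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of Option 1 (illustrative; paste your own rules here).
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text locally, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(RULES)
for agent in ("GPTBot", "Google-Extended", "Googlebot"):
    for url in ("https://example.com/", "https://example.com/admin/login"):
        verdict = "allowed" if is_allowed(parser, agent, url) else "blocked"
        print(f"{agent:16} {url:34} {verdict}")
```

With these rules, GPTBot and Google-Extended should be reported as blocked everywhere, while Googlebot falls through to the wildcard group and is blocked only under /admin/.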
How Effective Is robots.txt Blocking for AI Crawlers?
The effectiveness of robots.txt blocking for AI crawlers depends on each company's compliance. Major, reputable AI companies publicly commit to respecting robots.txt. OpenAI has documented GPTBot and stated it respects robots.txt. Google has documented Google-Extended and confirmed it respects the Robots Exclusion Protocol. Anthropic respects robots.txt for its crawlers.
However, compliance is voluntary. Smaller or less reputable AI operators may not honor robots.txt. Academic and research scrapers, data brokers, and some content aggregators routinely ignore it. robots.txt is an honor system, not a technical barrier.
robots.txt also cannot act retroactively. It can only prevent future crawling and future training data collection. If your content has already been scraped and included in a training set, robots.txt changes will not remove it from that dataset.
For a stronger signal, some AI companies provide opt-out mechanisms beyond robots.txt. OpenAI, Meta AI, and some others describe additional opt-out or permission procedures in their documentation. For maximum coverage, combine robots.txt blocking with any available platform-specific opt-out processes.
Monitoring is difficult: you cannot verify in real time whether a specific AI company is respecting your robots.txt. What you can do is review your web server access logs and look for user-agent strings matching known AI crawlers to see whether they are still sending requests after you have disallowed them.
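The access-log review described above can be partly automated. The sketch below counts requests per AI crawler, assuming logs in the common "combined" format where the user-agent is the last double-quoted field. The agent list, the function name, and the sample log lines (which use documentation-reserved IP addresses and simplified user-agent strings) are made up for illustration. Keep in mind that user-agent strings can be spoofed, so a match is an indication rather than proof.

```python
import re
from collections import Counter

# Substrings to look for in the user-agent field (illustrative subset).
AI_AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "CCBot",
             "ClaudeBot", "anthropic-ai", "PerplexityBot")

# In the combined log format, the user-agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines, agents=AI_AGENTS):
    """Count requests whose user-agent contains a known AI crawler name."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the assumed format
        user_agent = match.group(1).lower()
        for agent in agents:
            if agent.lower() in user_agent:
                hits[agent] += 1
                break  # count each request once
    return hits


# Fabricated sample lines, for demonstration only.
SAMPLE = [
    '203.0.113.5 - - [10/Jan/2026:12:00:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2026:12:00:05 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:12:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "CCBot/2.0"',
    '198.51.100.9 - - [10/Jan/2026:12:02:00 +0000] "GET / HTTP/1.1" 200 900 "-" "ExampleBrowser/1.0"',
]

for agent, count in count_ai_hits(SAMPLE).most_common():
    print(f"{agent}: {count}")
```

To run this against a real log, pass an open file object instead of SAMPLE, since the function accepts any iterable of lines. Hits that continue well after a Disallow rule was published are the ones worth investigating.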
Frequently Asked Questions
- Does blocking Google-Extended affect my Google Search rankings?
- No. Google-Extended is separate from the standard Googlebot crawler. Blocking Google-Extended only tells Google not to use your content to train its Gemini AI models. It does not affect how Googlebot indexes your pages for Google Search results, and it does not control AI Overviews, which are part of Google Search and follow Googlebot rules. Your search rankings, crawl frequency, and indexation status are entirely controlled by Googlebot, which ignores Google-Extended rules. You can safely block Google-Extended while maintaining full Google Search visibility.
- If I block AI crawlers in robots.txt, will my site be excluded from AI answers?
- Not necessarily — it depends on the type of AI crawler you block and when. Blocking training crawlers like GPTBot prevents OpenAI from using your future content in new training datasets, but it does not remove your content from already-built models. Blocking ChatGPT-User prevents the ChatGPT browsing feature from visiting your site in real time, which may reduce how often ChatGPT cites your content for current-events queries. Some AI products may still access content they have already indexed before you added the block.
- Where can I find an up-to-date list of AI crawler user-agents?
- The most reliable source is each AI company's official documentation. OpenAI documents GPTBot at openai.com/gptbot. Google documents Google-Extended in its crawler documentation at developers.google.com. Community-maintained lists on GitHub, such as the ai.robots.txt project, also track known AI crawler strings. Because new AI products and crawlers appear frequently, check these sources periodically. The list in any static guide (including this one) may become incomplete over time.
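Because the crawler list changes over time, it can help to keep the agent names in one place and generate the rules from them instead of editing robots.txt by hand. The following is a minimal sketch of that idea. The function name, its parameters, and the agent list are hypothetical, and the output mirrors the layout of Option 1 in this guide.

```python
def build_robots_txt(blocked_agents, disallow_for_all=("/admin/",),
                     sitemap_url=None):
    """Build robots.txt text that fully blocks each agent in blocked_agents.

    All other crawlers fall under the wildcard group, which only
    disallows the paths listed in disallow_for_all.
    """
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]

    lines.append("User-agent: *")
    if disallow_for_all:
        lines += [f"Disallow: {path}" for path in disallow_for_all]
    else:
        lines.append("Disallow:")  # an empty Disallow permits everything

    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Maintain this list as crawlers are added or renamed (illustrative subset).
AI_AGENTS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot"]

print(build_robots_txt(AI_AGENTS,
                       sitemap_url="https://yourdomain.com/sitemap.xml"))
```

Updating the block list then means editing AI_AGENTS and regenerating the file, which keeps every group consistently formatted.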