How to Block Web Crawlers From Your Website
AI crawler traffic quadrupled in 2025, with DataDome detecting 1.7 billion requests from OpenAI’s crawlers in a single month. Yet most digital companies have no way of safely managing this traffic. DataDome’s 2025 Global Bot Security Report tested nearly 17,000 high-traffic domains and found that 61% were completely unprotected against simple bot attacks.
The traditional playbook does not work here. Robots.txt was built for a web where crawlers identified themselves and followed the rules. AI bots scrape content without permission, spoof their identity, drain server resources, and expose proprietary data. Blocking them requires a different approach.
Businesses need multi-layer defenses to protect their sites, including a bot and agent trust management platform that can detect crawlers and AI traffic by behavior and intent rather than identity alone.
In this guide, we cover what AI bots are, why they are harder to block than regular crawlers, and seven methods to stop them, from simple configuration changes to advanced bot management.
What are AI bots and how do they differ from regular crawlers?
A web crawler is an automated program that visits websites to collect data. Traditional crawlers like Googlebot and Bingbot index content for search engines. They follow predictable patterns, identify themselves honestly, and generally respect robots.txt rules.
AI bots scrape content to train large language models (LLMs) or power real-time AI services like chatbots and search assistants. The major AI crawlers include:
| AI crawler | Operator | User-agent string | Purpose |
| --- | --- | --- | --- |
| GPTBot | OpenAI | GPTBot/1.0 | Model training |
| ChatGPT-User | OpenAI | ChatGPT-User | Real-time search/chat |
| ClaudeBot | Anthropic | ClaudeBot/1.0 | Model training |
| Google-Extended | Google | Google-Extended | Gemini AI training |
| CCBot | Common Crawl | CCBot/2.0 | Open dataset for AI training |
| PerplexityBot | Perplexity | PerplexityBot | AI search |
| Meta-ExternalAgent | Meta | Meta-ExternalAgent | AI training/features |
| Bytespider | ByteDance | Bytespider | AI training |
The practical difference comes down to behavior. Search engine crawlers index your content and send traffic back to your site. AI crawlers take your content, use it to train models or generate AI responses, and rarely send visitors your way. Cloudflare data shows that for every referral Anthropic sends back to a website, its crawlers visit roughly 38,000 pages. OpenAI’s ratio is about 400:1.
AI crawlers also hit different parts of your site. DataDome’s 2025 Global Bot Security Report found that AI bots frequently target high-value endpoints: 64% of AI bot traffic goes to form pages, 23% to login pages, and 5% to checkout flows. These are not pages you want uncontrolled bots accessing.
Do AI bots respect robots.txt?
Most major AI companies claim that their bots respect robots.txt, but in practice many bots do not. Robots.txt is a voluntary protocol with no technical enforcement. A crawler that ignores your robots.txt file faces no barrier; it simply reads the page anyway.
Some AI operators are transparent about their crawlers and follow the rules. Google, for example, publishes the IP ranges its crawlers use and officially honors robots.txt rules that target the Google-Extended token. But there is a gap between policy and practice.
DataDome’s research on AI agent policy enforcement found significant cracks in the system. Nearly 89% of domains now disallow GPTBot in their robots.txt files, indicating that businesses are increasingly attempting to block AI traffic over concerns about content theft. But this has done little to reduce unwanted AI traffic, because many crawlers simply ignore the directive, or do not identify themselves at all.
The spoofing problem makes this worse. DataDome’s AI Traffic Report found that known AI agents are being actively impersonated. Meta-ExternalAgent was the most impersonated agent across the first two months of 2026, with 16.4 million spoofed requests. ChatGPT-User saw 7.9 million spoofed requests. PerplexityBot had the highest rate of impersonation, with nearly 2.4% of requests found to be fraudulent. This means that without a sufficient bot and agent trust management tool, you lack visibility into who’s actually behind your AI traffic.
Why you should block unwanted crawlers from your website
Not every crawler is bad. Googlebot helps your site rank in search results. Some AI crawlers may drive traffic through AI-powered search tools. But uncontrolled crawling creates real problems:
- Content theft. AI companies use your content to train models that compete with you. Google reportedly pays Reddit $60 million a year to license user-generated content. Most websites get nothing.
- Server load. AI crawlers generate enormous request volumes. Anthropic’s ClaudeBot has been reported to crawl some websites so aggressively that the traffic resembles a DDoS attack and can take them offline.
- Data exposure. AI bots hitting login pages, checkout flows, and form pages can extract sensitive business logic, pricing structures, and user-facing data that was never intended for bulk collection.
- Distorted analytics. When a significant portion of your traffic comes from unidentified bots, your conversion rates, bounce rates, and engagement metrics become unreliable. Crédit Agricole Personal Finance & Mobility (CAPFM) found that bot traffic was distorting their analytics across multiple web properties. After deploying DataDome, they restored clean, reliable analytics and cut malicious traffic by nearly 40%.
How to block web crawlers and AI bots: 7 methods from basic to advanced
Each method below adds a layer of protection. No single method is sufficient on its own. We start with the simplest and work toward comprehensive solutions.
1. Robots.txt disallow rules
Robots.txt is the most basic way to tell crawlers not to visit your site. Add disallow rules for each AI crawler you want to block. Here is a robots.txt configuration that blocks the most common AI crawlers:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
Place this file at the root of your domain (e.g., yoursite.com/robots.txt).
Important: Blocking Google-Extended does not affect your Google Search rankings. It only prevents Google from using your content for Gemini AI training. Googlebot (which handles search indexing) is a separate crawler.
Why this is not enough: Robots.txt is a suggestion, not a wall. Any bot can read the file and then access the pages anyway. It also does nothing to stop bots that do not identify themselves or that spoof another crawler’s name.
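The robots.txt rules in this section can be sanity-checked locally before you publish them. The following is a minimal sketch using Python's standard `urllib.robotparser`; the shortened rule set, the `is_allowed` helper, and the example.com URLs are illustrative assumptions, not part of any crawler's official tooling.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the rules above (assumption: same syntax, fewer agents).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-following crawler with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "Googlebot", "Mozilla/5.0"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/pricing")
        print(f"{agent}: {'allowed' if verdict else 'disallowed'}")
```

Note that the check only describes what a compliant crawler would do. A scraper that announces itself with a browser-like name such as `Mozilla/5.0` matches none of the rules, which is the limitation described above.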
2. Meta tags and HTTP headers
You can add meta tags to individual pages or use HTTP response headers to signal that your content should not be used for AI training. Add this to your HTML <head> to tell AI crawlers not to use your content:
<meta name="robots" content="noai, noimageai">
Or add HTTP response headers at the server level:
X-Robots-Tag: noai, noimageai
Some publishers also use the emerging ai.txt standard, which works similarly to robots.txt but specifically addresses AI usage rights.
Why this is not enough: Like robots.txt, these are signals that depend on the crawler choosing to comply, and noai and noimageai are informal directives rather than part of any formal standard. A bot that ignores robots.txt will ignore meta tags too.
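If your site runs on a Python WSGI stack, the X-Robots-Tag header can be attached to every response in one place. This is a minimal sketch, assuming a plain WSGI application; `add_x_robots_tag` and `demo_app` are hypothetical names chosen for illustration, and most web servers and CDNs offer an equivalent header setting in their own configuration.

```python
def add_x_robots_tag(app, value="noai, noimageai"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [h for h in headers if h[0].lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<h1>hello</h1>"]


# Wrap once at startup and hand `application` to your WSGI server.
application = add_x_robots_tag(demo_app)
```

As with the meta tag, the header is only a signal: it changes nothing for a crawler that chooses not to read it.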
3. User-agent filtering at the server or WAF level
Instead of asking crawlers to leave, you can actively block them based on their user-agent string. This is done at the web server or web application firewall (WAF) level. In Nginx, for example:
# Place inside a server or location block.
# ~* makes the match case-insensitive.
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider)) {
    return 403;
}
In Apache .htaccess:
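RewriteEngine On
# [NC] makes the user-agent match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider) [NC]
# [F] returns 403 Forbidden for any matching request.
RewriteRule .* - [F,L]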
This is a real block, not a suggestion. A request that matches a blocked user-agent string gets a 403 Forbidden response.
Why this is not enough: Sophisticated bots spoof their user-agent strings. A scraper can claim to be a regular Chrome browser, and your server-level filter will let it through. DataDome’s research found that 80% of AI agents do not declare themselves properly when visiting websites.
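The same user-agent filter can be expressed in application code, which also makes its blind spot easy to see. This is a minimal sketch that mirrors the pattern used in the server rules; `status_for` is a hypothetical helper, and the sample user-agent strings are illustrative rather than exact copies of what any crawler sends.

```python
import re
from typing import Optional

# Same names as the server-level rules, matched case-insensitively.
BLOCKED_AGENTS = re.compile(
    r"GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider", re.IGNORECASE
)


def status_for(user_agent: Optional[str]) -> int:
    """Return the HTTP status a request with this User-Agent header would get."""
    if user_agent and BLOCKED_AGENTS.search(user_agent):
        return 403
    return 200


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",        # declared crawler
        "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0",  # scraper posing as a browser
    ]
    for ua in samples:
        print(status_for(ua), ua)
```

The second sample passes the filter, because the check only ever sees the name the client chooses to send.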
4. IP address blocking
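Some AI companies publish the IP ranges their crawlers use, and you can block traffic from those ranges at the firewall level. OpenAI, for example, publishes its GPTBot IP ranges, so you can confirm that a request claiming to be GPTBot truly comes from OpenAI by checking its source IP against that list. Where an operator documents it, as Google does for Googlebot, a reverse DNS lookup on the IP address is another way to verify the crawler.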
Why this is not enough: Many AI crawlers do not publish their IP ranges. Smaller AI companies and custom scrapers rotate through thousands of IP addresses using botnets, residential proxies, and cloud services. Blocking individual IPs becomes a game of whack-a-mole.
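Checking a source address against a list of published ranges takes only a few lines with Python's standard `ipaddress` module. This is a minimal sketch: the CIDR blocks in `BLOCKED_RANGES` are reserved documentation ranges used as placeholders, not any operator's real ranges, and `is_blocked` is a hypothetical helper. In practice you would load the ranges from the list the crawler operator publishes.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real crawler ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25", "2001:db8::/32")
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IP address, so it cannot match a range.
        return False
    return any(address in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.77", "198.51.100.200", "2001:db8::1", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

The same membership test works in reverse as an allowlist check: a request that claims a crawler's name but arrives from outside that operator's published ranges can be treated as spoofed.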
5. Rate limiting and throttling
Rate limiting restricts how many requests a single IP address or session can make within a time window. This does not stop crawlers entirely, but it slows them down enough to reduce their impact on your servers.
Most WAFs and CDNs (Cloudflare, AWS WAF, Akamai) have built-in rate limiting. You can set thresholds like “no more than 60 requests per minute from a single IP.”
Why this is not enough: AI crawlers that use distributed infrastructure can spread requests across thousands of IPs, staying under each individual rate limit. Rate limiting also risks affecting legitimate users during traffic spikes.
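To make the mechanism concrete, here is a minimal in-memory sketch of a per-client sliding-window limiter. The class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` are assumptions made for illustration; WAFs and CDNs implement rate limiting in their own ways, and a production setup would need shared state across servers.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False  # over the limit; a server would typically answer 429
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
    print([limiter.allow("192.0.2.10") for _ in range(4)])
```

Because the counter is keyed by client, a crawler that spreads its requests over many addresses never trips the limit for any one of them, which is the weakness noted above.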
6. Honeypots and tarpits
A honeypot is a hidden page or link that real users would never visit, but crawlers will follow. Once a bot hits the honeypot, you can fingerprint it and block it.
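A tarpit takes this further. Instead of blocking the bot outright, it feeds it an endless stream of junk content or a maze of fake links, wasting the crawler’s resources. Open-source tools like Nepenthes are built specifically for this purpose. Anubis, another open-source tool, takes a related approach: it makes each client solve a proof-of-work challenge before serving the page, which raises the cost of crawling at scale. Cloudflare’s AI Labyrinth works similarly to a tarpit at the CDN level.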
Why this is not enough: Tarpits work well against unsophisticated crawlers, but advanced bots can detect and avoid honeypots by analyzing link patterns and page structure. Tarpits also require setup and maintenance.
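The honeypot idea reduces to a small piece of bookkeeping: remember which clients touched a trap URL and refuse them afterwards. This is a minimal sketch; `HoneypotTracker`, the trap path `/internal-archive/`, and the use of a plain string as the client fingerprint are all assumptions for illustration. A real deployment would hide the trap link from human visitors, disallow it in robots.txt so compliant crawlers stay away, and use a stronger fingerprint than an IP address.

```python
class HoneypotTracker:
    """Flag any client that requests a trap path, then block it from then on."""

    def __init__(self, trap_paths):
        self.trap_paths = frozenset(trap_paths)
        self.flagged = set()  # fingerprints of clients that hit a trap

    def handle(self, client_id: str, path: str) -> int:
        """Return the HTTP status this request would receive."""
        if client_id in self.flagged:
            return 403
        if path in self.trap_paths:
            self.flagged.add(client_id)
            return 403
        return 200


if __name__ == "__main__":
    tracker = HoneypotTracker(["/internal-archive/"])  # hypothetical hidden URL
    print(tracker.handle("client-a", "/pricing"))            # normal page
    print(tracker.handle("client-a", "/internal-archive/"))  # trap hit
    print(tracker.handle("client-a", "/pricing"))            # now blocked
```

A tarpit would replace the 403 on the trap path with slow or generated junk responses, keeping the crawler busy instead of turning it away.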
7. Bot management platforms
Bot management platforms provide the most comprehensive AI bot protection because they do not rely on any single signal. Instead, they analyze every request in real time using hundreds of signals: TLS fingerprints, behavioral patterns, device characteristics, IP reputation, and more.
This approach catches bots that spoof their user-agent, rotate IPs, and mimic human behavior. It also makes it possible to distinguish between crawlers you want to allow (like Googlebot) and those you want to block, without maintaining manual rules.
PayPal, for example, uses DataDome to stop AI-powered bots at the network edge before they reach internal systems. As Dan Ayash, PayPal’s Director of Advanced Cybersecurity Solutions, explained: “To fight AI-driven bots, you have to understand what they’re trying to do, not just who they are.”
Travel platform Dohop faced a similar challenge. Scraping bots were flooding its booking engine and generating unnecessary API calls to its 75+ airline partners. Manual firewall rules could not keep up. After deploying DataDome, Dohop cut bot traffic by 70% during peak travel season, blocking over 3 million malicious requests in a single month.
Why is blocking AI bots harder than blocking regular crawlers?
Blocking AI bots is harder because they do not play by the same rules as traditional crawlers. They spoof their identity, rotate through thousands of IP addresses, simulate real browser sessions, and adapt their behavior when they encounter resistance. Standard defenses like robots.txt, user-agent filtering, and IP blocking were designed for crawlers that identify themselves honestly and follow access policies. AI bots routinely do neither. Here is what makes them so difficult to stop.
Identity spoofing is widespread. DataDome’s AI Traffic Report recorded 7.9 billion AI agent requests in January and February 2026 alone, and found that known agents are routinely impersonated. A spoofed PerplexityBot or ChatGPT-User string turns your allowlist into an open door.
Agentic browsers blur the line between bots and users. Beyond traditional crawlers, agentic browsers now simulate full browser sessions, rendering JavaScript and interacting with pages in ways that are difficult to distinguish from real users.
AI crawlers adapt in real time. As Jérôme Segura, VP of Threat Research at DataDome, put it: “[AI agents] mimic human behavior, spawn synthetic browsers, bypass CAPTCHAs, and adapt in real time. Traditional defenses, built to spot static automation, are collapsing under this complexity.”
Manual rules cannot keep pace. Before deploying DataDome, CAPFM’s cybersecurity team was managing dozens of WAF configuration files and rule sets. “Each new bot pattern meant new manual adjustments,” said Kilian Chiarelli, Cybersecurity Engineer at CAPFM. “It worked, but it was time-consuming.” That is why CAPFM switched to DataDome’s adaptive, AI-powered approach that evolves with the threats.
The protection gap is growing. DataDome’s Global Bot Security Report found that only 2.8% of websites were fully protected in 2025, down from 8.4% in 2024. As AI crawlers become more sophisticated, legacy defenses are falling behind.
How does DataDome protect against AI crawlers and bots?
DataDome is a bot and agent trust management platform, recognized as a G2 Leader across Bot Mitigation, DDoS Protection, Fraud Prevention, and Web Security Software. It analyzes every request to your website, app, or API and determines in under 2 milliseconds whether that request comes from a human, a legitimate bot, or a malicious automated threat.
What makes DataDome’s approach different from the methods listed above:
Intent-based detection. DataDome does not rely on user-agent strings or IP lists alone. Its AI engine evaluates intent by analyzing thousands of signals per request, including TLS fingerprints, device characteristics, behavioral patterns, and network reputation. This catches bots that spoof their identity or mimic human behavior.
AI crawler classification. DataDome automatically groups all LLM crawlers and AI agents into a dedicated category, giving you visibility into which models are accessing your content, how often, and for what purpose. You can then set policies per crawler: allow, block, challenge, or throttle.
Granular control without manual rules. Rather than maintaining static blocklists, DataDome’s detection models adapt as threats change. CAPFM uses this to gradually allowlist AI user agents based on observed behavior, only granting access after a 30-day monitoring period.
Real-time protection at scale. DataDome processes 5 trillion signals daily and stops over 350 billion attacks annually. Trusted by companies like PayPal, Tripadvisor, Etsy, and SoundCloud, it deploys in minutes on any web architecture and runs on autopilot.
DataDome Intel, a public database of AI crawlers, automation tools, and spoofing frameworks, is freely available to any security team that needs visibility into what is hitting their infrastructure.
Web crawler FAQs
How do I block GPTBot from my website?
Add User-agent: GPTBot followed by Disallow: / to your robots.txt file. For stronger enforcement, block GPTBot’s user-agent string at the server level and check that requests claiming to be GPTBot come from OpenAI’s published IP ranges.
Can AI bots bypass CAPTCHAs?
Yes. Modern AI-powered bots can solve traditional CAPTCHAs faster and more accurately than humans. That is why behavioral analysis and device fingerprinting have largely replaced CAPTCHA puzzles as the primary bot detection method.
Is it legal to block web crawlers?
Yes. Website owners have the right to control access to their servers. Blocking crawlers through robots.txt, server configuration, or bot management tools is a standard and widely accepted practice. Some jurisdictions have also ruled that ignoring a site’s robots.txt or terms of service can constitute unauthorized access.