What is Google-Extended?

Google-Extended is a control token that Google introduced in September 2023 to give website owners a say in whether their content is used for AI training, separately from standard Google Search indexing. It plays no part in indexing content for search results; it governs whether content that Google crawls may be used for products such as Gemini (formerly Bard) and Vertex AI. Google documents it as a standalone product token for robots.txt rather than a separate crawler: it has no HTTP user-agent string of its own, and the crawling itself is done by Google's existing user agents.

Because it is neither an indexing crawler nor a ranking signal, Google-Extended does not offer search engine optimization (SEO) gains, better content relevancy in search results, or site analytics, and it is not a tool for competitive analysis or market trend identification. Its use cases are correspondingly narrow and sit with publishers, for example a news site that wants its articles indexed by Google Search but not used to train generative models.

The primary benefit of Google-Extended is that it gives publishers granular control over their site’s inclusion in Google’s AI training datasets, without affecting how their pages appear in search results.
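In practice, this control is expressed as a robots.txt rule that names the Google-Extended token. The sketch below is an illustration rather than an official recipe: the robots.txt content, the helper name `is_allowed`, and the example.com URL are assumptions chosen for the example. It uses Python's standard `urllib.robotparser` to check that such a rule disallows Google-Extended while leaving Googlebot unaffected. Python's parser is not Google's, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of AI training use, keep everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print("Google-Extended:", is_allowed(ROBOTS_TXT, "Google-Extended", url))
    print("Googlebot:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

With this file in place, search crawling continues as before, and only the Google-Extended token is told to stay out.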

Why is Google-Extended crawling my site?

How to block Google-Extended?

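To block “Google-Extended” from accessing a website, you can implement several server-side strategies that build on existing web server configuration and custom scripting. Keep in mind that the control Google documents for Google-Extended is a robots.txt rule; the server-side methods here act on incoming requests in general. Here are four methods: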

 

1. HTTP Headers Check:
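Implement server-side checks that inspect the HTTP headers of incoming requests. Bots often have distinctive patterns in their user-agent strings or other header fields, and you can reject requests whose headers match. Because Google-Extended has no HTTP user-agent string of its own, a rule like the following only catches clients that literally send “Google-Extended” in that header. For example, in Apache: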


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC]
RewriteRule .* - [F,L]

 

2. IP Address Blocking:
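You can block bots by IP if required, but this is a blunt tool here: Google-Extended is addressed through its user-agent token rather than through dedicated IP addresses, and Google’s crawler IP ranges, which are officially documented for verification, may change. Blocking those ranges would also block Googlebot and affect search indexing. For instance, in an Apache server: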


<Directory "/var/www/html">
    # Apache 2.4 syntax: allow everyone except the listed range
    <RequireAll>
        Require all granted
        Require not ip 192.168.1.0/24
    </RequireAll>
</Directory>

 

Replace `192.168.1.0/24` with the actual IP range.
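If you prefer to apply the same rule in application code, the check reduces to testing a client address against a list of CIDR ranges. The following is a minimal sketch using Python's standard `ipaddress` module; `BLOCKED_NETWORKS` and `is_blocked` are hypothetical names, and `192.168.1.0/24` is the same placeholder range as above, not a real Google range.

```python
import ipaddress

# Placeholder range only; substitute the ranges you actually want to block.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.168.1.0/24")]


def is_blocked(client_ip: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Unparseable address: this sketch does not block it; adjust to your policy.
        return False
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for ip in ("192.168.1.42", "192.168.2.1", "not-an-ip"):
        print(ip, is_blocked(ip))
```

Keeping the ranges in one list makes it easier to update them when the published ranges change.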

 

3. Rate Limiting:
Implement rate limiting to prevent rapid access patterns typical of bots. This can be configured at the web server level or through scripting to track and limit request rates from a single IP address or session.
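As one way to do this in scripting, the sketch below keeps a sliding window of request timestamps per client and rejects requests once the window is full. It is an in-memory illustration under stated assumptions: the class name `SlidingWindowLimiter`, its parameters, and the limits used are hypothetical, and a production setup would more likely rely on the web server's or CDN's own rate limiting.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
    # Five rapid requests from one placeholder address.
    print([limiter.allow("203.0.113.7") for _ in range(5)])
```

The client identifier can be an IP address or a session ID, matching the two options described above.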

 

4. CAPTCHA Challenges:
While typically used for form submissions, CAPTCHAs can help restrict bots that ignore robots.txt rules. Reputable crawlers, including Google’s, honor robots.txt and do not attempt to bypass such mechanisms, so CAPTCHAs matter mainly for non-compliant bots. Implementing a CAPTCHA system requires modifying your site to prompt for verification when suspicious behavior is detected.

 

Each of these methods has its strengths and limitations, and often a layered approach combining several strategies will provide the most robust defense against unwanted bot traffic.

Block and Manage Google-Extended with DataDome

With the advanced technology behind DataDome's Cyberfraud Protection Platform, you can detect and block bots that threaten your website or application. By stopping bots in their tracks, DataDome safeguards your systems from attacks like scraping, account takeover, credential stuffing, and DDoS. This robust protection ensures the integrity of your data and enhances your overall security posture.