Siteimprove Crawl

What is Siteimprove Crawl?

Siteimprove Crawl (Siteimprove Bot) is the web crawler used by Siteimprove’s digital governance platform to scan customer websites. It respects robots.txt, identifies itself via a Siteimprove user-agent, and audits pages for quality, accessibility, SEO, performance, and policy compliance.

Legitimate use cases:
– QA: broken links, spelling, redirects, orphan pages
– Accessibility: WCAG compliance checks
– SEO: metadata, sitemaps, indexing issues
– Performance: page weight, response times
– Governance: content inventory, policy violations

Illicit or fraud-related misuse (by adversaries, not Siteimprove):
– User-agent spoofing to evade bot blocks
– Aggressive crawling for scraping/IP theft or availability impact
– Reconnaissance: enumerating hidden URLs, staging/admin paths, sensitive files
– Email/content harvesting for phishing or social engineering

Mitigations:
– Validate via reverse DNS/IP allowlists from Siteimprove
– Enforce robots.txt plus bot management and rate limiting
– Monitor anomalies; block spoofed UAs with behavioral checks
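The validation step above can be sketched as a small classifier that compares the claimed user-agent against an IP allowlist. This is only an illustration: the networks below are RFC 5737 documentation ranges, not real Siteimprove addresses, and the function name `classify_request` and the "siteimprove" substring match are assumptions. The actual ranges and user-agent string should come from Siteimprove.

```python
import ipaddress

# Hypothetical allowlist. These are RFC 5737 documentation ranges, NOT real
# Siteimprove addresses; replace them with the ranges Siteimprove provides.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Return 'verified', 'spoofed', or 'other' for a single request."""
    # Assumption: the crawler's user-agent contains the word "Siteimprove".
    if "siteimprove" not in user_agent.lower():
        return "other"
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in ALLOWED_NETWORKS):
        return "verified"
    # Claims to be Siteimprove but comes from an address outside the allowlist.
    return "spoofed"
```

A request classified as "spoofed" is a candidate for blocking or for closer behavioral inspection, while "verified" traffic can be handled by robots.txt and rate limits alone.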

Why is Siteimprove Crawl crawling my site?

Siteimprove Crawl is most likely auditing your domain because a stakeholder, vendor, or affiliate added your site to their monitoring, or because your content is referenced in a project they track. Expect it to probe all discoverable URLs, including parameterized URLs and deep links, to assess site health and compliance. Potential downsides include elevated crawl load that causes latency spikes or cache churn; inflated analytics and bot traffic that mask fraud or MITM signals; noisy logs that hinder threat hunting; accidental exposure of unlinked test or staging endpoints; WAF/IDS rules or rate limits being triggered in ways that affect real users; and unnecessary bandwidth or CDN costs. If sensitive paths are publicly discoverable (through sitemaps, JS-generated links, or open directories), they may be enumerated and stored in third‑party systems. Coordinate with internal teams and vendors to confirm who initiated the scans and what their scope is.
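To see what the crawler is actually requesting before deciding on scope, one option is to summarize its hits from the web server access log. The sketch below assumes the common "combined" log format and assumes the user-agent contains "Siteimprove"; the function name `summarize_crawler_hits` is hypothetical.

```python
import re
from collections import Counter

# Extracts path, status, and user-agent from a combined-format access log line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize_crawler_hits(lines, ua_marker="Siteimprove"):
    """Count requested paths and response statuses for one crawler."""
    paths = Counter()
    statuses = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        # Skip lines that do not parse or belong to other clients.
        if match and ua_marker.lower() in match.group("ua").lower():
            paths[match.group("path")] += 1
            statuses[match.group("status")] += 1
    return paths, statuses
```

Calling `paths.most_common(20)` on the result shows the most frequently crawled URLs, which makes it easier to spot staging or admin paths that should not be publicly discoverable.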

How to block Siteimprove Crawl?

1) User-Agent filtering at the web server
Nginx: if ($http_user_agent ~* "Siteimprove Crawl") { return 403; }
Apache:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "(?i)Siteimprove Crawl"
RewriteRule .* - [F]

2) IP/ASN/network blocking
If you have identified the IP ranges or hosting ASNs that Siteimprove Crawl uses and you do not want its traffic, block them at the network level.

3) Rate limiting and dynamic banning
Use Nginx limit_req or similar to throttle high-frequency requests from this bot; optionally use fail2ban for auto-blocking.
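For applications that cannot rely on the web server for throttling, the same idea can be approximated in application code. The sketch below is a sliding-window limiter, which is a different algorithm from the leaky bucket that Nginx limit_req uses; the class name `SlidingWindowLimiter` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent timestamps

    def allow(self, client_key: str, now: float) -> bool:
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject (e.g. respond with 429)
        hits.append(now)
        return True
```

The client key could be the remote IP or a combination of IP and user-agent. Passing `now` explicitly (for example from `time.monotonic()`) keeps the limiter easy to test.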

4) JavaScript token + honeypot traps
Require JS-generated signed cookies/tokens; add honeypot URLs and block any Siteimprove Crawl agent that touches them.
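A minimal sketch of the server side of this approach follows, assuming the token is an HMAC over the client IP and an issue timestamp and that client-side JavaScript stores it in a cookie. The secret, the honeypot path, the TTL, and all function names are hypothetical placeholders.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-random-secret"  # hypothetical; keep it private
HONEYPOT_PATHS = {"/trap/do-not-follow"}       # hypothetical hidden URL
TOKEN_TTL_SECONDS = 300                        # assumed token lifetime


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(client_ip: str, issued_at: int) -> str:
    """Build a token of the form 'ip|timestamp|signature'."""
    payload = f"{client_ip}|{issued_at}"
    return f"{payload}|{_sign(payload)}"


def verify_token(token: str, client_ip: str, now: int) -> bool:
    """Accept only unexpired tokens signed by us for this client IP."""
    parts = token.split("|")
    if len(parts) != 3:
        return False
    ip, issued_at, signature = parts
    expected = _sign(f"{ip}|{issued_at}")
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    # The signature is valid, so issued_at is the integer we wrote earlier.
    age = now - int(issued_at)
    return ip == client_ip and 0 <= age <= TOKEN_TTL_SECONDS


def is_honeypot_hit(path: str) -> bool:
    """True if the request touched a URL that no real visitor should see."""
    return path in HONEYPOT_PATHS
```

A client that requests a honeypot path, or that never presents a valid token, can then be added to a blocklist. Honeypot URLs should also be disallowed in robots.txt so that compliant crawlers are not penalized for following them.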

Block and Manage Siteimprove Crawl with DataDome

With the advanced technology behind DataDome's Cyberfraud Protection Platform, you can detect and block bots that threaten your website or application. By stopping bots in their tracks, DataDome safeguards your systems from attacks like scraping, account takeover, credential stuffing, and DDoS. This robust protection ensures the integrity of your data and enhances your overall security posture.

Related SEO & Analytics Bots

Bot Name	Operator	Category
Ahrefs Site Audit	Ahrefs	SEO & Analytics Bots
Yandex.Metrica	Yandex LLC	SEO & Analytics Bots
AhrefsBot	Ahrefs Pte Ltd	SEO & Analytics Bots
Site Improve	Siteimprove A/S	SEO & Analytics Bots
botify	Botify SAS	SEO & Analytics Bots
