Siteimprove Crawl
What is Siteimprove Crawl?
Siteimprove Crawl (Siteimprove Bot) is the web crawler used by Siteimprove’s digital governance platform to scan customer websites. It respects robots.txt, identifies itself via a Siteimprove user-agent, and audits pages for quality, accessibility, SEO, performance, and policy compliance.
Legitimate use cases:
– QA: broken links, spelling, redirects, orphan pages
– Accessibility: WCAG compliance checks
– SEO: metadata, sitemaps, indexing issues
– Performance: page weight, response times
– Governance: content inventory, policy violations
Illicit or fraud-related misuse (by adversaries, not Siteimprove):
– User-agent spoofing to evade bot blocks
– Aggressive crawling for scraping/IP theft or availability impact
– Reconnaissance: enumerating hidden URLs, staging/admin paths, sensitive files
– Email/content harvesting for phishing or social engineering
Mitigations:
– Validate the crawler's source via reverse DNS lookups or the IP allowlists Siteimprove provides
– Enforce robots.txt plus bot management and rate limiting
– Monitor anomalies; block spoofed UAs with behavioral checks
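The reverse DNS validation mentioned above is usually done as a forward-confirmed lookup: resolve the client IP to a hostname, check the hostname against a trusted suffix, then resolve that hostname back and confirm it returns the same IP. The following is a minimal sketch of that idea, not Siteimprove's documented procedure. The suffix in TRUSTED_SUFFIXES is a hypothetical placeholder, and the function name is an illustration only.

```python
import socket

# Hypothetical placeholder: replace with the hostname suffixes that
# Siteimprove actually documents for its crawler hosts.
TRUSTED_SUFFIXES = (".crawler.example.com",)


def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IPv4 client address."""
    try:
        # PTR lookup: IP -> hostname
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record or resolver failure

    # The hostname must end with one of the trusted suffixes.
    if not hostname.lower().rstrip(".").endswith(tuple(suffixes)):
        return False

    try:
        # Forward lookup: hostname -> list of IPv4 addresses
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False

    # A spoofed PTR record will not resolve back to the client IP.
    return ip in addresses
```

The lookup functions are parameters so the check can be exercised without network access. A request that fails this check while sending a Siteimprove user-agent is a candidate for the behavioral blocking described above.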
Why is Siteimprove Crawl crawling my site?
It is most likely auditing your domain because a stakeholder, vendor, or affiliate added your site to their Siteimprove monitoring, or because a project they track references your content. Expect it to request every discoverable URL, including parameterized URLs and deep links, to assess site health and compliance.
Potential downsides:
– Elevated crawl load that causes latency spikes or cache churn
– Inflated analytics and bot traffic that can mask fraud or MITM signals
– Noisy logs that hinder threat hunting
– Accidental exposure of unlinked or test/staging endpoints
– Triggered WAF/IDS rules or rate limits that affect real users
– Unnecessary bandwidth and CDN costs
If sensitive paths are publicly discoverable (sitemaps, JS-generated links, open directories), they may be enumerated and stored in third-party systems. Coordinate with internal teams and vendors to confirm who initiated the scans and what their scope is.
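Before contacting vendors, it can help to measure what the crawler is actually requesting. The sketch below counts requests per source IP and per path for log lines whose user-agent contains "siteimprove". It assumes the common "combined" access log format; the sample user-agent string in the usage example is illustrative, not the crawler's exact string.

```python
import re
from collections import Counter

# Combined log format: ip, identity, user, [time], "request", status,
# bytes, "referer", "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize_crawl(lines, ua_marker="siteimprove"):
    """Return (requests per IP, requests per path) for matching user-agents."""
    by_ip, by_path = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match or ua_marker not in match.group("ua").lower():
            continue  # unparseable line or a different client
        by_ip[match.group("ip")] += 1
        # Drop the query string so parameterized URLs group under one path.
        by_path[match.group("path").split("?", 1)[0]] += 1
    return by_ip, by_path


sample = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /staging/?page=2 HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Siteimprove Crawl)"',
    '198.51.100.9 - - [10/Oct/2024:13:55:40 +0000] "GET / HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
ips, paths = summarize_crawl(sample)
print(ips.most_common(5), paths.most_common(5))
```

Paths such as staging or admin endpoints appearing in the per-path counts are a sign that unlinked URLs are discoverable and worth scoping out of the crawl.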
How to block Siteimprove Crawl?
1) User-Agent filtering at the web server
Nginx: if ($http_user_agent ~* "Siteimprove Crawl") { return 403; }
Apache:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Siteimprove Crawl" [NC]
RewriteRule .* - [F]
2) IP/ASN/network blocking
Block the IP ranges or hosting ASNs used by Siteimprove Crawl once you have identified them and confirmed the traffic is unwanted.
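At the application layer, a range check of this kind can be sketched with Python's standard ipaddress module. The networks listed here are reserved documentation ranges used as placeholders, not Siteimprove's addresses; substitute the ranges you have actually identified.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real
# Siteimprove ranges. Replace with the ranges you have identified.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_blocked(ip, networks=BLOCKED_NETWORKS):
    """Return True if the client IP falls inside any blocked network.

    Raises ValueError if `ip` is not a valid IPv4 or IPv6 address.
    """
    address = ipaddress.ip_address(ip)
    # Membership is False when the address and network versions differ,
    # so IPv4 and IPv6 ranges can share one list.
    return any(address in network for network in networks)
```

In practice the same list is often enforced at the firewall or CDN instead, which keeps blocked traffic away from the application entirely.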
3) Rate limiting and dynamic banning
Use Nginx limit_req or similar to throttle high-frequency requests from this bot; optionally use fail2ban for auto-blocking.
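To illustrate the throttling idea in application code, here is a small sliding-window limiter keyed by client (for example, by IP address). Note that this is a different algorithm from Nginx limit_req, which uses a leaky bucket; the class name and parameters are illustrative assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)  # client key -> request timestamps

    def allow(self, key):
        now = self.clock()
        timestamps = self.hits[key]
        # Discard timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # over the limit: caller would respond with 429
        timestamps.append(now)
        return True
```

A caller would typically return HTTP 429 when allow() is False, and could feed repeated rejections for the same key into an automatic ban list, which is the role fail2ban plays at the log level.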
4) JavaScript token + honeypot traps
Require JS-generated signed cookies/tokens; add honeypot URLs and block any Siteimprove Crawl agent that touches them.
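The token and honeypot approach can be sketched server-side as follows. The secret key, honeypot path, and function names are hypothetical, and the sketch assumes the token is issued to the browser by a JavaScript challenge that is not shown here.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-random-secret"   # hypothetical; load from config
HONEYPOT_PATHS = {"/__trap/do-not-follow"}     # hypothetical hidden URL
banned_clients = set()


def sign_token(client_id, expires_at, key=SECRET_KEY):
    """Build '<expiry>.<hmac>' bound to the client and an expiry timestamp."""
    message = f"{client_id}|{expires_at}".encode()
    signature = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"{expires_at}.{signature}"


def verify_token(client_id, token, now, key=SECRET_KEY):
    """Check expiry first, then compare signatures in constant time."""
    expires_str, _, signature = token.partition(".")
    try:
        expires_at = int(expires_str)
    except ValueError:
        return False
    if expires_at < now:
        return False
    expected = sign_token(client_id, expires_at, key).partition(".")[2]
    return hmac.compare_digest(signature.encode(), expected.encode())


def should_block(client_id, path, token, now):
    """Ban clients that touch a honeypot; otherwise require a valid token."""
    if path in HONEYPOT_PATHS:
        banned_clients.add(client_id)
    if client_id in banned_clients:
        return True
    return not (token and verify_token(client_id, token, now))
```

Honeypot URLs should also be disallowed in robots.txt, so that a crawler that respects robots.txt never reaches them and only non-compliant clients are banned.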
Block and Manage Siteimprove Crawl with DataDome
See which bots and AI agents bypass your defenses
Create your account to start analyzing and mitigating malicious bots and AI-driven threats in real time