How to Detect Web Scraping Attacks

What is scraping?

Web scraping attacks, or simply “scraping”, occur when bots automatically collect data from your website, often for malicious purposes like content reselling and price undercutting. Scraper bots mimic real users on regular browsers to access websites, where they extract the data the bot programmer wants to store in local databases.

How do scrapers differ from other types of bots?

Contrary to bots like scalpers, which don’t need to perform many requests, scrapers frequently need to make millions of requests to scrape web pages. Scraper bots are designed to be profitable despite the high volume of requests they execute.

For example, scrapers may use lower quality proxies than scalper bots, but they tend to use the same underlying technologies as other bots. Scrapers can be based on automated (headless) browsers, or they can leverage HTTP clients like aiohttp and Axios.

Scraper bots can be custom made, or fraudsters can leverage different specialized frameworks, such as Scrapy, to make the creation of scrapers easier.

Many bots-as-a-service (BaaS) offerings also specialize in scraping: their customers only have to call an API to scrape websites.

How are scraping attacks detected?

Like other types of attacks, scraping attacks can be detected using three main types of signals:

Behavior
Reputation
Signature

Behavior

Behavioral signals can be collected on both the server side and the client side.

On the server side, the engine analyzes how a user is browsing a website or mobile app to detect suspicious outliers in the number of requests over time—because a bot can make requests much faster than any human.
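As a rough illustration of the server-side idea, the sketch below counts requests per client over a trailing time window and flags keys that exceed a limit. The class name, the 60-second window, and the 120-request limit are all hypothetical choices for this example, not values from any particular detection engine, and a real engine would look at far more than raw request counts.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts requests per client key over a trailing time window.

    Hypothetical helper: the window and limit are illustrative assumptions.
    """

    def __init__(self, window_seconds=60.0, max_requests=120):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._hits = defaultdict(deque)  # key -> timestamps still in the window

    def record(self, key, timestamp):
        """Record one request; return True if the key now exceeds the limit."""
        hits = self._hits[key]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the trailing window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

The key could be a session identifier or an IP address; which one is appropriate depends on how heavily the addresses are shared.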

On the client side, JavaScript (for websites) or an SDK (for mobile apps) will collect details on events in the browser, such as clicks, touch events, typing speed, and mouse movements. These details can then be analyzed by machine learning (ML) models to detect whether the interactions are consistent with human behavior.
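To make the client-side idea concrete, here is a deliberately simple stand-in for an ML model: it measures how regular the gaps between interaction events are, on the assumption that scripted input is often more evenly spaced than human input. The function names and the 0.1 threshold are hypothetical, and a production model would combine many more features than event timing.

```python
from statistics import mean, pstdev


def interval_variation(timestamps_ms):
    """Coefficient of variation of the gaps between consecutive events.

    Returns None when there are too few events to say anything.
    """
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg <= 0:
        return 0.0
    return pstdev(gaps) / avg


def looks_scripted(timestamps_ms, min_variation=0.1):
    """Flag event streams whose timing is suspiciously regular.

    min_variation is an illustrative assumption, not a tuned value.
    """
    variation = interval_variation(timestamps_ms)
    return variation is not None and variation < min_variation
```

For example, `looks_scripted([0, 100, 200, 300, 400])` flags the perfectly regular stream, while an irregular stream such as `[0, 130, 210, 480, 560]` is not flagged.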

Reputation

Reputational signals are computed only on the server side, at different levels of granularity (like IP address or user session) and over different time windows (like minutes, hours, days, or months).

With reputational signals, detection engines can use prior knowledge to adjust decisions. For example, if a certain autonomous system is often linked to scraping, ML models will decide to be more aggressive with blocking traffic coming from that system.
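One way to picture this adjustment is a smoothed historical bad-traffic rate per autonomous system that lowers the score needed to block. The sketch below is an assumption-laden illustration: the function names, the 5% prior rate, the prior weight of 100, and the linear scaling are all invented for this example rather than taken from a real detection engine.

```python
def smoothed_bad_rate(bad_requests, total_requests, prior_rate=0.05, prior_weight=100):
    """Historical share of bad traffic, pulled toward a prior when data is sparse.

    prior_rate and prior_weight are hypothetical defaults.
    """
    return (bad_requests + prior_rate * prior_weight) / (total_requests + prior_weight)


def adjusted_threshold(base_threshold, bad_rate):
    """Lower the blocking threshold as the historical bad rate rises."""
    return base_threshold * (1.0 - bad_rate)
```

With these assumptions, an autonomous system with no history keeps a threshold close to the base value, while one with a long record of scraping traffic is blocked at a much lower score.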

Since scrapers need to scale their attacks in order to scrape thousands or millions of pages, they tend to heavily rely on proxies. The most advanced scrapers use residential proxies to access IP addresses that are similar to human users. That’s why it’s important to be able to detect proxies to stop scrapers.

Signature

Signature signals are collected both on the server side and the client side, and can include:

HTTP Fingerprints: Details on the HTTP headers (server side).
TLS Fingerprints: Metadata extracted during the TLS handshake (server side).
Browser Fingerprints: JavaScript (JS) collects information about the operating system (OS), browser, and device (client side, in the browser).
Mobile Fingerprints: An SDK collects information about the OS and device (client side, in a mobile application).
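As a minimal sketch of a server-side HTTP fingerprint, the code below hashes the order of header names and checks whether the User-Agent openly declares one of the HTTP clients or frameworks mentioned earlier (aiohttp, Axios, Scrapy). The function names and the marker list are hypothetical, and because a User-Agent is trivial to change, this check only catches clients left on their default settings.

```python
import hashlib

# Library names that typically appear in the default User-Agent of these tools.
KNOWN_CLIENT_MARKERS = ("aiohttp", "axios", "scrapy")


def header_fingerprint(headers):
    """Short hash of the header names in the order they were received.

    headers: list of (name, value) tuples, in arrival order.
    """
    names = ",".join(name.lower() for name, _ in headers)
    return hashlib.sha256(names.encode("utf-8")).hexdigest()[:16]


def declares_http_client(headers):
    """Return True if the User-Agent names a known HTTP client or framework.

    Returns False when no User-Agent header is present.
    """
    for name, value in headers:
        if name.lower() == "user-agent":
            user_agent = value.lower()
            return any(marker in user_agent for marker in KNOWN_CLIENT_MARKERS)
    return False
```

Two requests that send the same headers in a different order get different fingerprints, which is the point: header order tends to be a property of the client software rather than of the user.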

The most thorough detection will always leverage browser and mobile fingerprints, because advanced solutions—made with JS or an SDK—can detect popular headless browsers and automation frameworks, like headless Chrome, Puppeteer, Playwright, and Selenium.

Client-side challenges can also help detect and track modified bot frameworks frequently used by scrapers that aim to bypass traditional bot detection techniques.

Questions & Misconceptions About Scraping Detection & Protection

Most websites and mobile applications implement countermeasures against scrapers, which can include CAPTCHAs, rate limiting, web application firewalls (WAFs), etc. But some of the common countermeasures are not enough to protect your website against sophisticated scrapers—and worse, some of them might engender false positives.

Are traditional CAPTCHAs enough against scrapers?

No. Most scrapers can get past traditional CAPTCHAs, using AI-based image or audio recognition or CAPTCHA farms, in which human workers solve CAPTCHA challenges on behalf of bots.

On top of that, showing CAPTCHAs to real human users (that is, on “false positives”) significantly degrades their user experience.

Does using IP-based rate limiting on my website and API endpoints keep me safe?

While IP-based rate limiting can stop the most simple bots (ones that operate from just one or a few IPs), it won’t catch the most sophisticated scrapers. Sophisticated scrapers leverage proxies to distribute their attacks across thousands of different IP addresses. Thus, each IP address makes only a few requests—which enables the attacker to stay under the rate-limiting threshold.
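The arithmetic behind this evasion is simple enough to sketch. The hypothetical helper below estimates how many IP addresses an attacker needs so that no single address exceeds a per-IP limit; the request volume, limit, and number of windows in the example are made-up figures used only to show the scale.

```python
import math


def min_ips_to_stay_under_limit(total_requests, per_ip_limit, windows):
    """Smallest number of IPs that keeps every IP at or below the limit.

    per_ip_limit: allowed requests per IP per rate-limit window.
    windows: number of windows over which the attack is spread.
    """
    per_ip_capacity = per_ip_limit * windows
    return math.ceil(total_requests / per_ip_capacity)


# Example: 1,000,000 requests spread over 24 hourly windows,
# with a limit of 100 requests per IP per hour.
needed = min_ips_to_stay_under_limit(1_000_000, per_ip_limit=100, windows=24)
```

Under these assumed numbers, a few hundred addresses are enough, which is well within the "thousands of different IP addresses" available through proxy services.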

Moreover, blocking an entire IP address is dangerous because many IP addresses are heavily shared. In fact, most mobile IP addresses are shared by hundreds or thousands of users at any given time. Thus, blocking the IP can result in many false positives (challenging real human users) that hurt your UX and frustrate your consumers.

Is blocking all traffic from data center IPs enough to stop scrapers?

Unfortunately, blocking all data center IP traffic is not enough, and worse, it will trigger false positives. A lot of legitimate traffic originates from data center IPs, including VPN users and big corporate proxies. You don’t want to block your legitimate users.

On top of that, attackers have access to millions of residential proxies—not just data center proxies. Several proxy services provide access to residential IPs for a few dollars per gigabyte of bandwidth. Thus, attackers can use IPs that belong to well known internet service providers (ISPs) such as Comcast, AT&T, and Verizon—just like your real users.

If I use geoblocking to block all traffic coming from countries where my business does not operate, can scrapers bypass it?

While geoblocking may stop simple scrapers operating from a single IP or from foreign data center proxies, it won’t stop more sophisticated attackers that leverage residential proxies. Residential proxy networks allow fraudsters to select proxies that are located in specific countries.

What we observe at DataDome is that most attackers select proxies that are located in the same country as the websites they target, which helps them appear more human and bypass geoblocking techniques.

Geoblocking also engenders false positives, as some of your users may be traveling or living abroad temporarily. Moreover, keep in mind that IP address location is not 100% accurate. Thus, there might be some country misclassification in the IP location database, and geoblocking can create false positives on these IPs.

Can my WAF prevent scraping?

No, not entirely. WAFs are no match for today’s sophisticated scraper bots (and may have difficulty preventing price scraping) because WAFs are designed to detect and filter malicious traffic using a set of binary rules. Although yesterday’s simple bots and known threats may be bound by the rules designated in your WAF, scrapers now have easy access to sophisticated bots that use proxies and ML to mimic human behavior.

Today’s sophisticated scrapers can easily bypass rules-based security tools like WAFs.

Detect & Prevent Scraping Attacks With DataDome

Scraping is one of the most widespread bot attacks on the internet today, even if you haven’t noticed it on your website, mobile app, or API yet. Our data shows that scraping is emerging as a gateway threat, leading to higher-impact attacks, such as scalping.

A real-time bot protection solution like DataDome will detect scraping attacks and protect you, all without adding friction to the user experience. Try DataDome free for 30 days and get a real-time overview with detailed visibility of all automated threats to your mobile apps, websites, and/or APIs. No credit card required. Create a free account here. (No scrapers allowed!)