DataDome

BaaS: The Secret Weapon for Heavily Distributed Scraping

In the past, anyone who wanted to scrape a website needed to know how to code bots, forge fingerprints, and rotate proxies consistently. Now, bot-as-a-service (BaaS) providers let users run bots at scale without any bot development or reverse engineering knowledge.

A BaaS is simply a REST API to which a user submits the URL they want to scrape. The business model is simple: users only pay when their request succeeds. If the request is blocked, the user pays nothing. They also don't need to worry about proxy bandwidth, which can become expensive when using residential proxies.

BaaS providers also make money by charging for concurrency (i.e., the more requests a user wants to run in parallel, the more they have to pay). BaaS services like ScraperAPI have gained popularity among attackers due to their convenience and cost effectiveness.

How do they work?

A user can send any URL they want to scrape to the BaaS API, along with any further instructions. For example, a user might specify whether or not they want to use residential proxies (which cost more per successful request than data center proxies) and whether or not they want to execute JavaScript (requiring the service to spawn a real/headless browser).

Users can also specify the location of their proxies to have IP addresses located in the same country as the website/application they want to scrape. Thus, to call mywebsite.com/product/a, they could send an API request such as:

https://baas-service.com/?url=https://mywebsite.com/product/a&js=true&residential_proxy=true&location=usa

The request above would enable the user to scrape mywebsite.com/product/a with a real/headless browser using residential proxies located in the US.
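Note that the target URL travels as a query parameter, so in practice it would need to be percent-encoded. Here is a minimal Python sketch of how such a call could be assembled. The endpoint and the parameter names (url, js, residential_proxy, location) are the hypothetical ones from the example above, not a real provider's API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, taken from the illustrative URL in the article.
BAAS_ENDPOINT = "https://baas-service.com/"


def build_baas_request(target_url, js=True, residential_proxy=True, location="usa"):
    """Return the BaaS API URL with the target URL percent-encoded."""
    params = {
        "url": target_url,
        "js": str(js).lower(),
        "residential_proxy": str(residential_proxy).lower(),
        "location": location,
    }
    return f"{BAAS_ENDPOINT}?{urlencode(params)}"


request_url = build_baas_request("https://mywebsite.com/product/a")
print(request_url)

# The service can recover the original target URL from the query string.
decoded = parse_qs(urlsplit(request_url).query)
print(decoded["url"][0])
```

Encoding the target URL keeps its own `?` and `&` characters, if it has any, from being mistaken for parameters of the BaaS call.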

In return, the BaaS will start to send one or more requests to mywebsite.com/product/a. If mywebsite.com is not protected, the BaaS will automatically get the content of the page and return it to the user. Otherwise, if the website has some protection, the request may get blocked.

If the request is blocked, the BaaS will make several requests in parallel in an attempt to bypass the protection. For example, they may:

  • Rotate user agent.
  • Spoof new HTTP headers.
  • Change IP address by using new proxies.
  • Forge a CAPTCHA.

If, at some point, the BaaS gets the content without being blocked, it returns that content to the user. The user only pays for a single API call, even if the BaaS had to make dozens of requests to retrieve the page.

However, if mywebsite.com’s protection completely stops the request, the BaaS will simply return an error message and the user won’t have to pay for anything.
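The retry-and-bill logic described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `fetch` is a hypothetical stand-in for the real network request, the user agent and proxy values are placeholders, and attempts run one after another here, whereas a real BaaS would run them in parallel.

```python
import itertools

# Placeholder identities; a real service would draw from far larger pools.
USER_AGENTS = ["ua-1", "ua-2", "ua-3"]
PROXIES = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]


def scrape_with_retries(url, fetch, max_attempts=12):
    """Try new identities until one succeeds or the attempt budget runs out.

    `fetch(url, user_agent, proxy)` is supplied by the caller and returns
    the page content, or None when the request is blocked.
    """
    attempts = 0
    identities = itertools.product(USER_AGENTS, PROXIES)
    for user_agent, proxy in itertools.islice(identities, max_attempts):
        attempts += 1
        content = fetch(url, user_agent, proxy)
        if content is not None:
            # Success: the user is billed one call, whatever it took.
            return {"content": content, "attempts": attempts, "billed_calls": 1}
    # Every attempt was blocked: the user is billed nothing.
    return {"content": None, "attempts": attempts, "billed_calls": 0}


def toy_fetch(url, user_agent, proxy):
    """Toy protection that blocks every identity except one proxy."""
    return "<html>product a</html>" if proxy == "proxy-c" else None


result = scrape_with_retries("https://mywebsite.com/product/a", toy_fetch)
print(result["attempts"], result["billed_calls"])

blocked = scrape_with_retries("https://mywebsite.com/product/a", lambda *args: None)
print(blocked["attempts"], blocked["billed_calls"])
```

In this toy run, the first call succeeds on the third identity and is billed once. The second call burns through all twelve identities and costs the user nothing, which is why unsuccessful attempts are the provider's problem rather than the attacker's.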

[Figure: BaaS schema]

BaaS in the Wild

As it gets easier and more cost effective to make advanced bots without being a bot expert, it’s no surprise that DataDome has seen increased BaaS traffic.

The graph below shows an attack conducted by bots operating from a BaaS.

[Figure: BaaS attack graph]

In total, the attack lasted ~19h and consisted of ~15.5M requests distributed across >500K residential proxies. On average, the attack generated ~150K requests every 10 minutes on the customer's servers.
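To see what these figures mean per IP address, here is a quick back-of-the-envelope calculation in Python. It uses only the rounded numbers quoted above, so the results are approximations. Since more than 500K proxies were involved, the per-IP values are upper bounds.

```python
total_requests = 15_500_000  # ~15.5M requests
duration_hours = 19          # ~19h
distinct_ips = 500_000       # >500K proxies, so a lower bound on the IP count

ten_minute_windows = duration_hours * 6
requests_per_window = total_requests / ten_minute_windows
requests_per_ip = total_requests / distinct_ips
requests_per_ip_per_hour = requests_per_ip / duration_hours

print(f"{requests_per_window:,.0f} requests per 10 minutes")
print(f"{requests_per_ip:.0f} requests per IP over the whole attack, at most")
print(f"{requests_per_ip_per_hour:.2f} requests per IP per hour, at most")
```

With these rounded inputs, the overall rate comes out at roughly 136K requests per 10 minutes, in the same range as the ~150K average above. Each IP address, however, sent at most about 31 requests over the whole attack, or fewer than two per hour, far too few for any single IP to stand out by volume alone.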

As shown on the map below, the attack was heavily distributed on IP addresses coming from different countries:

Moreover, the map shows that most requests came from Europe. That’s because the target is a leading European website: the BaaS or the attacker selected IPs located in plausible countries to avoid being blocked too easily.

All requests targeted product pages and product search, using different user-agents:

  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4885.3889 Safari/537.36
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.6615.2377 Safari/537.36
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:93.0) Gecko/20100101 Firefox/93.0
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.76

In addition to randomizing user-agents, a BaaS will also try to use HTTP headers that are consistent with the user-agent chosen. For example, if an outdated Safari browser is chosen, the BaaS will avoid using a header that’s only available on recent Chrome.

Moreover, we noticed that the BaaS automatically selected a consistent Accept-Language HTTP header (i.e., the value of the header is consistent with the location of the proxy, with the website being scraped, or with both).
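To illustrate what "consistent" means here, the following Python sketch shows the kind of check a defender could run on incoming headers. It is a simplified illustration, not DataDome's detection logic. The function names are hypothetical, and the single rule it encodes is an assumption: sec-ch-ua client hint headers are sent by Chromium-based browsers, not by Firefox or Safari. The Accept-Language comparison is a weak signal at best, and `expected_languages` is a hypothetical parameter the caller would fill in per site.

```python
def browser_family(user_agent):
    """Rough family detection. Order matters: Edge and Chrome
    user-agents also contain 'Safari/'."""
    if "Firefox/" in user_agent:
        return "firefox"
    if "Edg/" in user_agent:
        return "edge"
    if "Chrome/" in user_agent:
        return "chrome"
    if "Safari/" in user_agent:
        return "safari"
    return "unknown"


def find_inconsistencies(headers, expected_languages=None):
    """Return a list of reasons why the headers contradict the user-agent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    family = browser_family(lowered.get("user-agent", ""))
    problems = []
    # Assumed rule: only Chromium-based browsers send sec-ch-ua.
    if family in {"firefox", "safari"} and "sec-ch-ua" in lowered:
        problems.append(f"sec-ch-ua sent by a {family} user-agent")
    if expected_languages:
        first = lowered.get("accept-language", "").split(",")[0]
        primary = first.split(";")[0].strip().lower()
        if primary.split("-")[0] not in expected_languages:
            problems.append(f"unexpected Accept-Language: {primary or 'missing'}")
    return problems


sloppy_bot = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) "
                  "Gecko/20100101 Firefox/89.0",
    "Sec-CH-UA": '"Chromium";v="95"',
    "Accept-Language": "en-US,en;q=0.9",
}
print(find_inconsistencies(sloppy_bot, expected_languages={"fr", "de"}))
```

A BaaS that picks headers consistent with its user-agent and proxy location, as observed in this attack, would pass both checks, which is why static header rules alone are not enough against this kind of traffic.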

To fight against heavily distributed scrapers, it’s important to block at the first request. 

When we showed a CAPTCHA or a block page, we were able to collect further signals using JavaScript. The analysis of these signals shows that the bots were based on Puppeteer extra stealth (https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth), a popular open source library that enables users to use an alternate version of Puppeteer with a fingerprint modified to be less easily detectable.

The library also has a plugin to automatically forge CAPTCHAs using CAPTCHA farms. That’s not a surprise to us, since Puppeteer extra stealth is heavily used by BaaSes and other bot developers.

The Takeaway

Now that attackers can easily access BaaS options to simplify scraping without spending time or money on unsuccessful attempts, threats are becoming more frequent and intense. To prevent scrapers from stealing data from your website, heavily distributed attacks must be blocked from the very first request. 

Ideal bot protection not only blocks each attack from the first request, but also makes it easy for you to examine your threats in detail to get a better understanding of where they are coming from and how you can best defend your business and customers. To see your bot traffic on a detailed dashboard, try DataDome free.
