DataDome

Bot Proxy Landscape in 2022

The Proxy Landscape

In this landscape report, we explore an important component of any large-scale bot operation: proxies. What types of proxies do bots use, and what fraction of IP addresses are used as data-center proxies and residential proxies at a given time?

What is an IP address?

An IP (Internet Protocol) address is a set of numbers that can be seen as a device’s address on the internet. It is used to properly route traffic from a device to a website.

But each device doesn’t necessarily have a unique IP address. There can be one user or thousands of users using the same IP address at the same time.

Residential IP addresses provided by ISPs tend to be shared by members of a household, and don’t change frequently. On the other hand, mobile IPs often have thousands of devices with the same address at the same time.

What is a proxy?

A proxy is a program that enables users to change their IP address by routing traffic through someone else’s infrastructure. Proxies can be used by humans for anonymity and privacy purposes, or by malicious bot operators to avoid being blocked.

Indeed, when bot or malicious activity is detected from a given IP address, many online services will block it for a certain duration. To avoid being blocked, bot developers and fraudsters leverage proxies to route their traffic through other IP addresses.
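The time-limited blocking described above can be sketched as follows. This is a minimal illustration, not a description of any particular service: the class name `TemporaryBlocklist`, the `block_seconds` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class TemporaryBlocklist:
    """Blocks an IP address for a fixed duration after it is flagged.

    Hypothetical sketch: real services combine many more signals than the IP.
    """

    def __init__(self, block_seconds: float, clock=time.monotonic):
        self.block_seconds = block_seconds
        self._clock = clock          # injectable so the behavior can be tested
        self._blocked_until = {}     # ip -> time at which the block expires

    def flag(self, ip: str) -> None:
        """Record malicious activity: block the IP from now on."""
        self._blocked_until[ip] = self._clock() + self.block_seconds

    def is_blocked(self, ip: str) -> bool:
        """Return True while the block on this IP has not yet expired."""
        expires = self._blocked_until.get(ip)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._blocked_until[ip]   # block expired, forget the IP
            return False
        return True
```

Because the block is keyed on the IP address alone, a client that shows up from a different address is not affected, which is exactly the weakness proxies exploit.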

The diagram below shows the route taken by an HTTP request, first made without a proxy, and then made with a proxy.

Figure: Route of an HTTP request without and with a proxy.

When a user (or a bot) uses a proxy, the request is forwarded by the proxy to the website/mobile application. Thus, for the website, it looks like the request is coming from the proxy IP address instead of the end user’s IP address.
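A toy model can make this concrete. The sketch below does not perform any network I/O; the `Request`, `Website`, and `Proxy` classes are hypothetical, and the addresses come from ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    source_ip: str   # the address the receiver sees as the sender
    path: str


class Website:
    """Records the source IP of every request it receives."""

    def __init__(self):
        self.seen_ips = []

    def handle(self, request: Request) -> None:
        self.seen_ips.append(request.source_ip)


class Proxy:
    """Forwards requests to an upstream site from its own IP address."""

    def __init__(self, ip: str, upstream: Website):
        self.ip = ip
        self.upstream = upstream

    def handle(self, request: Request) -> None:
        # The proxy re-issues the request, so the upstream sees the proxy's IP.
        self.upstream.handle(Request(source_ip=self.ip, path=request.path))


site = Website()
client_ip = "198.51.100.23"           # documentation address, stands for the end user
proxy = Proxy("203.0.113.10", site)   # documentation address, stands for the proxy

site.handle(Request(client_ip, "/products"))    # direct request
proxy.handle(Request(client_ip, "/products"))   # same request through the proxy

print(site.seen_ips)  # ['198.51.100.23', '203.0.113.10']
```

The website only ever learns the second address for the proxied request, so any IP-based block it applies lands on the proxy rather than on the end user.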

Understanding IPv4 Address Space

In theory, the IPv4 space includes ~4.29 billion IPs (2^32, since an IPv4 address is 32 bits long). After excluding local IPs, there are ~4.22 billion potential public IPv4s.
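The size of the address space can be checked with Python's standard `ipaddress` module. The sketch below only subtracts the three RFC 1918 private ranges; the exact number of public addresses depends on which other reserved blocks (loopback, link-local, multicast, and so on) are also excluded, so it is not expected to reproduce the figure above exactly.

```python
import ipaddress

# An IPv4 address is 32 bits long.
TOTAL_IPV4 = 2 ** 32

# RFC 1918 private ranges. Other reserved blocks are deliberately left out here.
PRIVATE_RANGES = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]


def count_addresses(cidrs):
    """Sum the number of addresses covered by a list of CIDR blocks."""
    return sum(ipaddress.ip_network(cidr).num_addresses for cidr in cidrs)


private_total = count_addresses(PRIVATE_RANGES)

print(f"Total IPv4 addresses:    {TOTAL_IPV4:,}")
print(f"RFC 1918 private:        {private_total:,}")
print(f"Remaining after removal: {TOTAL_IPV4 - private_total:,}")
```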

Among all these IPs, we can distinguish between two classes of IPs:

  1. Data-Center IPs: Located in data centers, the IP addresses of data-center proxies are tied to the data-center provider, such as Amazon, Google Cloud, etc. Having an IP address tied to a data-center provider is uncommon (though not impossible) for human users, so data-center proxy IPs tend to get blocked quickly.
  2. Residential IPs: Residential IPs belong to well-known ISPs, such as AT&T and Comcast. Since these IP addresses are used by legitimate humans on a daily basis, they tend to have a better reputation than data-center IPs, which can enable attackers to avoid being detected quickly.
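As an illustration of the distinction, the sketch below classifies an address by looking it up in a prefix table. The table contents are hypothetical (documentation prefixes standing in for real provider ranges); in practice the category would come from autonomous-system data rather than a hand-written dictionary.

```python
import ipaddress

# Hypothetical prefix table: documentation ranges stand in for real
# data-center and ISP prefixes.
PREFIX_CATEGORIES = {
    "203.0.113.0/24": "data-center",
    "198.51.100.0/24": "residential",
}


def classify_ip(ip: str, table=PREFIX_CATEGORIES) -> str:
    """Return the category of the first prefix that contains the address."""
    address = ipaddress.ip_address(ip)
    for cidr, category in table.items():
        if address in ipaddress.ip_network(cidr):
            return category
    return "unknown"


print(classify_ip("203.0.113.5"))    # data-center
print(classify_ip("198.51.100.77"))  # residential
print(classify_ip("192.0.2.1"))      # unknown
```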

In each category, only a fraction of the IPs are actually used as proxies. The diagram below shows the types of IP addresses in the IPv4 address space.

Proxy Landscape Diagram

The majority of bots leverage residential and data-center proxies to operate at scale. By using both types of proxy, attackers can easily change their IP to avoid being blocked too frequently. An attacker can run bots directly from their own PC or server IP, but this doesn’t scale well: the IP address quickly gets flagged as malicious and blocked.

Residential proxies are higher quality but more expensive. Since they use the same kind of IP addresses as humans, residential proxies enable a bot developer to be blocked less frequently. Data-center proxies cost less, but can be more easily blocked, since they are linked to known autonomous systems that belong to data-center providers.

Moreover, data-center proxy IPs tend to be used only by bots (although some humans also appear on data-center IPs, for example when using a VPN). On the other hand, the majority of residential proxies are IPs used both by bots and humans. In fact, the devices that run the proxy code are typically also used by legitimate human users. That’s because residential proxies are obtained using the following techniques:

  • Mobile SDK/Software SDK
  • Browser Extensions
  • Infected Devices

Fully private residential proxies do exist. These services rent access to IP addresses that belong to well-known ISPs (mostly American) and sub-lease the IPs as proxies. The proxies are located in data centers, but belong to residential autonomous systems (AS).

Estimating the Size of the Proxy Pools

To estimate the size of data-center and residential proxy pools, we use two different approaches:

  • Approach 1: Subscribe to different proxy services and enumerate their IPs.
  • Approach 2: Analyze customer traffic to infer/predict the presence of proxies.

Approach 1 gives us the ground truth. We’re 100% sure an IP address was used as a proxy, since we were the ones responsible for the requests that were routed through the proxy.

Approach 2 uses supervised machine learning and a set of heuristics (with different kinds of signals linked to the IP behavior, the type of fingerprints used on the IP, etc.) to classify whether or not an IP address has been used as a proxy. Approach 2 enables us to be more exhaustive.
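One simple way to combine the two approaches is to take the union of the two sets of IPs. The sketch below assumes exactly that; the function name `estimate_pool` and the shape of its output are hypothetical, and the report does not state how the two sources are merged.

```python
def estimate_pool(ground_truth_ips, model_flagged_ips):
    """Combine enumerated proxy IPs (approach 1) with model-flagged IPs (approach 2).

    Assumption for illustration: the pool estimate is the number of distinct
    IPs seen by either approach.
    """
    ground_truth = set(ground_truth_ips)
    flagged = set(model_flagged_ips)
    return {
        "ground_truth_only": len(ground_truth - flagged),
        "model_only": len(flagged - ground_truth),
        "both": len(ground_truth & flagged),
        "total_distinct": len(ground_truth | flagged),
    }


# Documentation addresses used as sample data.
summary = estimate_pool(
    ["203.0.113.1", "203.0.113.2", "203.0.113.3"],
    ["203.0.113.3", "198.51.100.4"],
)
print(summary)
```

The `both` count is also a useful sanity check: IPs enumerated through a subscription that the model flags independently give a rough sense of how much of the known ground truth the model recovers.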

Disclaimer: The estimate we compute below is a snapshot of the situation. Both data center and residential IP addresses are allocated to different people over time. Thus, the statistics we present below are likely to evolve over time.

Estimating the Pool Size of Data-Center Proxies

To estimate the number of data-center IPs, we leverage the two approaches described above:

  1. Considering all data-center IPs from which we made requests using proxy services we subscribe to.
  2. Considering data-center IPs that have been flagged as proxies by our ML models/heuristics and that were used by malicious bots.

In total, we estimate bots leverage ~5.7M distinct data-center IP proxies over 7 days.

The table below shows the number of distinct data-center IP proxies for the top 6 autonomous systems.

Proxy Landscape - Top AS - Table 1

We observe well-known cloud providers, such as Amazon. We also see less common names, such as:

  • Sprintlink
  • HostRoyale technology
  • M247
  • Cogent

Although some of the names may not be familiar to you, they are autonomous systems frequently used by data-center proxies.
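A per-AS breakdown like the one in the table can be computed by counting distinct IPs per autonomous system. The sketch below assumes the input is a stream of `(ip, as_name)` pairs collected over the observation window; the function name and the sample AS names are hypothetical.

```python
from collections import defaultdict


def top_autonomous_systems(observations, limit):
    """Rank autonomous systems by their number of distinct proxy IPs.

    observations: iterable of (ip, as_name) pairs; the same IP may appear
    many times, once per request.
    """
    ips_by_as = defaultdict(set)
    for ip, as_name in observations:
        ips_by_as[as_name].add(ip)   # sets drop repeated IPs automatically
    counts = {name: len(ips) for name, ips in ips_by_as.items()}
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


sample = [
    ("203.0.113.1", "AS-EXAMPLE-A"),
    ("203.0.113.1", "AS-EXAMPLE-A"),   # repeated request, same IP
    ("203.0.113.2", "AS-EXAMPLE-A"),
    ("198.51.100.9", "AS-EXAMPLE-B"),
]
print(top_autonomous_systems(sample, limit=2))
# [('AS-EXAMPLE-A', 2), ('AS-EXAMPLE-B', 1)]
```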

Estimating the Pool Size of Residential Proxies

As with data-center proxies, we leverage two approaches to estimate the number of residential proxy IPs:

  1. We consider all residential proxy IPs from which we made requests using proxy services we subscribe to.
  2. We consider residential IPs that have been flagged as proxies by our ML models/heuristics and that were used by malicious bots.

In total, we estimate bots leverage ~6.2M distinct residential IP proxies over 7 days.

The table below shows the distinct residential IP proxies for the top 10 autonomous systems.

Proxy Landscape Table 2

We see a lot of American (Comcast, AT&T, Verizon) and European (Orange, Free, Virgin) IPs from well-known providers used as residential proxies by bots to conduct attacks.

In total, we observe more distinct IP addresses used as residential proxies (6.2M) than as data-center proxies (5.7M).

However, note that the volume of requests originating from data-center proxies is roughly 1.8x larger than the volume originating from residential proxies.
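These figures can be combined into two derived numbers: how much more traffic an average data-center proxy IP carries than an average residential one, and what share of the public IPv4 space each pool represents. The sketch below is back-of-the-envelope arithmetic on the rounded figures quoted in this report; the variable names are illustrative, and the combined share assumes the two pools do not overlap.

```python
# Rounded figures quoted in this report.
residential_ips = 6.2e6   # distinct residential proxy IPs over 7 days
datacenter_ips = 5.7e6    # distinct data-center proxy IPs over 7 days
volume_ratio = 1.8        # data-center request volume / residential request volume
public_ipv4 = 4.22e9      # potential public IPv4 addresses

# Requests per distinct IP, data-center relative to residential:
# (V_dc / N_dc) / (V_res / N_res) = (V_dc / V_res) * (N_res / N_dc)
per_ip_ratio = volume_ratio * residential_ips / datacenter_ips

datacenter_share = datacenter_ips / public_ipv4
residential_share = residential_ips / public_ipv4
combined_share = datacenter_share + residential_share  # assumes no overlap

print(f"Requests per IP, data-center vs residential: {per_ip_ratio:.2f}x")
print(f"Data-center proxies: {datacenter_share:.3%} of public IPv4")
print(f"Residential proxies: {residential_share:.3%} of public IPv4")
print(f"Combined:            {combined_share:.3%} of public IPv4")
```

Under these assumptions, an average data-center proxy IP sends close to twice as many requests as an average residential one, and each pool amounts to well under 1% of the public IPv4 space.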

Limitations

The approach we use to estimate the proxy pool size is heavily dependent on:

  1. The proxy providers we subscribe to and their proxy pool.
  2. Our ML models and the heuristics we leverage to flag traffic.
  3. The traffic we observe. 

Note that some IPs on which bots directly operate may not be proxies. For example, someone creating an Amazon VM (virtual machine) can use it directly, not as a proxy. The same holds for residential IPs: someone can run a bot directly on their PC. We try to account for these exceptions in our models and heuristics, but can’t promise perfection.

Moreover, we only talk about the IPv4 address space here. Although we see and handle bots operating from IPv6 IPs, they remain a minority of the malicious traffic, so we didn’t consider them in this article.

Conclusion

While bots operating from data-center IPs generate a larger volume of requests, the number of distinct residential IPs used by bots is larger.

The takeaway? Bot owners are willing to invest in residential proxies, which are more expensive but have better reputations than data-center proxies, because the ROI is high.

Traditional techniques that aim to block data-center IPs are ineffective because bots leverage millions of residential IPs. Moreover, blocking data-center IPs runs the risk of also blocking legitimate human users using VPNs, as well as users behind corporate proxies.
