
What is web scraping? How does it work?

Paige Tester, Sr. Content Marketing Manager
6 Mar, 2022

What is web scraping?

Web scraping is the automated extraction of data from web pages by parties other than the website owner. It’s a common practice used for market research, pricing comparisons, and lead generation. Unfortunately, web scraping isn’t always conducted for legitimate purposes.

Cybercriminals use web scrapers, or scraper bots, to mimic regular browsers and access websites by following their hypertext structure. They then extract data according to predefined parameters and store the data in local databases.

While some web scrapers have good intentions, many do not. Because digital processes have become an integral part of the everyday workings of most companies, malicious web scrapers can cause serious damage to businesses.

That’s why we wrote this guide—to present you with in-depth information about web scraping and how you can prevent it.

 


What problems are caused by web scraping?

When attackers conduct web scraping, they deploy automated bots to extract data from your websites, mobile apps, and/or APIs, typically for harmful purposes. This can cause several significant problems that hurt your business, including:

Content Theft

Digital content is increasingly essential to any company that operates online. Whether it’s a blog that establishes you as a thought leader, classified ads that bring in revenue, or product descriptions to accompany your e-commerce pages, content has the power to bring users to your company and turn them into loyal buyers.

The problem with content is that it’s hard to produce, but easy to steal. Consider most media websites today. Their primary way of earning money is ad revenue. But without the right protection, the web scraper of a content aggregator can simply scrape all the news from a media website and display it on their own website.

They don’t need to pay journalists, writers, editors, and so on. All they need is a web scraper to help them plagiarize content.

It’s unfair competition that can lead to a Google penalty for duplicate content, lower SEO rankings, a loss of legitimate traffic, and less ad revenue for the affected media company. But the problem is not unique to media companies.

Scrapers targeted premium footwear brand Kurt Geiger, the largest luxury footwear retailer in Europe. The company’s DevOps Manager noticed that scraper bots were indexing their content, including product descriptions, images, and prices. The bots were aggressive and came in large numbers, often overloading their backend systems and slowing down their website considerably.

After initially trying Nginx’s rate-limiting feature to cap the number of requests coming from a single IP address, the team at Kurt Geiger quickly realized that most web scrapers attack from many different IPs. So, they turned to DataDome for protection. (DataDome’s bot management solution integrated seamlessly with Nginx.)
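For context, the kind of per-IP cap the team tried first can be expressed with Nginx’s limit_req_zone and limit_req directives. A minimal sketch (the zone name, rate, and burst values below are illustrative only, not recommendations):

    # In the http block: track each client IP, allowing at most 10 requests/second.
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

    server {
        location / {
            # Allow short bursts of up to 20 requests, then reject anything faster.
            limit_req zone=per_ip burst=20 nodelay;
        }
    }

As the Kurt Geiger team found, a cap like this only works against traffic concentrated on a few IP addresses.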

Once an allowlist of partners and tools was established, DataDome blocked all malicious bot traffic from the Kurt Geiger website.

Price Scraping

Some web scrapers scrape everything on your website. Others, however, go straight for your prices. The malicious actors behind those web scrapers can then use your pricing information to manually or automatically undercut your prices. This is particularly painful for price-sensitive products such as concert tickets, online electronics, and plane tickets.

Even if you hide your prices behind a form, scrapers can be a problem. Web scrapers are sophisticated enough to fill out forms and scrape gated content. For example, a car company that only displays its prices once you’ve shared which car you want and when you want it can still be fooled by a web scraper. Toyota Prius tomorrow at 8 AM for $10k? Scrape. Hyundai Elantra this Saturday at 11 AM for $8k? Scrape.

Web scrapers can still target prices hidden behind a form.
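To illustrate, a scraper doesn’t need to “see” the form at all; it only has to replicate the HTTP request the form submits. A simplified Python sketch, in which the endpoint, field names, and CSS selector are all hypothetical:

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical quote endpoint and form fields, copied in practice from the
    # browser's network inspector while submitting the form manually.
    resp = requests.post(
        "https://www.example-cars.com/quote",
        data={"model": "prius", "date": "2022-03-07", "time": "08:00"},
        timeout=10,
    )

    # Once the form is submitted, the "gated" price is just HTML to parse.
    soup = BeautifulSoup(resp.text, "html.parser")
    price = soup.select_one(".quote-price")  # hypothetical selector
    print(price.get_text(strip=True) if price else "price not found")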

Example

Price scraping was a real threat to Kuantokusta at one point. Kuantokusta is Portugal’s leading price comparison site, with over three million unique users who compare two million products across 700 different stores. Combined, they form a huge marketplace where users and merchants do business.

In 2015, Kuantokusta created a product called PriceBench that could collect, process, and analyze product data to make the price management process smarter. By paying for PriceBench, merchants could see what their competitors were doing on the Kuantokusta website and adjust their prices accordingly.

The problem was that some merchants had already built web scrapers to scrape their competitors’ prices without permission from Kuantokusta. Whenever a competitor would update the price of a product from $100 to $99, the scraper would notice and immediately update the merchant’s equivalent product’s price to $98.99.

Web scrapers not only undermined the value proposition of PriceBench, but they took up significant bandwidth on the Kuantokusta website. They also caused embarrassing problems across the marketplace when merchants accidentally added the wrong prices.

Kuantokusta partnered with DataDome to stop their scraping problem. They identified how large the problem was, created an allowlist, and configured DataDome to stop all non-allowlisted web scrapers. The result was that the marketplace had much better performance and PriceBench became a viable product paid for by merchants.

Website Performance Issues

Sometimes, web scrapers come in large enough numbers to take up a significant portion of your bandwidth. It’s not uncommon for bots to represent 70% of a website’s traffic. Not only does this skew your analytics and undermine your ability to make data-driven decisions, it also slows down your website. And users hate slow websites.

Web scrapers caused performance issues for TheFork, a TripAdvisor company and the leading online restaurant booking platform in Europe, Australia, and Latin America. Their website offers two types of content:

  • Public content such as restaurant opening hours and menus.
  • Value-added content such as user reviews and table availability.

Web scrapers were stealing both types of content for competitive purposes. Scrapers came in such large numbers that it led to unpredictable traffic peaks and service interruptions for both the website and the mobile app. In turn, TheFork’s hosting and maintenance costs went up.


TheFork tried to block unwanted IP addresses manually, but that was a long and tedious process that didn’t work well enough. So, they turned to DataDome, which integrated with their architecture and allowed TheFork to create custom rules specific to their traffic.

Today, all web scraping has been eliminated from TheFork’s website and app. Their server resources are used for legitimate site traffic alone and they no longer suffer from unpredictable traffic peaks that consume a large amount of bandwidth.

Web Scrapers Drain Your Resources

Beyond draining your bandwidth, there are many other ways web scrapers can drain your resources. Perhaps the most expensive cost is the waste of your employees’ time. We speak to companies where at least one DevOps Engineer spends at least one full day a week focused on bot problems.

The calculations are straightforward:

The average salary of a DevOps Engineer is $100,000. One day a week means 20% of their time. So, the bot problem creates an opportunity cost of at least $20,000 a year for this one employee (without accounting for taxes, insurance, and other employee costs).

The opportunity cost of those work-hours comes on top of the infrastructure, SEO, and site performance costs we already mentioned. And needless to say, the work can be repetitive and demotivating for your employees.

Example

Bot-related problems were taking up a large chunk of time at Brainly, the world’s largest online learning platform for children and young adults. When the CTO noticed that competitors were beating Brainly for particular search results, he discovered that the competitors were sending web scrapers to Brainly’s website to steal their content.

They decided to turn to DataDome’s solution because it was offered as a Cloudflare app, so integration was a breeze. Bot problems used to consume up to half of the CTO’s day (calculate that opportunity cost), but now, he only checks DataDome’s email report in the morning, and that’s it.

DataDome runs on autopilot. No more web scraping problems.

How does web scraping work?

Now that we’ve covered the biggest reasons why web scrapers are dangerous, we’ll dive into how a web scraper actually works. Let’s start by repeating what a web scraper is: An automated bot that extracts information from your websites, mobile apps, and/or APIs.

The first important point we want to cover is that not all web scrapers are bad. In fact, some are essential to the proper functioning of the websites that play a role in our everyday lives. Good ones tend to be called “web crawlers” instead of web scrapers.

Here’s a list of good crawler bots:

  • Search Engine Crawlers: A search engine cannot work without an efficient crawler to discover and index all websites. Examples include Googlebot, the Baidu Spider, Bingbot, and the Yandex bot.
  • Feed Fetcher Crawlers: Bots that grab the RSS or Atom feeds of websites to deliver a feed to their users. Examples include the Google Feedfetcher, Microsoft’s .NET WebClient, and the Android Framework bot.
  • Social Media Crawlers: Crawlers used by social media companies to enrich the information in their users’ feeds. Examples include the Facebook, Instagram, Twitter, and Pinterest crawlers.

None of these crawlers are inherently bad. In fact, there are some that you definitely want crawling your website. For example, blocking the Googlebot is a bad idea if you want people to find you on Google, and blocking the Facebook crawler is a bad idea if you want people to share your content on Facebook.

It’s unlikely that good bots will cause problems. However, you can block a good bot with a robots.txt file, a simple text file in the root directory of your website that tells bots whether they’re allowed to crawl your website, as well as where they can find your XML sitemap.

Let’s imagine you have a WordPress website, you don’t want any crawlers to visit the /wp-admin/ page, and you don’t want the Bingbot to crawl your website. This is what your robots.txt file would look like:
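(The sitemap URL below is just a placeholder for your own.)

    # Keep all crawlers out of the WordPress admin area.
    User-agent: *
    Disallow: /wp-admin/

    # Block Bingbot from the entire site.
    User-agent: Bingbot
    Disallow: /

    # Point well-behaved crawlers at your XML sitemap (placeholder URL).
    Sitemap: https://www.example.com/sitemap.xml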

Blocking Bingbot from crawling your entire website and all bots from crawling /wp-admin/.

The robots.txt file is one of the ways you can tell whether a bot is good or bad. Good bots tend to respect what’s written in the file, while bad bots do not. The robots.txt file doesn’t actually do anything to stop bots, so bad bots just ignore it. It’s like putting up a sign in your yard that says, “Don’t steal from my house.” Thieves don’t care.

The anatomy of a scraping attack:

There are plenty of tutorials and tools that make it easy to build a web scraper. Building a scraper is often a way developers test their programming chops in a particular language.

Therefore, one reason it’s so hard to stop a scraping attack is that one scraper can look very different from another. However, bad scrapers follow the same basic principles (a simplified sketch follows the list below):

  1. A developer writes the script that makes up a scraper, often in Python, Java, Ruby, or Perl. For example, Matthew Gray wrote the Wanderer in Perl. Alternatively, a scraper can be built entirely with off-the-shelf web scraping software.
  2. The developer masks the scraper by either making it look like a good crawler, such as the Googlebot, or by making it look like a human with headless browsers and libraries such as Puppeteer or Playwright.
  3. The developer sets the scraper loose. It targets a URL and its parameter values, then starts scraping and downloading the HTML of the targeted websites, mobile apps, and/or APIs.
  4. Once the scraper has fetched the data, it will parse, search, reformat, and manipulate it in whatever way the scraper was programmed to do. It then pours the manipulated data into its own database or spreadsheet for future analysis and/or abuse.
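To make those steps concrete, here is a deliberately minimal Python sketch of steps 1 through 4. The spoofed user agent, target URL, and CSS selectors are placeholders; real scrapers add headless browsers, proxy rotation, and much more.

    import csv
    import requests                      # step 1: the "scraper" is just a short script
    from bs4 import BeautifulSoup

    # Step 2: mask the scraper, here by borrowing a well-known crawler's user agent.
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    }

    # Step 3: target a URL and its parameter values, then download the HTML.
    # The domain and query string below are placeholders, not a real target.
    url = "https://www.example-shop.com/products?page=1"
    html = requests.get(url, headers=HEADERS, timeout=10).text

    # Step 4: parse, reformat, and store the data for later analysis or abuse.
    # The CSS selectors depend entirely on the target's markup and are hypothetical.
    soup = BeautifulSoup(html, "html.parser")
    rows = [
        (item.select_one(".name").get_text(strip=True),
         item.select_one(".price").get_text(strip=True))
        for item in soup.select(".product")
    ]

    with open("scraped_products.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)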

While anyone can build a scraper, and it’s not hard, using scrapers effectively for bad purposes is a resource-intensive process. That’s why cybercriminals resort to botnets: networks of compromised computers around the world that give them the resources to hit big and small websites alike with powerful, heavy scraping attacks.

Scrapers are increasingly sophisticated.

We’ve already explained why scrapers can look so different from one another and how they can mount such powerful attacks on even the biggest websites. Scrapers are always adapting and evolving.

That’s why in-house solutions intended to stop scrapers from scraping websites often don’t work well enough. Cybercriminals use the latest technologies and are very creative in building scrapers.

For example, our research team discovered that hackers were using the Facebook crawler to attack websites. Many companies allow the Facebook crawler to crawl their websites so users can preview the metadata, such as the page title, meta description, and a thumbnail image, on Facebook.

But Facebook didn’t have strong protective measures for link previews on their API. So hackers placed a token into a Facebook account and used the Facebook crawler to send thousands of requests a minute to the websites they were attacking.

These attacks were technically indistinguishable from the Facebook crawler, but our team noticed that certain parameters were overrepresented in the URL requests. We notified Facebook, and the company has now improved its rate-limiting feature on their Messenger preview API.

Today, the problem on Facebook has disappeared, but it’s a good example of the creativity of cybercriminals and why even the best in-house solutions struggle to keep up.

Is web scraping legal?

Despite the damage it can cause, web scraping isn’t exactly illegal in and of itself. And as long as the information being scraped is publicly available online, it will be tough to prosecute anyone for scraping your site. Plus, it would be impossible to figure out who is behind every web scraper or take all the attackers to court.

For now, it’s safest to assume the law will not protect your proprietary content, pricing data, and the like.

How can you prevent scraping attacks?

So how do you invest in web scraping protection? More precisely, how do you protect yourself without it costing your business an arm and a leg, and without requiring constant maintenance and monitoring?

It starts by understanding which types of content are the most likely victims of a scraping attack:

  • Media Content (Such as News Articles & Blog Posts)
  • Product Descriptions
  • Prices
  • Customer Reviews
  • Coupons
  • Classified Ads

Any strange activity on those pages is a possible indicator of a scraping attack. Other scraping indicators are user accounts with high levels of activity but no purchases, or competitor pages that slightly undercut your prices or copy much of your content.

Common scraping prevention methods are ineffective.

We’ve already explained why a robots.txt file doesn’t stop bad scrapers and why in-house bot management tools struggle to keep up. But there are other ways companies try to stop bots that aren’t very effective on their own:

CAPTCHAs: While traditional CAPTCHAs were a reasonably effective prevention method against bots for many years, CAPTCHAs no longer work alone. Many bots now use CAPTCHA farms that solve CAPTCHAs in real-time for very little money. Bots can also avoid CAPTCHAs altogether by convincing websites they’re human.

Making CAPTCHAs harder won’t work, because the harder the challenges get, the more friction they add for real people (take that, Turing test). False positives (any time a human is shown a CAPTCHA) kill conversions, increase bounce rate, frustrate humans, and harm your user experience (UX), all while being easily solved by CAPTCHA farms.

Stand-alone CAPTCHAs are no longer worth the cost.

Web Application Firewalls (WAFs): A good WAF can block familiar threats, such as malicious user agents and IP addresses. But that’s where WAF protection stops. Anything sophisticated will get through. WAFs are no match for today’s real-time, automated scrapers that mask themselves as human or as “good” crawlers.

In addition, WAFs are usually quite IP-centric, which means they rely on IPs to determine whether something is a threat or not. But scrapers use botnets and can rotate through thousands or even millions of IPs to work around a WAF’s IP filters.
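To illustrate why IP-centric filtering falls short, here is a hypothetical Python sketch of the proxy rotation scrapers typically use. The proxy addresses (documentation-only IP ranges) and target URL are placeholders, and a real botnet rotates through thousands or millions of addresses:

    import itertools
    import requests

    # Hypothetical proxy pool; a botnet gives attackers a far larger one.
    PROXIES = itertools.cycle([
        "http://203.0.113.10:8080",
        "http://198.51.100.22:8080",
        "http://192.0.2.45:8080",
    ])

    for page in range(1, 101):
        proxy = next(PROXIES)
        # Each request leaves from a different IP, so a WAF's per-IP rate limits
        # and blocklists never see enough traffic from one address to trigger.
        resp = requests.get(
            f"https://www.example-shop.com/products?page={page}",  # placeholder target
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        # ...parse and store resp.text as in the earlier sketch...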

Terms and Conditions (T&Cs): Good T&Cs are invaluable for protecting yourself against web scrapers, because they can help you in litigation later. For example, thanks to its T&Cs, Ryanair won a case in the EU’s Court of Justice in 2015 against PR Aviation, which had been scraping Ryanair’s prices without a license agreement permitting it to do so.

The EU Court decided in favor of Ryanair and, more broadly, indicated that European businesses can use their website T&Cs to prohibit anyone from scraping their data.

The problem with T&Cs alone is that they are reactive and resource-intensive. You can’t go to court for every scraper that targets your website, and it’s not always easy to figure out who’s behind a scraper attack. So while you definitely need protective T&Cs on your website, you also need a preventive solution instead of a reactive one.

The most effective protection against scraping attacks:

The best protection from web scraping is a solution that can identify and analyze any visitor’s technical and behavioral parameters in real-time, so it can block malicious scrapers and ticket scalping bots before they start scraping. It needs to work without slowing performance and maintain a smooth experience for human visitors.

That’s what we’ve built at DataDome.

Our bot protection software uses multiple algorithms to protect each specific endpoint against web scraping attacks. Every request to your websites or mobile apps is analyzed and either blocked or authorized in real-time. DataDome has a false positive rate of less than 0.01%, which means that real users (and good crawlers such as the Googlebot) are never blocked.

In addition, DataDome doesn’t require a difficult reconfiguration of your IT infrastructure. Our bot protection software deploys in minutes on any web infrastructure, and we have both server-side modules and client-side SDKs for your Android and iOS apps that make integration very easy.

The modules and the SDKs combined collect various device properties and behavioral information to detect and block even the most sophisticated web scrapers. This is done without collecting any personal or identifiable data, which means you don’t have to worry about any compliance requirements.

Finally, because DataDome is uniquely focused on bot and online fraud protection, the solution is constantly up to date on the latest threats. We have a dedicated threat research team and security operations center (SOC) that stays ahead of hackers and their bots.

Case Study: Protecting Classified Ads From Scraper Bots

LV digital GmbH operates digital products and platforms for the agricultural and rural living sectors. The company was plagued by bad scrapers that were targeting Traktorpool, their main classified ads portal, stealing content and putting it elsewhere online.

The scrapers came in such numbers that it would regularly take LV’s portals offline. So, they tried DataDome’s free threat audit/trial, which showed them that the scrapers were draining around 25% of their infrastructure’s resources. Deciding to move forward with DataDome, they installed the DataDome module for Varnish.

Today, LV’s portals have 100% uptime. The reduction in bot traffic means the company can scale down their data center capacity, which fully offsets the cost of their DataDome subscription.

Prevent Web Scraping With DataDome

Hopefully our guide on web scraping helped you better understand the risks of scraping, and why it’s important to detect and mitigate scraper traffic on your websites, mobile apps, and APIs.

While you may not have noticed any web scraping attacks so far, the rising threat can severely harm your business. Installing a real-time bot protection solution like DataDome will protect you against web scraping without adding friction to your user experience.

“DataDome freed up at least 50% of the time of one of my engineers… That time is much better invested now. We have more bandwidth to work on new features that our customers are looking for and that enable us to do more business.”

– Michael Romer, Head of Product and IT at LV digital

Try DataDome free for 30 days and get a real-time overview with detailed visibility of all automated threats to your mobile apps, websites, and/or APIs. No credit card required. Create a free account here. (No scrapers allowed!)