Whatever the server – Nginx, Varnish, Apache or IIS – log analysis is usually the first step in detecting and blocking bad bots.
Log analysis can be a long, daunting process, but it enables you to isolate any IP showing aggressive or unusual behavior.
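As a minimal sketch of that first step, the script below counts hits per client IP in an access log (assuming the common/combined log format, where the client IP is the first field) and surfaces the most aggressive clients; the threshold is an illustrative assumption to tune per site.

```python
import re
from collections import Counter

# Client IP is the first whitespace-delimited field in common/combined logs.
LOG_LINE = re.compile(r'^(\S+) ')

def top_talkers(log_path, threshold=1000):
    """Return (ip, hit_count) pairs for IPs at or above the threshold."""
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.match(line)
            if match:
                hits[match.group(1)] += 1
    # Any IP above the threshold deserves a closer look.
    return [(ip, count) for ip, count in hits.most_common() if count >= threshold]
```

An IP surfaced this way is a candidate for investigation, not automatic blocking – the sections below explain why.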
Luckily for SysAdmins, new tools have emerged in the past few years to make this task easier and more user-friendly. ELK, an open source stack comprising Elasticsearch for indexing and search, Logstash for log ingestion and processing, and Kibana for visualization, has grown to become a major player in this field.
Although centralizing and analyzing logs is a key best practice for any SysAdmin, most current solutions only let them block nefarious bot activity after the damage has been done.
Many SysAdmins rely on home-made tools or on the well-known Linux-based solution Fail2Ban. Initially developed to block brute-force SSH attacks, Fail2Ban also allows users to ban IPs based on established thresholds, making it ideal for countering massive attacks conducted through a handful of IPs.
But the lack of threshold differentiation can easily become an issue. Legitimate “good bots” used by search engines can generate a large number of queries within a limited timeframe. The only reliable way to authenticate those bots is a reverse DNS lookup – an expensive operation that is difficult to perform in real time.
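A sketch of that verification, using forward-confirmed reverse DNS: resolve the IP to a hostname, check that the hostname belongs to the crawler's official domain, then resolve the hostname back and confirm it yields the same IP. The domain suffixes below are the ones Google and Bing publish for their crawlers; in practice you would cache the results rather than run this inline on every request.

```python
import socket

def is_genuine_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com",
                                             ".search.msn.com")):
    """Forward-confirmed reverse DNS check for search-engine crawlers."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
        if not hostname.endswith(allowed_suffixes):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirm
        return ip in forward_ips
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The forward confirmation matters: anyone can set a reverse DNS record claiming to be `googlebot.com`, but only Google controls the forward zone.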
Moreover, some companies and ISPs can use a single IP for dozens – if not hundreds – of users, which can lead to the unnecessary blocking of legitimate users. Any form of blocking that relies on the IP as the only decisive pattern becomes dangerous, and it is ineffective against content theft and ad fraud.
Beyond the number of hits, IP analysis can provide valuable data:
– IP reputation
– IP location vs. usual website audience
– Nature of the owner (ASN) and range (CIDR blocks): ISP, host, company or organization
– Open ports and hosted services (Web, SSH, FTP)
– Nature of the IP (proxy, anonymous proxy, Tor)
While rarely sufficient by themselves to make an informed decision, such patterns should be used to fuel real-time monitoring algorithms.
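One way to "fuel" such an algorithm is a simple weighted score over the signals listed above. The signal names and weights below are purely illustrative assumptions, not a recommended configuration:

```python
# Each key corresponds to one of the IP-level signals discussed above.
# Weights are illustrative; real systems tune them against labeled traffic.
SIGNAL_WEIGHTS = {
    "bad_reputation": 3,      # IP listed on an abuse blacklist
    "hosting_asn": 2,         # owner is a hosting provider, not an ISP
    "geo_mismatch": 1,        # far outside the site's usual audience
    "open_server_ports": 1,   # IP exposes web/SSH/FTP services
    "anonymizing_proxy": 2,   # known proxy, VPN exit or Tor node
}

def ip_risk_score(observed_signals):
    """Sum the weights of the signals observed for one IP."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in observed_signals)
```

An IP scoring above a tuned threshold would be escalated for behavioral review rather than blocked outright, consistent with the caveat above.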
With every query, the browser unveils its name: the UserAgent. It is a purely declarative element, which makes it impossible to use for whitelisting (a surprising number of “GoogleBots” are actually crawling from AWS). On the other hand, using UserAgents as a blacklisting tool can help block basic bots, which account for approximately 20% of all bad bot activity. Any webserver – Nginx, Varnish or Apache – can define blocking rules based on the UserAgent.
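The logic of such a blocking rule is a simple substring match, sketched below in Python. The blacklist entries are examples of common scraping tools, not an exhaustive list; a webserver would express the same rule natively (for instance an Nginx `if ($http_user_agent ~* ...)` block returning 403).

```python
# Illustrative blacklist of UserAgent fragments used by basic scraping tools.
UA_BLACKLIST = ("python-requests", "scrapy", "curl", "wget", "httpclient")

def blocked_by_user_agent(ua):
    """Return True if the UserAgent matches a blacklisted fragment."""
    ua = ua.lower()
    return any(fragment in ua for fragment in UA_BLACKLIST)
```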
That being said, it is necessary to go further and analyze UserAgent validity.
Some bots use UserAgent generators that sometimes produce invalid combinations (such as IE11 running on Windows XP) – a great way to unmask fraudulent activity. Likewise, some statistics show massive traffic coming from browsers such as IE 5.5 or Netscape – an unlikely feat in 2016.
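Both checks from this paragraph can be sketched as a small validity filter. The rules below are illustrative, not an exhaustive validity database: IE11 identifies itself with the `Trident/7.0` token and never shipped for Windows XP (`Windows NT 5.1`), and long-retired browsers are an unlikely sight today.

```python
import re

def suspicious_user_agent(ua):
    """Flag UserAgent strings with contradictory or anachronistic claims."""
    # IE11 ("Trident/7.0") never existed on Windows XP ("Windows NT 5.1").
    if "Trident/7.0" in ua and "Windows NT 5.1" in ua:
        return True
    # IE 1-5 and Netscape were retired long ago.
    if re.search(r"MSIE [1-5]\.|Netscape", ua):
        return True
    return False
```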
Using cookies or session reconstitution through machine learning, session analysis provides the best insights for optimal bot detection. Analyzing sessions allows us to get as close to the user as possible – and find out whether it’s man or machine. Behavioral analysis at the session level is the most efficient criterion for defining blocking patterns, as most legitimate users have a much greater data consumption than bots.
Once again, exceptions exist. Passionate users can spend countless hours on a single forum thread, or track a product listing for days to follow price changes and incoming comments. Used alone, those patterns are hardly sufficient.
Such behavior is hard to dismiss, as it is extremely close to the bot activity used for price monitoring, programmatic bidding or contest fraud.
Instead of simply blocking such queries, how about pushing a captcha – still a great way to authenticate “compulsive” web users?
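A minimal sketch of session reconstitution without cookies: group timestamped hits by a client key (IP plus UserAgent here; a cookie ID would be more precise) and split a client's history wherever an idle gap exceeds a timeout. The 30-minute timeout and the pages-per-minute ceiling are illustrative assumptions.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session

def build_sessions(hits):
    """hits: iterable of (client_key, unix_timestamp), in any order.
    Returns a list of (client_key, [timestamps]) sessions."""
    by_client = defaultdict(list)
    for key, ts in hits:
        by_client[key].append(ts)
    sessions = []
    for key, stamps in by_client.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > SESSION_TIMEOUT:
                sessions.append((key, current))
                current = []
            current.append(ts)
        sessions.append((key, current))
    return sessions

def looks_automated(session_stamps, max_pages_per_minute=60):
    """A session fetching pages faster than any human reads is suspect."""
    duration = max(session_stamps[-1] - session_stamps[0], 1)
    return len(session_stamps) / (duration / 60) > max_pages_per_minute
```

A session flagged this way would be a natural candidate for the captcha challenge suggested above, rather than an outright block.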
Leveraging big data for real-time protection
Using both cutting-edge technology and human expertise, DataDome has devised a unique detection and identification strategy to help you protect your website from bots. Our goal is simple: to keep your content, your users, your data and your marketing investments as safe as possible.
Once set up, the Dome tracks every hit your website receives, gathering data from each individual user, human or not. This data is compared in real time with complex patterns so as to spot and block the bots. Technical and behavioral elements are used to distinguish real traffic from fraudulent hits.
Technical signs of bot activity
Our proprietary algorithms start by looking at technical data. Every hit on your website bears precious intel that the Dome immediately analyzes, looking for elements specific to bot activity.
User agent: includes information regarding the browser and the technology used to access your website.
IP owner: also gives us vital information – human-generated traffic mostly comes from ISPs and mobile carriers, while bot traffic usually originates from web hosting services.
Geolocation data: a very frequent visitor located outside your geographical market is sometimes a sign of something fishy going on…
Behavioral signs of bot activity
Technical elements are not enough to distinguish bot activity. Behavioral patterns help us understand the visitor’s motivations and spot bots.
Number of hits per IP address: many bots, mostly web scrapers and hacker bots, can crawl thousands of pages in a matter of minutes, looking for relevant content or security flaws.
Crawling speed (hit volume per minute): a bot can scrape and store a page’s worth of content in no time. A unique IP address visiting a large number of pages in little time usually indicates fraudulent activity.
Recurring hits: bots follow strict and precise rules, in terms of visits, crawl frequency, etc.
Hits generating 404 errors: bots looking for security flaws generate random URLs, hoping to detect a breach in the architecture of your website. An IP address generating an unusually large number of 404 pages might be looking for such a flaw.
Cookies: unlike humans, bots are rarely subject to cookie tracking. This means that a returning visitor carrying no cookie information may very well be one of them.
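Two of the signals above – per-IP request rate and the share of 404 responses – can be sketched together in a few lines. The thresholds are illustrative assumptions, and the 20-hit minimum simply avoids flagging an IP on a couple of stray broken links.

```python
from collections import defaultdict

def flag_suspect_ips(records, max_hits_per_minute=120, max_404_ratio=0.5):
    """records: iterable of (ip, unix_timestamp, http_status).
    Returns the set of IPs whose rate or 404 ratio looks automated."""
    stats = defaultdict(lambda: {"first": None, "last": None,
                                 "hits": 0, "not_found": 0})
    for ip, ts, status in records:
        s = stats[ip]
        s["first"] = ts if s["first"] is None else min(s["first"], ts)
        s["last"] = ts if s["last"] is None else max(s["last"], ts)
        s["hits"] += 1
        if status == 404:
            s["not_found"] += 1
    suspects = set()
    for ip, s in stats.items():
        minutes = max((s["last"] - s["first"]) / 60, 1 / 60)
        if s["hits"] / minutes > max_hits_per_minute:
            suspects.add(ip)          # crawling faster than any human
        elif s["hits"] >= 20 and s["not_found"] / s["hits"] > max_404_ratio:
            suspects.add(ip)          # probing random URLs for flaws
    return suspects
```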
Setting up the solution gives you the opportunity to step back from manual supervision tasks, monitor every hit received by your websites in real time, and automatically block any attempt to access your website coming from a bad bot.
Offered as a SaaS solution, DataDome instantly matches every hit against an already massive database (storing more than 2 TB of backlogs), enabling the solution to decide in less than 2 milliseconds whether access to your pages should – or shouldn’t – be granted.
Our solution is an ideal tool for SysAdmins wishing to optimize security and performance on their websites, servers and APIs while managing time and resources more efficiently.
Don’t hesitate to reach out to us to receive additional information regarding our product, or to install our free bot traffic detection software in a matter of seconds to start monitoring bot activity and getting a clearer picture of how bots are already impacting your business.