DataDome

How to Use Your Robots.txt to (Even Partially) Block Bots From Crawling Your Site

Search engines use automated programs called robots, or bots for short, to gather information from websites. The information they collect is stored in an index, a database that helps the search engine quickly retrieve relevant pages.

When a user inputs a search engine query, the search engine retrieves results from its indexed database. Algorithms determine the relevance of the results by evaluating factors such as keyword matches, page quality, and user engagement metrics. Indexing ensures that the search engine can deliver accurate and fast results based on its analysis of the crawled web pages.

Search engine bots (also known as web crawlers) check a site’s robots.txt file to determine which pages they are allowed to access and index. If the robots.txt file lacks clear instructions, the web crawler may crawl and index every page. This can harm the user experience and the site’s SEO performance in several ways:

  • Low-priority pages, such as login pages or terms and conditions, may achieve higher search engine rankings than high-value content, such as blog pages or your home page.
  • Users may not see the most relevant content in their search results. For example, if the site does not block web crawlers from accessing outdated pages, users might see them instead of the most current pages.
  • Duplicate content can appear in searches. For example, a test page in the sitemap that mirrors another page may be indexed.
  • Without a “disallow” instruction in the robots.txt file, web crawlers can overload servers by crawling unnecessary pages, causing performance issues for users.

What is a robots.txt file?

The robots.txt file is a simple text file located in the root directory of a website domain. It provides instructions that guide search engine web crawlers on how to interact with the website’s pages. The directives in the robots.txt file apply to all pages on a site, including HTML, PDF, and other non-media formats indexed by search engines.

Directing the search engine bots to relevant pages is a crucial aspect of search engine optimization (SEO). Doing so makes sure that only high-quality, up-to-date pages are indexed and can be ranked in search results. Pages that are not indexed are harder for users to find, since search engines won’t link them to user queries via keywords.

For example, stopping a web crawler from indexing a page about an out-of-date offer will lower that page’s ranking in search engine results.

The “disallow” directive in the robots.txt file is used to block specific web crawlers from accessing designated pages or sections of a website.

Optimizing robots.txt with the “disallow” directive can also help reduce the load on a website’s server. When web crawlers access a website too frequently, or all at once, they can generate a large number of requests in a short period. This can put significant strain on a server’s capacity.

Crawling resource-intensive pages such as videos, high-resolution images, or pages that update data in real time also puts added load on the server. Directing crawlers away from resource-intensive pages preserves the server’s processing capacity. This results in faster page loading, more responsive user interactions, and improved efficiency in managing dynamic elements like databases.

It’s important to note that this measure should only be applied if heavy web crawler traffic is causing slow performance for users. If web crawler traffic isn’t slowing down site performance, it’s not necessary to restrict access to these pages.

Here is an example of a simple robots.txt file using the “disallow” directive:
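The listing that belonged here is missing from the text. The sketch below is a minimal reconstruction, inferred from the description that follows rather than copied from the original. It holds the rules in a Python string and checks them with the standard library's urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# Rules inferred from the surrounding description (an assumption,
# not the article's original listing).
SIMPLE_RULES = """\
User-agent: Googlebot
Disallow: /nogooglebot/
"""

simple_parser = RobotFileParser()
simple_parser.parse(SIMPLE_RULES.splitlines())

# Googlebot is blocked under /nogooglebot/ but allowed elsewhere.
googlebot_blocked = not simple_parser.can_fetch(
    "Googlebot", "https://example.com/nogooglebot/page.html")
googlebot_home_ok = simple_parser.can_fetch(
    "Googlebot", "https://example.com/")

# Other crawlers are unaffected by a Googlebot-only group.
other_bot_ok = simple_parser.can_fetch(
    "Bingbot", "https://example.com/nogooglebot/page.html")
```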

In this example, the robots.txt file is blocking Googlebot (the user-agent) from accessing URLs that begin with https://example.com/nogooglebot/.

A slightly more complex robots.txt file might look like this:
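The original listing is missing here as well. The sketch below is one file that would fit the description that follows. The description does not say which paths Googlebot and Bingbot are blocked from, so blocking the whole site for them is an assumption:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical reconstruction: the Googlebot/Bingbot group blocking the
# entire site ("Disallow: /") is an assumption, not the original listing.
COMPLEX_RULES = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /

User-agent: *
Disallow: /private.html
Disallow: /special-offers.html
"""

complex_parser = RobotFileParser()
complex_parser.parse(COMPLEX_RULES.splitlines())

# The named crawlers match the first group and are blocked everywhere.
googlebot_blog = complex_parser.can_fetch("Googlebot", "https://example.com/blog/")
bingbot_blog = complex_parser.can_fetch("Bingbot", "https://example.com/blog/")

# Every other crawler falls through to the "*" group.
other_private = complex_parser.can_fetch(
    "DuckDuckBot", "https://example.com/private.html")
other_blog = complex_parser.can_fetch("DuckDuckBot", "https://example.com/blog/")
```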

In this example, robots.txt is blocking the user-agents Googlebot and Bingbot. It also disallows all web crawlers’ access to any pages with the strings /private.html or /special-offers.html. The asterisk character * acts as a wildcard in this case.

Good to know: What is a * wildcard?

A wildcard is a character that represents one or more unspecified characters in a search or pattern. In this case, the asterisk stands for every web crawler, so all of them are blocked from crawling /private.html and /special-offers.html.

In some cases, robots.txt can be configured with the crawl-delay directive. Crawl-delay limits how often a bot can visit a site and request pages to index. It keeps bots from overwhelming a site that has limited server resources or many resource-intensive pages, helping the server handle the traffic without slowing down or crashing. To implement crawl-delay, add ‘Crawl-delay: 10’ to the robots.txt file. The number specifies the delay in seconds between bot requests. Not every crawler honors this directive; Googlebot, for example, ignores it.
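As a rough illustration of how a well-behaved crawler might read this directive, the sketch below parses a crawl-delay rule with Python's urllib.robotparser. The crawler name ExampleBot is hypothetical:

```python
from urllib.robotparser import RobotFileParser

CRAWL_DELAY_RULES = """\
User-agent: *
Crawl-delay: 10
"""

delay_parser = RobotFileParser()
delay_parser.parse(CRAWL_DELAY_RULES.splitlines())

# crawl_delay() returns the delay in seconds that applies to the given
# user agent, or None when no Crawl-delay line applies to it.
# "ExampleBot" is a made-up crawler name.
delay_seconds = delay_parser.crawl_delay("ExampleBot")
```

A polite crawler would then wait that many seconds between requests, for example with time.sleep(delay_seconds).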

How to format and what to include in a robots.txt file

There are two main terms to be aware of when configuring a robots.txt file:

  • User-Agents: User-agents are simply the names that web crawler bots use to identify themselves. To block an indexing robot such as Googlebot or Bingbot, put its user-agent name in the user-agent line of your robots.txt, as in the Googlebot example above.
  • Allow and Disallow: The ‘allow’ and ‘disallow’ directives describe which specific pages in the XML sitemap bots can and cannot crawl. File names and paths specified in these directives are case-sensitive. Using a lone forward slash (/) as the path makes the directive apply to the entire site.
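The two terms above can be sketched together. The example below assumes a hypothetical /public/ directory that stays crawlable while a lone forward slash blocks the rest of the site. Python's urllib.robotparser applies the first matching rule in file order, whereas RFC 9309 crawlers pick the most specific match, so the Allow line is placed first to give the same result under both:

```python
from urllib.robotparser import RobotFileParser

# /public/ and ExampleBot are hypothetical names used for illustration.
ALLOW_DISALLOW_RULES = """\
User-agent: *
Allow: /public/
Disallow: /
"""

site_parser = RobotFileParser()
site_parser.parse(ALLOW_DISALLOW_RULES.splitlines())

public_ok = site_parser.can_fetch(
    "ExampleBot", "https://example.com/public/guide.html")
# "Disallow: /" matches every other path, including the home page.
account_ok = site_parser.can_fetch("ExampleBot", "https://example.com/account/")
home_ok = site_parser.can_fetch("ExampleBot", "https://example.com/")
```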

To block a specific URL, use the disallow directive as below:
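The listing that belonged here is missing from the text. A minimal sketch of such a rule, using a hypothetical path /old-offer.html and checked with Python's urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# /old-offer.html is a hypothetical path chosen for illustration.
URL_RULES = """\
User-agent: *
Disallow: /old-offer.html
"""

url_parser = RobotFileParser()
url_parser.parse(URL_RULES.splitlines())

old_offer_ok = url_parser.can_fetch(
    "ExampleBot", "https://example.com/old-offer.html")
new_offer_ok = url_parser.can_fetch(
    "ExampleBot", "https://example.com/new-offer.html")
```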

To block specific files, you must specify the file path. In this case, a PDF file:
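The original listing is not present here. The sketch below assumes a hypothetical file at /downloads/pricing.pdf and shows that only that file is blocked, while a sibling file in the same directory stays crawlable:

```python
from urllib.robotparser import RobotFileParser

# The /downloads/ paths are hypothetical.
FILE_RULES = """\
User-agent: *
Disallow: /downloads/pricing.pdf
"""

file_parser = RobotFileParser()
file_parser.parse(FILE_RULES.splitlines())

pricing_pdf_ok = file_parser.can_fetch(
    "ExampleBot", "https://example.com/downloads/pricing.pdf")
brochure_pdf_ok = file_parser.can_fetch(
    "ExampleBot", "https://example.com/downloads/brochure.pdf")
```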

To block specific user-agents, make sure to target the bots by user-agent name:
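The listing is missing at this point too. A hedged sketch, assuming a hypothetical /internal/ section that only Bingbot should stay out of:

```python
from urllib.robotparser import RobotFileParser

# /internal/ is a hypothetical path; only the Bingbot group is defined.
AGENT_RULES = """\
User-agent: Bingbot
Disallow: /internal/
"""

agent_parser = RobotFileParser()
agent_parser.parse(AGENT_RULES.splitlines())

bingbot_internal_ok = agent_parser.can_fetch(
    "Bingbot", "https://example.com/internal/report.html")
# Crawlers that are not named in any group are not restricted.
googlebot_internal_ok = agent_parser.can_fetch(
    "Googlebot", "https://example.com/internal/report.html")
```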

It is also possible to block multiple bots or URLs using the asterisk character:
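The original listing is missing. In the user-agent line, a lone * addresses every crawler. In rule paths, RFC 9309 describes two special characters: * matches any run of characters and a trailing $ anchors the end of the URL. Python's urllib.robotparser compares rule paths as plain prefixes, so the sketch below uses a small hand-written matcher (the function name rule_matches is hypothetical) to show how a pattern such as /*.pdf$ would be interpreted:

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    '*' matches any run of characters; a trailing '$' requires the
    pattern to match through the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*'
    # so that every '*' in the pattern becomes a regex wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    # Without '$', the pattern only has to match a prefix of the path.
    return re.match(regex, path) is not None


pdf_blocked = rule_matches("/*.pdf$", "/files/report.pdf")
pdf_with_query = rule_matches("/*.pdf$", "/files/report.pdf?download=1")
private_prefix = rule_matches("/private*/", "/private-area/page.html")
unrelated = rule_matches("/private", "/public/page.html")
```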

Instructing robots not to access specific web pages doesn’t remove them or stop users from accessing them. Web crawlers can follow external links to index pages, even if the content has been blocked from direct crawling on the original website. For example, if an external blog has linked to an old blog page on your site, the crawler can follow the link.

The robots.txt file is also publicly accessible, so any user can see what content is being restricted, although they cannot change the restrictions. For these reasons, robots.txt is not the best way to hide sensitive information from the public.

It’s advisable to add a noindex meta tag and password protection to any pages you want to keep completely private, for example, admin panels or user account pages.

The noindex tag is placed within the HTML <head> section and explicitly tells search engines not to index the content of that page.
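The tag takes the form <meta name="robots" content="noindex">. As a rough way to confirm that a page actually carries it, the sketch below scans HTML with Python's standard html.parser. The class and function names, and the sample page, are hypothetical:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # content may hold several comma-separated directives.
            for item in attributes.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page used only for illustration.
PRIVATE_PAGE = """<html><head>
<meta name="robots" content="noindex">
<title>Account settings</title>
</head><body>...</body></html>"""

private_page_noindex = has_noindex(PRIVATE_PAGE)
plain_page_noindex = has_noindex("<html><head><title>Blog</title></head></html>")
```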

Understanding the top four search engine bots

The top four search engine bots are:

  • Googlebot (Google)
  • Bingbot (Bing)
  • Slurp (Yahoo)
  • DuckDuckBot (DuckDuckGo)

Each one of these bots has a different way of reading and respecting the rules outlined in a website’s robots.txt file.

Googlebot, for instance, will follow the rules outlined in a robots.txt file but it is programmed to have a high frequency of crawling activity. Googlebot may ignore directives if there are minor formatting errors or issues in the robots.txt file.

As an example, Disallow: /private/ might not work if the directory is listed as /Private/ in the URL, as the file paths are case-sensitive in some systems. Googlebot is designed to index as much content as it can. This means that even if Googlebot has been restricted in a robots.txt file, it may still find and index pages via external links from other websites or cached versions.
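The case-sensitivity point can be checked directly. In this sketch, which uses Python's urllib.robotparser and a hypothetical crawler name, the rule blocks /private/ but leaves /Private/ crawlable:

```python
from urllib.robotparser import RobotFileParser

CASE_RULES = """\
User-agent: *
Disallow: /private/
"""

case_parser = RobotFileParser()
case_parser.parse(CASE_RULES.splitlines())

# Paths are compared character for character, so case matters.
lowercase_ok = case_parser.can_fetch(
    "ExampleBot", "https://example.com/private/report.html")
capitalized_ok = case_parser.can_fetch(
    "ExampleBot", "https://example.com/Private/report.html")
```

If both spellings exist on the site, each one needs its own Disallow line.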

Bingbot also follows the directives in robots.txt but is slightly more lenient when dealing with minor errors. However, if the directives are not properly formatted, Bingbot might still index those pages despite instructions to ignore them.

The Yahoo Slurp bot is generally considered less strict in following directives in the robots.txt file. In most cases, it will avoid pages restricted by the disallow directive.

The DuckDuckBot is usually respectful of directives outlined in the robots.txt file and does not vary its behavior or make exceptions.

How to create a robots.txt file

Many content management programs like Wix or WordPress automatically create a robots.txt file. However, the default directives in these will not be customized to every individual page or piece of content. You may still need to manually customize the robots.txt file.

To do so, you can use common text editors like Notepad or TextEdit. It’s important not to use a word processor, as this can introduce unexpected characters that corrupt the file and cause it to malfunction.

Follow the rules below when creating a robots.txt file:

  • The file must be named robots.txt
  • A site can have only one robots.txt file
  • The robots.txt file must be saved with UTF-8 encoding

Where to upload the robots.txt file

Upload the robots.txt file to your website’s root directory. The root directory is the top-level folder that contains all other files. If your website is called www.website.com then upload your robots.txt file as www.website.com/robots.txt.
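Because the file always sits at the root, its address can be derived from any page URL on the same host. A small sketch (the helper name robots_txt_url is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


derived_url = robots_txt_url("https://www.website.com/blog/post.html?id=1")
```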

All domain hosting sites have different server architectures and ways of uploading robots.txt files. Your domain hosting provider will be able to provide you with exact instructions.

How to check and verify your robots.txt file

Open a private browsing window and navigate to your robots.txt file, for example by entering www.website.com/robots.txt in the address bar. If it doesn’t appear, you may need to check with your domain provider to see if the file was uploaded correctly.

You can verify the robots.txt file via the Google Search Console. The Google Robot Testing Tool (known as the robots.txt tester) will let you test your file to verify that it is blocking the right content from web crawlers. It’s important to note that the Google Robot Testing Tool is a simulation tool only. Any changes made won’t be reflected in your actual robots.txt file.

The Google Robot Testing Tool only tests against Googlebot and other Google-related bots. It’s advisable to use another tool to test other bots like Bingbot. There are numerous robots.txt tester tools available online.
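For a quick local check against several bots at once, a short script can stand in for an online tester. The sketch below uses Python's urllib.robotparser; the helper name check_access and the sample rules are hypothetical, and real crawlers may interpret edge cases differently from this parser:

```python
from urllib.robotparser import RobotFileParser


def check_access(robots_text, user_agents, urls):
    """Map each (user agent, URL) pair to True if crawling is allowed."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in user_agents
        for url in urls
    }


# Hypothetical rules and URLs used only for illustration.
SAMPLE_RULES = """\
User-agent: Bingbot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""

DRAFT_URL = "https://www.website.com/drafts/post.html"
ADMIN_URL = "https://www.website.com/admin/"

report = check_access(SAMPLE_RULES, ["Googlebot", "Bingbot"],
                      [DRAFT_URL, ADMIN_URL])
```

Note that Bingbot, having its own group, follows only that group, so the /admin/ rule in the * group does not apply to it. To test a live site instead of a string, RobotFileParser also offers set_url() and read(), which fetch the file over the network.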

Blocking AI crawler bots with your robots.txt file

Artificial intelligence (AI) bots crawl websites for training data. Many website owners now update their robots.txt files to block AI bots from accessing proprietary or sensitive content. Many people object to their data being used to train large language models (LLMs) for ethical reasons.

Disallowing robots from AI companies is much the same as disallowing robots from search engines. All you need to know is the name of the user agent.

In this example, the disallow directive is used to block the OpenAI bot from ChatGPT:
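The original listing is missing. A minimal sketch, assuming the whole site should be off limits to GPTBot (the user-agent name OpenAI publishes for its crawler):

```python
from urllib.robotparser import RobotFileParser

# Blocking the entire site ("Disallow: /") is an assumption; the
# original listing may have targeted a narrower path.
GPTBOT_RULES = """\
User-agent: GPTBot
Disallow: /
"""

gptbot_parser = RobotFileParser()
gptbot_parser.parse(GPTBOT_RULES.splitlines())

gptbot_ok = gptbot_parser.can_fetch("GPTBot", "https://example.com/blog/")
# Search engine crawlers are not affected by the GPTBot group.
googlebot_ok = gptbot_parser.can_fetch("Googlebot", "https://example.com/blog/")
```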

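Wildcards in the user-agent line are less reliable. The robots.txt standard defines only a lone * (all crawlers) there, so a pattern meant to match every bot with AI in its name may be ignored. Listing each AI crawler by name is safer: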
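The sketch below lists AI crawlers by name rather than relying on a pattern in the user-agent line, since RFC 9309 defines only product tokens and a lone * there. The names are ones their operators have published, but the list is illustrative, not complete, and the helper name build_ai_block is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Illustrative, incomplete list of published AI crawler tokens.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]


def build_ai_block(agents):
    """Build one robots.txt group that blocks the whole site for agents."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


ai_rules = build_ai_block(AI_CRAWLERS)

ai_parser = RobotFileParser()
ai_parser.parse(ai_rules.splitlines())

ai_access = {
    agent: ai_parser.can_fetch(agent, "https://example.com/blog/")
    for agent in AI_CRAWLERS
}
search_access = ai_parser.can_fetch("Googlebot", "https://example.com/blog/")
```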

Be aware that disallowing AI bots via robots.txt only stops the bots from indexing content, not from visiting a page or reading pages. If the bots can access content, for example via external links, they can still analyze it for training purposes to improve algorithms or models. This can occur even if the pages aren’t indexed and the content was never intended to be visible to the public.

To completely stop AI bots from collecting data, you need to use both robots.txt and the noindex tag.

Common robots.txt mistakes

Three of the most common robots.txt mistakes are:

  • Over-blocking: Blocking too many pages restricts a search engine’s ability to crawl and accurately rank your pages.
  • Syntax errors: Typos can be costly. Syntax errors will disrupt your code, causing the robots.txt file to malfunction and become ineffective.
  • Blocking important resources: Blocking CSS or JavaScript files stops web crawlers from being able to read your site properly. These files control the site’s layout and functionality. Without access, crawlers can misinterpret the page’s structure, which can impact indexing and search rankings.
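Some of these mistakes can be caught before upload with a simple check. The sketch below is a rough linter, not a full validator: it flags unrecognized directive names, site-wide blocks, and blocked CSS or JavaScript paths. The function name, the set of recognized directives, and the sample file are assumptions made for illustration:

```python
# Directives this sketch treats as valid (an assumption; crawlers vary).
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "crawl-delay", "sitemap"}


def lint_robots_txt(text):
    """Return (line_number, message) warnings for likely mistakes."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name not in KNOWN_DIRECTIVES:
            warnings.append((number, f"unknown directive '{name}'"))
        elif name == "disallow":
            if value == "/":
                warnings.append((number, "blocks the whole site for this group"))
            elif value.lower().endswith((".css", ".js")):
                warnings.append((number, "blocks a CSS or JavaScript resource"))
    return warnings


# Hypothetical file containing one example of each mistake.
SAMPLE_FILE = """\
User-agent: *
Dissallow: /tmp/
Disallow: /
Disallow: /assets/site.css
"""

lint_warnings = lint_robots_txt(SAMPLE_FILE)
```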

How DataDome can help you block and manage crawlers

While your robots.txt file can, in theory, prevent any unwanted bots from accessing your pages, it offers little real protection against malicious bots, which usually ignore its instructions entirely.

In fact, many bad actors disguise their bots as legitimate crawlers, like Googlebot, to evade detection. Recent DataDome research shows that nearly three out of four fake Googlebots go unnoticed or unblocked by standard defenses.

To truly protect your site from harmful automated traffic, whether it’s content scrapers, scalper bots, or other forms of fraud, you need a robust bot management solution.

That’s where DataDome comes in. Our solution automatically detects and blocks malicious bots in under 2 milliseconds, ensuring real-time protection without disrupting legitimate users. You can also create custom rules to tailor how different types of bot traffic (such as AI bots) are handled, giving you full control over your defenses.

Whether you’re facing a sudden surge in scraper bots or fighting off sophisticated scalpers, DataDome provides the speed and precision you need to block bots before they cause damage.

Robots.txt Disallow FAQs

Can I block AI bots?

AI bots can be stopped by adding their user-agent name to the disallow directive in the robots.txt file.

What happens if I don’t have a robots.txt?

Search engine web crawlers may crawl and index every page on your site. This can result in irrelevant content being indexed, which can negatively impact your page rankings.

What is the difference between robots.txt and meta tags?

Robots.txt controls access to your site at a directory level. Meta tags manage crawling and indexing behavior for individual pages.
