Storing 50 Million Events Per Second in Elasticsearch: How We Did It

Bot management API

Introduction

DataDome’s SaaS bot management solution is designed to protect customer websites against OWASP automated threats: credential stuffing, layer 7 DDoS attacks, SQL injection, and intensive scraping (which can lead to issues such as ticket scalping). The solution protects all our customers’ vulnerability points (web, mobile apps & APIs) with cutting-edge artificial intelligence technology, providing real-time bot detection and automated blocking decisions.

The DataDome solution uses Apache Flink for real-time event analysis in order to detect new threats, and Elasticsearch to store requests from all our customers’ visitors. We receive these requests from our customers’ web servers, and use them to provide bot traffic statistics, feedback loops, business insights and bot attack details in the user dashboard.

Our cluster stores more than 150TB of data, 15 trillion events in 60 billion documents, spread across 3,000 indexes and 15,000 shards over 80 nodes. Each document stores 250 events.

During peak charge each day, our Elasticsearch cluster writes more than 200,000 documents per second and has a search rate of more than 20,000 requests per second.

Our indexes separate data by day, and we have one index per customer.

Our Sharding Policy Challenge

We provide our customers with up to 30 days of data retention in our Elasticsearch cluster. Initially, our cluster was composed of 30 dedicated servers split into a hot-warm architecture. Our hot layer consisted of 15 servers with CPUs (20 threads and 5 SSD disks in RAID0), while our warm layer consisted of 15 servers with lower CPUs in order to reduce the cost because the data inside the warm layer are less requested.

The last seven days of data are stored in the hot layer, and the rest in the warm layer.

A common best practice is to keep a shard size around 50GB. As described above, we have dedicated indexes for each customer, but our customers do not all have the same workload. Our biggest customers write tens of thousands of documents per second, while our smallest write a few hundred. Furthermore, we had to constantly adapt the number of shards to the evolution of our customers’ traffic, in order to respect this best practice of 50GB per shard.

To resolve this issue, we ran a job each day to update the mapping template and create the index for the next day, with the right number of shards according to the number of hits our customer received the previous day. Since we know that one document indexed weighs around 1,000 bytes, we can predict the number of shards we need for each index to keep the shard size at or below 50 GB.

If we put a document inside a nonexistent index, an index will be created for it. Due to the design of our daily index, this could cause some trouble for our cluster (yellow status, unassigned shards, etc.), because we have a lot of throughput and the primary nodes need to allocate the shards to the nodes at the same time (around midnight). However, a job that creates a new index a day in advance helped us avoid these issues.

In the example below, our index pattern is “index_clientID_date”

With this sharding policy, we had almost 72 shards for our largest customer and only 1 shard for our smaller customers.

With the constant flow of new customers using our solution, our cluster became oversaturated with shard and we experienced a decrease in performance, both in read and write. We also saw a significant increase in load average on our data nodes.

We performed a benchmark for some search and write requests, and found that the more our shards grew during the day, the more our search and write performances decreased. In the evenings, when we have a spike of traffic and the shards are bigger, our Elasticsearch performance was particularly poor.

Whenever a node had trouble and went down, our cluster suffered, because relocating a big index (72 shards of 50GB) costs a lot in write threads, io disk, CPU and bandwidth—especially during writes.

Our benchmark showed that the perfect shard size for us is actually around 20GB, given our use case, our document size, our traffic, our index mapping and our node type. Beyond this size, the performance (both in read and write) starts to decrease.

So how could we keep our shards around 20GB?

By adding more shards to our indexes and making our cluster even more oversaturated with shard? Certainly not!
By using rollover? Yes!

How We Solved the Sharding Policy Challenge

In a few words, a rollover is composed of an alias which receives the requests for both reads and writes. Behind the alias, we have one or many indexes. The read requests are forwarded to all the indexes, while the write requests are forwarded only to the index with the flag “is_write_index” set to true. You can read more about index rollover in Elasticsearch’s official documentation.

Thanks to the rollover, we reduced our shard count by three, as well as the load and CPU consumption on our nodes. Rollover allows us to use fewer shards simultaneously during writes (i.e reduce our load average and CPU usage). The biggest index now has a maximum of one shard on each hot node (so 15 in total), and small and medium indexes can have from one to six shards depending on their workload.

The rollover also helped optimize our read performance by using the cache more efficiently, because each write on an index invalidates the whole cache.

Furthermore, when a node crashes and a lot of shards have to relocate, smaller shards means less time to recover, less bandwidth, and fewer resources consumed.

Last but not least, we applied a “max_size” policy type: each time an index reaches 400GB, a rollover will occur and a new index will be created.

As you can see, a write on “index_10_2019-01-01-000002” will not invalidate the cache of “index_10_2019-01-01-000001”. Also, our indexes are smaller than our shards.

As a result, our shards do not exceed 20GB, and we reduced the load average and optimized the write and read performance as well.

The Hotspot Challenge

With this, had we managed to design the perfect cluster? Unfortunately not.

After some weeks during which our cluster performed very well, it became unstable just after one hot node went down, which we recovered and returned to the cluster.

What happened?

A few hours after we brought the node back to life in our cluster, a lot of indexes triggered a rollover and all new shards went to this node. Only one node received almost all the write traffic, leading to an unbalanced cluster.

Why?

By default, Elasticsearch takes care to balance the number of shards on each node in the same layer (hot or warm). As a result, almost all the new shards got rolled over, even the 14 shards of the big index.

Let’s look at an example that shows how our cluster could become unbalanced. The example date is January 3, 2019. We have set “replica 0” in our index’s settings—remember, we have one index per day and per customer (10, 20, and 30 are our customers’ IDs)—and in this case the balancing is perfect, because the writes are spread evenly across our nodes. We have one shard in write per node.

Now let’s assume that node 3 goes down:

As expected, all shards from node 3 are moved to node 1 and node 2. And now, what will happen if node 3 comes back to life and a rollover occurs just a few seconds later?

Due to the default Elasticsearch shard placement heuristics, we now have all the shards in write on node 3!

How could this be solved? Could we change the heuristic algorithm?

Unfortunately, it wouldn’t have helped. By default, Elasticsearch tries to balance the number of shards per node. Changing this setting could help us to balance the number of shards per index and per node instead of the number of shards per node, but it would only have helped for big indexes which have one shard per node. For the rest of the indexes, which have fewer shards (let’s say 10 shards) than the hot nodes, this setting doesn’t prevent Elasticsearch from putting all the shards on the first ten nodes only.

In this example, we can see 6 indexes :

Index_10_2019-01-03-000001 which is in write.
Index_20_2019-01-03-000001 which is in write.
Index_30_2019-01-03-00001 which is in write.
Index_10_2019-01-02-000001 which is not in write.
Index_20_2019-01-02-000001 which is not in write.
Index_30_2019-01-02-00001 which is not in write.

Even if we try to spread the shards by index and by node instead of just by node, we can find a case where our cluster will be unbalanced.

One solution could be to set the number of shards equal to the number of nodes, but as discussed above, a shard has a cost.

How We Solved the Hotspot Issue

In April 2019, Elasticsearch released version 7.0 which introduced a new feature: index lifecycle management (ILM).

Thanks to this new feature, we are now able to split our data nodes in three layers : hot, warm and cold.

The main benefit of the index lifecycle management feature is that it allows us to move a shard from hot to warm immediately after the rollover of the index. We can keep a hot layer with only in-write indexes, a warm layer with shards for the last seven days of data in read-only, and the cold layer with shards for the last 7 to 30 days of data (or more).

Because the writes represent eighty percent of our activity, we want to have a hot layer with only shards in write. This will allow us to keep a balanced cluster, and we don’t need to worry about the hotspot issue.

How to Set Up ILM

Create a policy.
Link the policy to your template.
Create your index in rollover mode suffixed by “-000001”.

Let’s take a look at the process for an index moving from the hot node to the cold node through the warm node:

Conclusion

Thanks to the rollover and index lifecycle management features, we solved the main performance and stability issues of our Elasticsearch cluster.

Improving our monitoring has allowed us to better understand what is happening inside our cluster. For each index, no matter its size, we now have shards with no more than 25GB of data on each. Having smaller shards also enables better rebalancing and relocation when needed.

We avoid hotspot issues because our hot layer only has shards in write, and the hot, warm and cold architecture improves our cache utilization for read requests.

That being said, the cluster is still not perfect. We still need to:

Avoid hotspots inside the warm and cold layers (even if our main concern is the scalability for write operations).
Closely monitor our jobs creating the rollover aliases.
Create our indexes one day before using them.

So what’s next? Continuing to scale! As more and more security professionals realize the need for behavioral detection and real-time bot protection software, the volume of requests we process is growing exponentially. Our next milestone is therefore to be able to process 500 million write requests per second. Stay tuned!