How to Safeguard Low Latency of a Highly Available Service

Karim Heraud, Technical Product Manager
8 Aug, 2023

As a software as a service (SaaS) product, DataDome’s bot protection relies on integrations between our customers and our APIs—which means low latency is key for service performance. This article tackles how low latency is enforced at DataDome, with a strong focus on the reliability of our architecture.

Distributing Services Around the World

Network communications between a client and a service can be split into three steps:

  1. Locating the target service: DNS resolution.
  2. Initializing the communication channel: Establishing the TCP connection, SSL handshake.
  3. Exchanging content: HTTP requests/responses (or other protocol).
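The first two steps can be timed separately to see where latency goes. The following is a minimal sketch using only the Python standard library; the function name `measure_phases` and the returned keys are illustrative choices, not part of any DataDome tooling, and the third step (the HTTP exchange itself) is left out for brevity.

```python
import socket
import ssl
import time


def measure_phases(host, port=443, use_tls=True, timeout=5.0):
    """Return the time spent, in milliseconds, in each connection phase."""
    timings = {}

    # Step 1: locate the target service (DNS resolution).
    start = time.perf_counter()
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    timings["dns_ms"] = (time.perf_counter() - start) * 1000

    sock = socket.socket(family, socktype, proto)
    try:
        sock.settimeout(timeout)

        # Step 2a: open the communication channel (TCP handshake).
        start = time.perf_counter()
        sock.connect(sockaddr)
        timings["tcp_ms"] = (time.perf_counter() - start) * 1000

        # Step 2b: TLS handshake on top of the TCP connection.
        if use_tls:
            context = ssl.create_default_context()
            start = time.perf_counter()
            sock = context.wrap_socket(sock, server_hostname=host)
            timings["tls_ms"] = (time.perf_counter() - start) * 1000
    finally:
        sock.close()
    return timings
```

Calling `measure_phases("api.datadome.co")` from different places in the world would be one way to observe how the TCP and TLS timings grow with distance, while the DNS timing mostly depends on the DNS provider.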

Good practices on the client or service side reduce the time needed for each step and optimize the total latency of requests:

Optimizations per step, and whether each step is sensitive to the service's location:

DNS resolution. Sensitive to service location: no (it is sensitive to the DNS provider’s location).
  • Fast and distributed DNS provider.
  • Avoid CNAME records (flatten records).
  • Smart time to live (TTL) for records.

Establishing the TCP connection. Sensitive to service location: yes.
  • Leverage keep-alive connections.
  • Test TCP BBR congestion control.

SSL handshake. Sensitive to service location: yes.
  • Short certificate chain.
  • Smart choice of ciphers.
  • Enable TLS session resumption.

Real dialog (content exchange). Sensitive to service location: yes.
  • Content compression.
  • Condensed protocol (HTTP/2, HTTP/3).

We will not cover these technical optimizations in detail in this article.

Most of the communication steps above are very dependent on the target service’s location, due to the time physically required to route packets across the network from client to target server. Travel time is nearly impossible to guarantee and depends on many factors, such as:

  • The quality, length, and congestion of network links.
  • The health of intermediary network equipment.
  • The paths selected by the equipment when routing packets (which depend on load/congestion, route announcements, etc.).

To reduce the travel time, the target service has to be as close as possible to the client—and if you want to serve many clients around the world, your service must be well-distributed to stay close, no matter which client requests it.
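Distance alone sets a floor on travel time, whatever the quality of the network. The sketch below estimates that floor from the great-circle distance between two points. It assumes light travels through optical fiber at roughly two thirds of its speed in a vacuum (about 200,000 km/s); the constant names and the example coordinates are illustrative, and real paths are longer than the great circle, so actual round trips are always slower.

```python
import math

# Assumption: signal speed in optical fiber, roughly two thirds of c.
FIBER_SPEED_KM_PER_S = 200_000.0
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Lower bound on the round-trip time over a fiber of this length."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


# Illustrative example: Paris to São Paulo (approximate coordinates).
distance = great_circle_km(48.8566, 2.3522, -23.5505, -46.6333)
print(f"{distance:.0f} km, at least {min_round_trip_ms(distance):.0f} ms per round trip")
```

Since a TCP connection followed by a TLS handshake costs several round trips before the first request is sent, this floor is paid several times on a new connection, which is why serving a client from a nearby location matters so much.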

In order to be as close as possible to our customers, DataDome relies on 26+ points of presence (PoPs) distributed around the world.

Map of DataDome points of presence.

Routing Clients to the Closest Location

For our bot protection service, DataDome exposes a single DNS endpoint: api.datadome.co. In order to reduce travel time, we need to route each client to its closest location.

The two main options for routing are anycast addressing and geo-DNS routing.

Anycast Addressing

The internet is a mesh of interconnected networks. Each network announces at its edge which IP ranges it can route. When a packet travels to a target IP address, it is routed by in-between “autonomous systems” (“AS”; large networks or groups of networks with a common routing policy) in a way that reduces the number of network hops required to reach the destination. When you announce an IP address in only one location (unicast addressing), you might need many network hops to reach your location.

Unlike unicast addressing, anycast addressing relies on announcing the same IP address from several networks. When a packet is routed between networks to reach this target IP address, it is directed to the location that requires the fewest network hops.

This way, the service can be distributed and travel time can be lowered by reducing the number of networks to be crossed. As an example, the figures below show how the underlying network decides where a client's packets go, depending on network health and hop count.

User calls IP1, which relies on unicast addressing.

User calls IP1, which relies on anycast addressing to reach the closest location.
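The difference between the two addressing modes can be sketched on a toy network. The example below is a deliberate simplification: it treats routing as a plain fewest-hops search, whereas real BGP route selection also depends on each network's policies. The topology and every AS name in it are made up for illustration.

```python
from collections import deque


def nearest_origin(graph, client_as, origins):
    """Breadth-first search from the client's network.

    Returns (origin, hops) for the announcing network reachable in the
    fewest hops, or None if no announcing network is reachable.
    """
    seen = {client_as}
    queue = deque([(client_as, 0)])
    while queue:
        node, hops = queue.popleft()
        if node in origins:
            return node, hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


# Hypothetical topology: each key is a network, each value its neighbors.
TOPOLOGY = {
    "AS-client": ["AS-1"],
    "AS-1": ["AS-client", "AS-2", "AS-paris"],
    "AS-2": ["AS-1", "AS-3"],
    "AS-3": ["AS-2", "AS-virginia"],
    "AS-paris": ["AS-1"],
    "AS-virginia": ["AS-3"],
}

# Unicast: the IP address is announced from a single location.
print(nearest_origin(TOPOLOGY, "AS-client", {"AS-virginia"}))

# Anycast: the same IP address is announced from two locations.
print(nearest_origin(TOPOLOGY, "AS-client", {"AS-virginia", "AS-paris"}))
```

With unicast, the client in this toy network has to cross four hops to reach the only announcing location; with anycast, the same address is reached in two hops at the nearer one.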

Geo-DNS Routing

Geo-DNS routing leverages a very different mechanism. Each of your locations has a different IP address. The target location is decided at the very beginning, when the client resolves the service's IP address through DNS, rather than while the request is routed across networks. The resolver handling the DNS query can answer differently depending on the location of the query’s source IP address.

Geo-DNS relies on this principle. If your DNS request originates from France, you will be resolved to an IP address deployed in France. If your DNS query originates from Brazil, you will be resolved to an IP address deployed in Brazil.

To use geo-DNS routing, you need to configure your DNS resolver rules (response per country, specific IP range exclusion, sticky session, round-robin, etc.). Then, each DNS request can be resolved based on those rules. You can also add a health check mechanism, load assessment, maintenance windows, etc. to your DNS resolver rules.
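As a rough illustration of per-country rules, the sketch below maps the source IP of a DNS query to a country, then to the IP address of a location. Everything in it is hypothetical: the prefixes are documentation-only ranges (RFC 5737), the rule table is invented, and a real geo-DNS provider would use a full IP geolocation database and richer rules than a single answer per country.

```python
import ipaddress

# Hypothetical geolocation data: source prefix -> country code.
GEOIP = {
    ipaddress.ip_network("192.0.2.0/24"): "FR",
    ipaddress.ip_network("198.51.100.0/24"): "BR",
}

# Hypothetical resolver rules: country code -> IP address of a location.
RULES = {
    "FR": "203.0.113.10",  # location deployed in France
    "BR": "203.0.113.20",  # location deployed in Brazil
}

# Answer returned when the source country is unknown or has no rule.
DEFAULT_ANSWER = "203.0.113.30"


def country_of(source_ip):
    """Return the country code for a source IP, or None if unknown."""
    address = ipaddress.ip_address(source_ip)
    for network, country in GEOIP.items():
        if address in network:
            return country
    return None


def resolve(source_ip):
    """Pick the DNS answer based on where the query comes from."""
    return RULES.get(country_of(source_ip), DEFAULT_ANSWER)


print(resolve("192.0.2.55"))    # query coming from the "FR" prefix
print(resolve("198.51.100.7"))  # query coming from the "BR" prefix
```

Health checks, load assessment, or maintenance windows would then act on the rule table, for example by temporarily replacing the answer for a country.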

At DataDome, we rely on geo-DNS to route requests because it is more flexible than relying on anycast.

  • The quality of networks and global travel time are easily monitored with HTTP probes, which update DNS resolver rules in real time. With anycast addressing, we would need to rely on the IP range announcements made by each network (BGP advertisements) to keep routing optimized at all times, and we have very little control over those.
  • In addition, if we face a temporary outage at a target location, the probes driving the DNS resolvers quickly route traffic elsewhere. With BGP advertisements, we would depend on network updates and on new routes being calculated for a fallback option. Here again, we would not have enough control.
  • Geo-DNS allows us to enforce country regulation and isolation more easily.

On the other hand, geo-DNS requires more work to map our services. We test point-to-point latency and estimate network performance and peerings to route every client efficiently.
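One simple way to turn point-to-point latency tests into a routing map is to pick, for each client area, the location with the lowest median latency. The sketch below shows that idea only; the measurements, area codes, and location names are invented, and it does not describe how DataDome actually builds its mapping.

```python
from statistics import median

# Hypothetical latency samples in milliseconds:
# client area -> location -> measurements.
SAMPLES = {
    "FR": {"paris": [8, 9, 12], "frankfurt": [15, 16, 18]},
    "PL": {"paris": [30, 33, 29], "frankfurt": [18, 20, 45]},
}


def build_mapping(samples):
    """Map each client area to the location with the lowest median latency."""
    mapping = {}
    for area, per_location in samples.items():
        mapping[area] = min(
            per_location, key=lambda location: median(per_location[location])
        )
    return mapping


print(build_mapping(SAMPLES))
```

The median is used here rather than the mean so that a single slow measurement (such as the 45 ms sample above) does not change the decision.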

Geo-DNS mapping applied in Europe to route customers to the closest point of presence.

Failover & Availability

Unfortunately, at one point or another, all networks fail and all systems crash.

RAID configurations, cross-availability-zone clusters, hot-cold workloads, etc. are key in case of an outage in part of a region, but they won’t cover a situation in which a whole region or network is down. Maintaining a highly available service is not only a matter of high-availability architectures in every location: when a target location becomes unresponsive or slow to query, customer traffic must also be routed quickly to another healthy location.

Relying on geo-DNS gives DataDome control over routing during such events. The DNS resolver maintains a health map of regions and points of presence, which is updated in real time to remove PoPs that become unhealthy. When a PoP faces degraded performance, the resolver stops resolving to it and falls back to the closest healthy location.
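The fallback logic can be sketched as a health map fed by probe results. In the sketch below, a PoP is considered unhealthy after a number of consecutive failed probes; that threshold, the class name `HealthMap`, and the PoP names are assumptions made for illustration, not a description of DataDome's resolver.

```python
class HealthMap:
    """Track probe results per PoP and resolve to the closest healthy one."""

    def __init__(self, pops, failure_threshold=3):
        # Assumption: a PoP is unhealthy after this many failures in a row.
        self.failure_threshold = failure_threshold
        self.failures = {pop: 0 for pop in pops}

    def record(self, pop, probe_ok):
        """Update the map with the result of one health probe."""
        self.failures[pop] = 0 if probe_ok else self.failures[pop] + 1

    def is_healthy(self, pop):
        return self.failures[pop] < self.failure_threshold

    def resolve(self, preferences):
        """preferences: PoPs ordered from closest to farthest for a client."""
        for pop in preferences:
            if self.is_healthy(pop):
                return pop
        return None  # no healthy PoP left


health = HealthMap(["eu-west-1", "eu-central-1"])
preferences = ["eu-west-1", "eu-central-1"]  # closest first

for _ in range(3):
    health.record("eu-west-1", probe_ok=False)
print(health.resolve(preferences))  # falls back to the next closest PoP

health.record("eu-west-1", probe_ok=True)
print(health.resolve(preferences))  # traffic returns once the PoP recovers
```

Requiring several consecutive failures before removing a PoP is one way to avoid moving traffic on a single lost probe, at the cost of a slightly slower reaction.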

Still, this DNS mechanism comes with a drawback: the caching of DNS responses. Caching is defined by a time to live (TTL) setting applied per DNS record. The service provider defines this value, and it is then applied on the client side. From a client perspective, caching reduces the time taken for network communication, because the client does not have to wait a full DNS round trip before querying a target IP. On the other hand, caching also limits how often the target location (IP) of a service can be updated. When location A (IP A) fails, clients relying on it have to wait until their next DNS refresh to be redirected to location B, where they stay until A is healthy again.
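The effect of caching on failover can be reproduced with a small simulation. The sketch below models a client that keeps a DNS answer until its TTL expires; the class name `CachingClient`, the 30-second TTL, and the timestamps are illustrative values, with time passed in explicitly so the behavior is easy to follow.

```python
class CachingClient:
    """Client-side DNS cache that refreshes its answer when the TTL expires."""

    def __init__(self, lookup, ttl_seconds):
        self.lookup = lookup  # function returning the current DNS answer
        self.ttl_seconds = ttl_seconds
        self.cached_ip = None
        self.expires_at = float("-inf")  # forces a lookup on first use

    def resolve(self, now):
        if now >= self.expires_at:
            self.cached_ip = self.lookup()
            self.expires_at = now + self.ttl_seconds
        return self.cached_ip


# The DNS answer the service provider currently publishes.
published = {"ip": "IP A"}
client = CachingClient(lambda: published["ip"], ttl_seconds=30)

print(client.resolve(now=0))   # the client caches IP A until t=30

published["ip"] = "IP B"       # location A fails at t=10, DNS now answers B

print(client.resolve(now=20))  # still the cached IP A: the client is stuck
print(client.resolve(now=30))  # the TTL expired, the client moves to IP B
```

In this run the client keeps sending traffic to the failed location for 20 seconds after the DNS answer changed, which is the window a shorter TTL reduces.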

Anatomy of an outage in a region.

In order to maintain high availability of service (and keep periods of unavailability very short), we rely on a short TTL, which sets how often clients refresh the DNS target. Although it implies a slight delay for network communications at every DNS refresh, the tradeoff is advantageous given the request rate of our customers.
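The tradeoff can be put into rough numbers. The sketch below uses a simplified model with assumed values: a client may keep a failed location for up to the time needed to detect the failure plus one full TTL, and the cost of one DNS lookup is spread over all the requests sent during one TTL. None of the figures are DataDome measurements.

```python
def worst_case_failover_s(detection_s, ttl_s):
    """Longest time a client may keep using a failed location.

    Simplified model: the failure must first be detected, then a client
    that refreshed just before the DNS change waits one full TTL.
    """
    return detection_s + ttl_s


def amortized_dns_overhead_ms(dns_lookup_ms, ttl_s, requests_per_s):
    """DNS lookup cost spread over the requests sent during one TTL."""
    requests_per_ttl = max(1.0, ttl_s * requests_per_s)
    return dns_lookup_ms / requests_per_ttl


# Assumed values for illustration only.
print(worst_case_failover_s(detection_s=10, ttl_s=30))
print(amortized_dns_overhead_ms(dns_lookup_ms=20, ttl_s=30, requests_per_s=100))
```

With these assumed values, a client recovers within 40 seconds at worst, while the DNS refresh adds well under a hundredth of a millisecond per request on average. The higher the request rate, the cheaper a short TTL becomes.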

We lost one region (eu-west-1) for ~4 minutes.

The health check of the region triggered an outage.

The closest healthy region temporarily took over the area during the outage.

Conclusion

Geo-DNS is a powerful tool that allows us to answer client requests very quickly, no matter their source location. In addition, paired with health checks, geo-DNS allows us to keep handling traffic, even during a provider or network outage. We leverage geo-DNS at DataDome to bring our customers a very low-latency service focused on reliability.

Still, the mechanism relies on the location of the source of the DNS query, which implies that the IP geolocation reference data must be accurate and the source IP of the DNS query must reach our DNS resolvers. At DataDome, we face some odd corner cases linked to those two assumptions, but we have automated procedures in place to monitor traffic latency and quickly detect and fix such cases. We work tirelessly to mitigate any issues that arise as we protect our customers around the globe.

Stay tuned for more on this topic…