How to Monitor ML Models Without Labels
At DataDome, we use machine learning (ML) to enrich our data and to decide whether a request comes from a human or from a malicious actor. We previously presented some ML use cases and explained how we build our label datasets for bot detection, as well as how we debug ML models using our recently open-sourced package sliceline.
But the ML model lifecycle does not end with training. We also need to monitor a model’s performance after its deployment to production. This blog post focuses on classification model monitoring in real time—without access to labels.
Challenges of Monitoring Bot Detection Models
As an example, let's take the case of classifying a request as human or bot using only stateless features. In this case, most of the available features, such as the User Agent, other HTTP headers, and TLS fingerprints, are categorical with high cardinality.
Due to the adversarial nature of bot detection, feature distributions can change very rapidly as bots constantly adapt. We therefore need to monitor models that were trained in batch mode so that we can trigger retraining when drift occurs.
One could imagine a system that collects labels in real time and computes performance metrics such as the ROC-AUC or the F-score. However, this is not possible in our case, since our label datasets combine different sources of truth. In particular, they rely on information that only becomes available later on, so we cannot compute performance metrics continuously (see how we build our label datasets for bot detection).
An additional requirement is to monitor the model in a cost-effective way: ideally a single KPI per model, with drill-downs only when we detect performance degradation. Because we want to react as fast as possible, and because the high volume of data forces us to monitor with stream processing algorithms, we cannot afford to track several KPIs for each model.
Keep in mind that we are using gradient-boosted tree models because they have good performance for tabular data and provide fast inference capabilities that allow us to meet our low latency requirements.
Unsupervised Model Monitoring
Before exploring different monitoring solutions, let's define concept drift and relate it to the bot detection problem. If we denote the features as X and the target as y, concept drift arises when the joint distribution P(X, y) changes with time. One way to decompose the joint distribution is the following:
P(X, y) = P(y | X) · P(X)
Feature Drift
If only P(X) changes, then we are talking about feature drift or covariate shift, where the distribution of the features changes but the relation to the labels is constant. This can easily happen if a new version of a browser has been released and is gradually being adopted by users. The new browser version’s proportion in the feature values will grow with time. In the worst case, the new version might not even have been part of the model’s training dataset, which would lead to out-of-distribution samples.
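To make this concrete, below is a minimal sketch (not our production pipeline) of two simple drift signals for a single categorical feature: the share of values never seen during training, and the Population Stability Index between the training and production distributions. The user agent values and function names are purely illustrative.

```python
from collections import Counter
import math

def unseen_value_rate(training_values, production_values):
    """Share of production samples whose value was never seen during training."""
    known = set(training_values)
    return sum(v not in known for v in production_values) / len(production_values)

def psi(training_values, production_values, eps=1e-6):
    """Population Stability Index between two categorical distributions."""
    ref, prod = Counter(training_values), Counter(production_values)
    score = 0.0
    for value in set(ref) | set(prod):
        p_ref = ref[value] / len(training_values) + eps
        p_prod = prod[value] / len(production_values) + eps
        score += (p_prod - p_ref) * math.log(p_prod / p_ref)
    return score

# Toy example: a new browser version starts showing up in production traffic.
training_ua = ["chrome/114"] * 80 + ["firefox/115"] * 20
production_ua = ["chrome/114"] * 50 + ["chrome/115"] * 30 + ["firefox/115"] * 20

print(unseen_value_rate(training_ua, production_ua))  # 0.3
print(psi(training_ua, production_ua))                # > 0, grows as the drift spreads
```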
Concept Drift
Concept drift arises when the relation between the features and the target, P(y|X), changes with time. Consider, for example, an e-commerce company opening its business in a new country: traffic from that country might previously have been flagged as fraudulent solely on the basis of its geographical origin, but that relationship no longer holds.
Detecting real concept drift through changes in P(y|X) without access to labels is practically impossible—so, from now on we will focus on feature drift detection. Some of the changes in P(X) will have an impact on the classifier’s performance and will actually indicate real concept drift.
Focusing Our Monitoring
Conceptually, the simplest thing for our team to do would be to monitor the distribution of all our features. But given the large number of features, that wouldn’t be cost-effective. Moreover, a drift in the feature distribution does not necessarily induce a drop in the model’s performance.
Another thing we have access to is the distribution of the model's predicted probabilities. You can think of the model's output as a dimensionality reduction of the features: if the distribution of the outputs changes, then the distribution of the inputs must have changed as well.
One can detect changes in the distribution of probabilities as a whole—but a more interesting approach is to focus on the margin. We expect the output of a binary classifier to be a combination of two probability densities:
- One for the positive class.
- Another one for the negative class.
The margin is the region separating the two distributions. Intuitively, you expect the classifier to be less certain about its predictions when the features are shifting, which causes the margin density to increase. In the case of a sudden shift, we can also see the margin density drop.
[Figure: increase in margin density caused by gradual drift (left) and drop caused by sudden drift (right). Taken from arxiv.org/abs/1704.00023.]
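As a rough illustration, in the spirit of the MD3 approach referenced in the figure (and not our production code), the margin density of a batch of predictions can be estimated as the fraction of predicted probabilities that fall inside a band around the decision threshold; the band width below is an arbitrary choice.

```python
import numpy as np

def margin_density(probabilities, threshold=0.5, band=0.1):
    """Fraction of predictions falling inside the margin, i.e. within `band`
    of the decision threshold. A rise (or sudden drop) of this ratio over
    time is the drift signal."""
    probabilities = np.asarray(probabilities)
    return (np.abs(probabilities - threshold) <= band).mean()

# Example: monitor the margin density over successive batches of predictions.
stable_batch = np.random.beta(0.5, 0.5, size=10_000)    # confident, bimodal scores
drifting_batch = np.random.beta(2.0, 2.0, size=10_000)  # scores piling up near 0.5

print(margin_density(stable_batch))    # low margin density
print(margin_density(drifting_batch))  # noticeably higher
```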
It is tempting to use the margin density as an indicator, as it focuses on the uncertainty of the model’s predictions. However, the value of the predicted probability is not always an indicator of the uncertainty of a classifier’s prediction.
In fact, we noticed that gradient-boosted tree classifiers can make very confident predictions (close to 1 or 0) for samples containing previously unseen values. The actual result depends on the random seed used to train the model. So, we need a more robust way to estimate uncertainty.
One way to estimate uncertainty is to use an ensemble of independent models, similar to the way dropout layers are used to approximate an ensemble and estimate uncertainty for neural networks. In our case, we can generate an ensemble of models by training with different random seeds.
We then estimate the uncertainty as the variance of the logits of the predictions over the different models of the ensemble. The intuition behind this approach is that if the data differ from what the models have seen during training, their predictions will be more diverse.
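Here is a simplified sketch of that idea using scikit-learn's HistGradientBoostingClassifier; the helper names are ours for illustration, and details such as the ensemble size are arbitrary choices.

```python
import numpy as np
from scipy.special import logit
from sklearn.ensemble import HistGradientBoostingClassifier

def train_ensemble(X, y, n_models=5, **params):
    """Train the same model several times, changing only the random seed.
    Early stopping introduces a random train/validation split, so different
    seeds actually produce different models."""
    return [
        HistGradientBoostingClassifier(random_state=seed, early_stopping=True, **params).fit(X, y)
        for seed in range(n_models)
    ]

def ensemble_uncertainty(models, X, eps=1e-6):
    """Per-sample uncertainty: variance across the ensemble of the logit of
    the predicted probability. The more the models disagree, the less the
    sample looks like the training data."""
    probas = np.stack([m.predict_proba(X)[:, 1] for m in models])  # (n_models, n_samples)
    return logit(np.clip(probas, eps, 1 - eps)).var(axis=0)
```

Averaging this score over a time window gives a single KPI to watch, while the per-sample values remain available for drill-downs.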
Using Uncertainty to Monitor Models
To illustrate the idea of using uncertainty to monitor ML models, we use an ensemble of histogram gradient boosting tree models trained on the KDD Cup 1999 (KDD99) intrusion detection dataset. The KDD99 dataset contains three categorical features and several numerical features used to classify different intrusion types.
We modified the target to turn the problem into a binary classification problem, distinguishing only between normal and abnormal traffic. Our focus was on out-of-distribution samples for categorical data, since those constitute the bulk of our datasets.
We ran the experiment by iteratively contaminating the categorical features with values unseen during training to compute the prediction uncertainty. The experiment simulated a setting in which new values appear and gradually spread. Every contamination level represents the data at a different point in time.
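For illustration, the contamination step can be sketched as follows; the helper and the commented usage are hypothetical, and protocol_type, service, and flag are the three categorical columns of KDD99.

```python
import numpy as np
import pandas as pd

def contaminate(column: pd.Series, rate: float, seed: int = 0) -> pd.Series:
    """Replace a fraction `rate` of a categorical column with values that
    were never seen during training."""
    rng = np.random.default_rng(seed)
    contaminated = column.astype(object).copy()
    mask = rng.random(len(column)) < rate
    contaminated.loc[mask] = [f"unseen_value_{i}" for i in range(int(mask.sum()))]
    return contaminated

# For each contamination level, score the contaminated test set and track the
# mean uncertainty (e.g. with the ensemble_uncertainty sketch above):
# for rate in (0.0, 0.1, 0.2, 0.5):
#     X_drifted = X_test.assign(service=contaminate(X_test["service"], rate))
#     print(rate, ensemble_uncertainty(models, X_drifted).mean())
```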
You can see below that there is a positive correlation between the contamination and the mean of the computed uncertainty.
[Figure: mean prediction uncertainty as a function of the contamination level, computed as the ratio of values unseen during training in the dataset.]
An interesting aspect of this approach is that one can characterize the uncertainty of a trained model on reference data, and then monitor the uncertainty in production to detect deviations. A useful byproduct of computing the uncertainty is risk prevention: we can decide to abstain and not make any decision when the prediction is too uncertain.
Since the uncertainty is computed for every sample, we can use the information to drill down and discover the root cause of the increase in uncertainty. We ran an experiment using sliceline and were able to uncover the values that we injected into the dataset.
Sliceline was originally designed to find slices where a model is underperforming. In this case, we provided the uncertainty score instead of the model’s error and discovered the samples containing the injected values.
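As an illustration on toy data, the drill-down looks roughly like this, assuming sliceline's Slicefinder interface, where fit takes the features and a per-sample error signal for which we substitute the uncertainty score; constructor parameters and the expected input encoding may differ between versions, so refer to the package's README.

```python
import numpy as np
from sliceline.slicefinder import Slicefinder

# Toy data: three integer-encoded categorical features and a per-sample
# "error" signal, here the uncertainty score instead of the model's error.
rng = np.random.default_rng(0)
X = rng.integers(1, 4, size=(5000, 3))
uncertainty = rng.random(5000) * 0.1
uncertainty[X[:, 0] == 3] += 1.0   # make one slice (feature 0 == 3) very uncertain

slice_finder = Slicefinder()       # default parameters
slice_finder.fit(X, uncertainty)
print(slice_finder.top_slices_)    # expected to surface the feature 0 == 3 slice
```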
Conclusion
Monitoring machine learning models without access to labels in the context of bot detection is a distinct challenge. But at DataDome, we approach it as an excellent opportunity to explore different solutions to complex problems, and to find the best answer for our customers.
Uncertainty quantification plays an important role in monitoring ML models and mitigating risk, which is why we are using it at DataDome to monitor the performance of models used for data enrichment. We are also in the process of implementing it in our low-latency models used directly for bot detection.