How to Detect Text (NLP) and Image (Computer Vision) Data Drift
Introduction
In the real world, data is recorded in many different systems and formats, and it changes constantly. These changes may come from noise introduced by ageing and mechanical wear in the systems that capture the data, from a fundamental change in an upstream production process, or from shifts in consumer behaviour. Such changes affect prediction accuracy and make it necessary to check whether the assumptions made during model development still hold once the models are in production.
In the context of machine learning, we consider data drift to be a change in the model's input data that degrades the model's performance. In the remainder of this article, we will show how to detect data drift for models that ingest image data or text data as their input, in order to prevent their silent deterioration in production.
Four different types of real-time monitoring techniques to detect and reduce model drift
• Data quality — Detects changes in the data schema and in the statistical properties of the dependent and independent variables, and alerts when drift is detected. A minimal sketch of such a check appears after this list.
• Model quality — Monitors model performance metrics such as accuracy or precision in real time by collecting actual outcomes from the application and automatically joining them with the corresponding prediction data.
• Model bias — Even if your initial data or model is not biased, changes in the world may cause bias to develop over time in an already trained model.
• Model explainability — Alerts you when the relative importance of feature attributions drifts.
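To make the data-quality point more concrete, here is a minimal sketch of a drift check on a single numeric feature using a two-sample Kolmogorov–Smirnov test; the synthetic data and the 0.05 significance threshold are illustrative assumptions, not part of any specific monitoring product.

```python
# Minimal sketch of a data-quality style drift check on one numeric feature,
# using a two-sample Kolmogorov-Smirnov test. Data and threshold are invented.
import numpy as np
from scipy.stats import ks_2samp

def feature_drift_alert(train_values, prod_values, p_threshold=0.05):
    """Return (drifted, statistic, p_value) comparing production to training."""
    statistic, p_value = ks_2samp(train_values, prod_values)
    return p_value < p_threshold, statistic, p_value

# Example with synthetic data: production values are shifted upward.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod = rng.normal(loc=0.6, scale=1.0, size=1_000)

drifted, stat, p = feature_drift_alert(train, prod)
print(f"drift detected: {drifted} (KS statistic={stat:.3f}, p={p:.4f})")
```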
Let us discuss the types of data drift that apply to text data.
NLP Data Drift
NLP is used in many different use cases, from chatbots and virtual assistants to machine translation and text summarization. To ensure that these applications perform at the expected level, it is important that the training data and the production data come from the same distribution. When the data used for inference (production data) differs from the data used during model training, we encounter what is known as data drift. When data drift occurs, the model no longer fits the data it sees in production and may perform worse than expected. It is therefore important to continuously monitor inference data and compare it with the data used during training.
Data drift can be divided into three categories depending on whether distribution changes occur on the input or output side, or whether the relationship between input and output has changed.
Covariate Shift
In covariate shift, the input distribution changes over time, but the conditional distribution P(y | x) does not. This type of drift is called a covariate shift because the problem arises from changes in the distribution of the covariates (features). For example, in an email spam classification model, the distribution of the training data (the email corpora) may differ from the distribution of the data seen during scoring.
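As an illustration of how covariate shift could be measured for the email example, the sketch below compares the unigram (word-count) distributions of a training corpus and a scoring corpus using the Jensen–Shannon distance; the toy corpora and the 0.2 alert threshold are assumptions made purely for the example.

```python
# Illustrative covariate-shift check: compare unigram distributions of the
# training corpus and the scoring corpus with the Jensen-Shannon distance.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.feature_extraction.text import CountVectorizer

train_emails = ["meeting at noon", "please review the report", "lunch tomorrow?"]
prod_emails = ["WIN a FREE prize now", "claim your reward", "free free free"]

# Fit a shared vocabulary so both distributions live in the same space.
vectorizer = CountVectorizer().fit(train_emails + prod_emails)
train_counts = np.asarray(vectorizer.transform(train_emails).sum(axis=0)).ravel()
prod_counts = np.asarray(vectorizer.transform(prod_emails).sum(axis=0)).ravel()

# Normalize counts into probability distributions over the vocabulary.
train_dist = train_counts / train_counts.sum()
prod_dist = prod_counts / prod_counts.sum()

distance = jensenshannon(train_dist, prod_dist)
print(f"Jensen-Shannon distance between corpora: {distance:.3f}")
if distance > 0.2:  # illustrative threshold
    print("Possible covariate shift in the input text distribution.")
```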
Label Shift
While covariate shift focuses on changes in the feature distribution, label shift focuses on changes in the class distribution. This type of shift can be seen as the reverse of covariate shift. A good way to think about it is to consider an imbalanced dataset. If the spam to non-spam ratio of emails in our training set is 50/50, but in production only 10% of emails are spam, then the target label distribution has changed.
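One simple way to check for label shift is to compare the class proportions assumed at training time with the counts observed in production (from delayed ground truth or, as a proxy, predicted labels) using a chi-square goodness-of-fit test; the counts below are invented for illustration.

```python
# Sketch of a label-shift check: compare the spam/non-spam proportions seen
# in training against those observed in production. Counts are invented.
from scipy.stats import chisquare

train_class_ratio = {"spam": 0.5, "not_spam": 0.5}   # proportions in training
prod_counts = {"spam": 100, "not_spam": 900}          # counts observed in production

total = sum(prod_counts.values())
observed = [prod_counts[c] for c in train_class_ratio]
expected = [train_class_ratio[c] * total for c in train_class_ratio]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square={stat:.1f}, p={p_value:.4g}")
if p_value < 0.05:
    print("Label distribution in production differs from training (label shift).")
```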
Concept Shift
Concept shift is different from covariate and label shift because it is not related to the data distribution or the class distribution, but to the relationship between the two. For example, spammers often vary their tactics to get past spam filtering models, so the concepts present in the emails used during training may change over time.
Model Monitoring: Method
One can build a simple model monitoring system that computes text embeddings for both the training data and the inference data, and then calculates the cosine similarity between them to see how close the inference data is to the training data.
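A minimal sketch of this idea is shown below, assuming the sentence-transformers library for the text embeddings (any embedding model would work); the model name, toy texts, and the 0.8 similarity threshold are illustrative assumptions.

```python
# Sketch of embedding-based monitoring: embed both datasets, then compare
# their centroids with cosine similarity. Model and threshold are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = ["order confirmation for your purchase", "your invoice is attached"]
inference_texts = ["unbelievable offer, click here now", "you have won a prize"]

# Represent each dataset by the mean of its sentence embeddings.
train_centroid = model.encode(train_texts).mean(axis=0)
inference_centroid = model.encode(inference_texts).mean(axis=0)

# Cosine similarity between the two centroids.
cosine_similarity = np.dot(train_centroid, inference_centroid) / (
    np.linalg.norm(train_centroid) * np.linalg.norm(inference_centroid)
)
print(f"cosine similarity between training and inference data: {cosine_similarity:.3f}")
if cosine_similarity < 0.8:  # illustrative threshold
    print("Inference data looks dissimilar to training data: possible drift.")
```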
Image Data Drift
Given the growing interest in deep learning, more and more models that ingest non-traditional forms of data, such as free-form text and images, are being put into production. In such cases, methods based on statistical process control and performance monitoring, which rely primarily on numerical data, are difficult to apply, and a new way of monitoring models in production is needed. Let’s explore an approach that can be used to detect data drift for models that classify or score image data.
Model Monitoring: Method
Our approach does not make any assumptions about the model being used, but it does require access to the training data used to build the model and to the inference data being scored. The most accurate way to detect image data drift is to learn a low-dimensional representation of the training data set and use it to reconstruct the data presented to the model. If the reconstruction error is high, the data presented to the model differs from what it was trained on. The sequence of activities is as follows:
1. Learn a low-dimensional representation of the training data (encoder)
2. Reconstruct the validation data using the representation from Step 1 (decoder) and store the reconstruction loss as the baseline reconstruction loss
3. Reconstruct batches of data sent for predictions using the encoder and decoder from Steps 1 and 2, and store the reconstruction loss
If the reconstruction loss of the dataset being used for predictions exceeds the baseline reconstruction loss by a predefined threshold, set off an alert.
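The sketch below shows one possible implementation of these steps with a small dense autoencoder in Keras; the image size, network architecture, and the 1.5x alert threshold are illustrative assumptions rather than part of the prescribed approach.

```python
# Sketch of the reconstruction-loss approach with a small dense autoencoder.
# Image shape, architecture, and the 1.5x threshold are illustrative choices.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

IMG_DIM = 28 * 28  # e.g. flattened 28x28 grayscale images

def build_autoencoder(input_dim=IMG_DIM, latent_dim=32):
    # Step 1: encoder learns a low-dimensional representation of the training data.
    encoder = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(latent_dim, activation="relu"),
    ])
    # Step 2: decoder reconstructs the input from that representation.
    decoder = keras.Sequential([
        keras.Input(shape=(latent_dim,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(input_dim, activation="sigmoid"),
    ])
    autoencoder = keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder

def reconstruction_loss(autoencoder, batch):
    # Mean squared error between the inputs and their reconstructions.
    reconstructed = autoencoder.predict(batch, verbose=0)
    return float(np.mean(np.square(batch - reconstructed)))

# Placeholder arrays standing in for flattened, [0, 1]-scaled image batches;
# in practice these come from your training, validation, and production pipelines.
x_train = np.random.rand(1000, IMG_DIM).astype("float32")
x_val = np.random.rand(200, IMG_DIM).astype("float32")
x_prod = np.random.rand(200, IMG_DIM).astype("float32")

autoencoder = build_autoencoder()
autoencoder.fit(x_train, x_train, epochs=5, batch_size=64, verbose=0)

baseline_loss = reconstruction_loss(autoencoder, x_val)    # Step 2: baseline
production_loss = reconstruction_loss(autoencoder, x_prod) # Step 3: scoring batch

THRESHOLD = 1.5  # alert if production loss is 1.5x the baseline (illustrative)
if production_loss > THRESHOLD * baseline_loss:
    print(f"Drift alert: reconstruction loss {production_loss:.4f} "
          f"exceeds baseline {baseline_loss:.4f}")
```

In practice, the random placeholder arrays would be replaced with real image batches from your pipeline, and a convolutional autoencoder would usually be a better fit for image data than the dense layers used here.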