AI Models in Production — The Beginning of the End

Cetas AI
Jun 7, 2022 · 6 min read

Introduction

It is a very common perception that the AI lifecycle ends when a model is put into production. Many well-known AI frameworks reinforce this view. For example, CRISP-DM, one of the most widely used frameworks for AI development, places deployment as the last stage of the lifecycle.

However, ask any seasoned Data Scientist and they will confirm that model deployment to production is just the beginning of the end stage of the AI lifecycle. Post deployment, many issues can arise that render the whole model useless to its stakeholders. This is where AI monitoring and maintenance come into the picture. Complex questions like:

  • How do you ensure that your AI models are performing as expected in production?
  • What if the customer behaviour changes and your training data becomes stale? How do you ensure the models are learning the new behaviour?
  • When do the AI models need retraining?

need to be tackled by Data Scientists and ML Engineers once the AI models move to production. Adding to this complexity is the fact that AI monitoring and maintenance is a rapidly evolving field, in terms of both tooling and techniques. Monitoring is ideally a cross-disciplinary effort, yet it can mean different things across Data Science, Engineering, DevOps and the Business. Given this complex combination and ambiguity, it is not surprising that many data scientists and Machine Learning (ML) engineers feel uncertain about monitoring.

Benefits of AI Monitoring

By monitoring the AI models and associated artefacts, one can check for unexpected changes in:

  • Model input data
  • Model predictions
  • Model quality and fairness
  • System operations
  • Business value generated

This helps in the proactive identification of issues and their remediation, so that the AI models perform in production as expected.

Quick Overview of the AI Lifecycle

Monitoring an AI model refers to the ways in which we track and understand the performance of the model in production, from both a data science and an operational perspective. Insufficient monitoring can lead to incorrect models left unchecked in production, stale models that stop adding business value, or subtle bugs that creep into models over time and go uncaught, leading to inaccurate results.

There are six phases in the lifecycle of an ML model:

  • Model Building
  • Model Evaluation and Experimentation
  • Model Productionization
  • Testing
  • Deployment
  • Continuous Monitoring and Observability

Let’s go through the above phases in detail, with examples from different scenarios.

Monitoring Scenarios

Scenario 1: The deployment of a brand-new model.

Scenario 2: Complete replacement of the existing model with an entirely different one.

Scenario 3: Small tweaks to the current live model.

Say, for example, we have a model in production and one variable is no longer available, so we need to replicate the existing model without that feature. Or we have developed an excellent new feature that we believe will improve predictions dramatically, and we want to re-release our model with that feature as an additional input.

No matter what the situation, monitoring lets us confirm that the changes to our model have the effect we want, which is what we ultimately care about.

ML system behaviour tracking falls into three categories:
  • Data (ML System specific requirement)
  • Model (ML System specific requirement)
  • Code (and Config)

The Responsibility Challenge and Why You Need Monitoring

"By failing to prepare, you are preparing to fail." - Benjamin Franklin

ML Systems Span Multiple Groups (can also include data engineers, DBAs, analysts, etc.):

Before we dig deeper into monitoring, it is important to mention that it has different implications in different parts of the business.

  1. Data scientists/Data Science view
  2. Engineers & DevOps

Data scientists/Data Science view

We are talking about post-production monitoring, not pre-deployment evaluation (which is where we look at ROC curves and the like in the research environment).

Monitoring should be built to provide early warning against the many potential malfunctions of an ML model in production, which include the following:

  1. Data Skews

Data Skews occur when our training data does not represent live (inference) data.

There are multiple reasons why this can happen:

  • We designed the training data incorrectly: the distributions of the training set variables do not match the distributions of the live data.
  • A feature is not available in production: the feature needs to be removed, replaced with a similar feature that is available in production, or re-created in the production pipeline.
  • Research/Live data mismatch: the data used to train the AI models comes from one source and the live data from another, so the two datasets may not be identical, which leads to wrong predictions. One simple way to catch such mismatches is sketched below.
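
A minimal sketch of detecting this kind of skew on a numerical feature is shown below, using a two-sample Kolmogorov-Smirnov test from SciPy to compare the training distribution against recent live data. The feature name, file paths and significance threshold are illustrative assumptions, not part of the original workflow.

```python
# A minimal sketch of detecting data skew on a numerical feature.
# Assumes pandas and SciPy are available; the feature name, file paths
# and the 0.05 significance threshold are illustrative choices only.
import pandas as pd
from scipy import stats

def numerical_feature_skewed(train_values: pd.Series,
                             live_values: pd.Series,
                             alpha: float = 0.05) -> bool:
    """Return True if the live distribution of a feature differs
    significantly from its training distribution (two-sample KS test)."""
    _, p_value = stats.ks_2samp(train_values.dropna(), live_values.dropna())
    return p_value < alpha

# Hypothetical usage:
# train_df = pd.read_parquet("training_features.parquet")
# live_df = pd.read_parquet("last_7_days_features.parquet")
# if numerical_feature_skewed(train_df["loan_amount"], live_df["loan_amount"]):
#     print("Warning: 'loan_amount' has drifted away from the training data")
```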

2. Model Staleness

  • Shifts in the environment: the conditions the model was trained under may no longer match the live environment. For example, a model trained on data from the pandemic may not be effective at predicting default in times when the economy is healthy.
  • Changes in consumer behaviour: customer preferences change with trends in fashion, politics, ethics, etc. Especially in recommendation systems, this is a risk that should be monitored regularly. One common way to quantify such shifts is sketched below.
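
The article does not prescribe a particular metric for this, but one widely used option is the Population Stability Index (PSI). The sketch below assumes a numerical variable; the bucket count and the usual alert thresholds are conventions, not requirements from the article.

```python
# Illustrative Population Stability Index (PSI) calculation for a
# numerical variable; the 10 buckets and the usual 0.1/0.25 alert
# thresholds are conventions rather than requirements from the article.
import numpy as np

def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               buckets: int = 10) -> float:
    """Bucket the expected (training-time) sample by quantiles and measure
    how the actual (live) sample redistributes across those buckets."""
    edges = np.quantile(expected, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range live values
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the percentages to avoid division by zero / log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) *
                        np.log(actual_pct / expected_pct)))

# A PSI above roughly 0.25 is commonly treated as a sign that the model is
# stale and may need retraining; 0.1-0.25 usually warrants a closer look.
```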

3. Model Input Monitoring

Looking at the expected set of values for each input feature, we can check a) that the input values fall within the allowable set (for categorical inputs) or range (for numerical inputs), and b) that the frequency of each value matches what we have seen in the past.

For example, if a model input is “gender”, the check will verify that the incoming values fall within the expected set of categories.

Depending on the configuration of our model, we may allow certain input features to have null values. This is something we can monitor: if a feature that is not expected to be null starts arriving as null, this may indicate a data skew or a change in consumer behaviour, both of which warrant further investigation.
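
A minimal sketch of such input checks, assuming recent inference inputs are available as a pandas DataFrame; the “gender” categories and the null-rate limits are illustrative assumptions.

```python
# Minimal sketch of input monitoring on a batch of live inference data.
# The "gender" categories and the null-rate limits are illustrative
# assumptions, not values taken from the article.
import pandas as pd

EXPECTED_CATEGORIES = {"gender": {"male", "female", "other"}}
MAX_NULL_RATE = {"gender": 0.0, "age": 0.01}   # allowed fraction of nulls

def check_inputs(batch: pd.DataFrame) -> list[str]:
    """Return human-readable alerts for out-of-set categorical values
    and for features whose null rate exceeds the allowed limit."""
    alerts = []
    for column, allowed in EXPECTED_CATEGORIES.items():
        unexpected = set(batch[column].dropna().unique()) - allowed
        if unexpected:
            alerts.append(f"{column}: unexpected values {unexpected}")
    for column, limit in MAX_NULL_RATE.items():
        null_rate = batch[column].isna().mean()
        if null_rate > limit:
            alerts.append(f"{column}: null rate {null_rate:.2%} exceeds {limit:.2%}")
    return alerts

# Hypothetical usage: run check_inputs(live_batch_df) on every scoring
# batch and push any alerts to the team's monitoring channel.
```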

4. Model Prediction Monitoring

When actuals (ground-truth labels) are not yet available, we can use automated or manual processes (more on the automated options in later sections) to compare the distribution of our model’s predictions against a baseline using statistical tests.

Basic statistics: mean, median, standard deviation, max/min values.

For example, if a prediction variable is normally distributed, we would expect its live mean to fall within the standard-error-of-the-mean interval around the training-time mean.

These checks can be run manually, or automated as part of a scheduled monitoring job.
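
A minimal sketch of this kind of prediction monitoring, assuming a baseline sample of predictions was captured at training time; the file names and the standard-error rule are illustrative assumptions.

```python
# Sketch of prediction-distribution monitoring when actuals are not yet
# available: compare basic statistics of recent predictions against a
# baseline captured at training time. File names and the standard-error
# rule are illustrative assumptions.
import numpy as np

def summarize(predictions: np.ndarray) -> dict:
    """Basic statistics of a batch of predictions."""
    return {
        "mean": float(np.mean(predictions)),
        "median": float(np.median(predictions)),
        "std": float(np.std(predictions)),
        "min": float(np.min(predictions)),
        "max": float(np.max(predictions)),
    }

def mean_within_standard_error(baseline: np.ndarray, live: np.ndarray) -> bool:
    """For roughly normal predictions, the live mean should stay within
    the standard error of the baseline mean."""
    standard_error = np.std(baseline) / np.sqrt(len(baseline))
    return abs(np.mean(live) - np.mean(baseline)) <= standard_error

# Hypothetical usage:
# baseline_preds = np.load("training_time_predictions.npy")
# live_preds = np.load("last_day_predictions.npy")
# if not mean_within_standard_error(baseline_preds, live_preds):
#     print("Prediction mean has drifted:", summarize(live_preds))
```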

Engineers & DevOps

When we say “monitoring the ML models post-production”, think about system health, latency and memory/CPU/disk utilization as well, not just model quality.
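
A small sketch of this operational side, assuming the psutil package is available for process metrics; in a real system these numbers would be exported to a metrics stack rather than printed.

```python
# Minimal sketch of operational monitoring around a prediction call:
# wall-clock latency plus process memory and CPU, assuming the psutil
# package is installed. In practice these numbers would be exported to a
# metrics system (e.g. Prometheus/Grafana) rather than printed.
import time
import psutil

def predict_with_metrics(model, features):
    process = psutil.Process()          # current serving process
    start = time.perf_counter()
    prediction = model.predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    memory_mb = process.memory_info().rss / (1024 * 1024)
    print(f"latency_ms={latency_ms:.1f} memory_mb={memory_mb:.1f} "
          f"cpu_percent={psutil.cpu_percent():.1f}")
    return prediction
```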

Conclusion

There are many parameters, and a lot of effort, involved in continuously getting the right value out of AI models once they are productionized. So, instead of tying up Data Scientists with monitoring duties, organizations can put a single monitoring tool in place that tracks all of the above and generates alerts for any issues, leaving the data scientists free to focus on other AI use cases.


Cetas AI

Navigating your journey “From Model centric to Value centric AI” visit: www.cetas.ai