How to Monitor Models

November 24, 2020

After we serve machine learning models, we also need to monitor them. In this post, I will go into the details of monitoring machine learning models.

Needless to say, monitoring is a very important part of machine learning based applications and services. On top of the usual concerns for any software system, ML brings its own unique monitoring challenges. Two of these unique issues are:

If you compare this with non-ML applications, the differences are the data, the model characteristics, and how the model behaves. This is exacerbated over time: once the model is put into production and serves real customer traffic, it starts deteriorating gradually, because the training/validation set used to tune the model's parameters diverges over time from the production data.

For all of these reasons, building ML based systems takes a lot more thought, and the monitoring aspects of these applications/services require even more work.

The two questions that outline the monitoring concerns are the following:

How do I know if my models are performing well or getting worse?

If there is a business/customer impacting metric, the deterioration of the model should be measured against it. This is also a good metric to measure across the site to understand the downstream effects of the model.

How do I know if my model is down?

This is a better studied question, and it is relatively easy to answer. We define an SLA and certain operational metrics around the service/API that serves the model/predictions, and measure against the SLA.
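As a rough illustration of measuring a serving endpoint against an SLA, here is a minimal sketch. The function names, the 200 ms p99 latency limit, and the 1% error-rate limit are all hypothetical values picked for the example, not recommendations; a real service would take them from its own SLA.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in (0, 100])."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(latencies_ms, error_count, request_count,
              p99_limit_ms=200.0, max_error_rate=0.01):
    """Return a list of SLA violations; an empty list means the SLA holds.

    The default limits are assumptions made for this example only.
    """
    violations = []
    p99 = percentile(latencies_ms, 99)
    if p99 > p99_limit_ms:
        violations.append(
            f"p99 latency {p99:.1f} ms exceeds {p99_limit_ms:.1f} ms")
    error_rate = error_count / request_count if request_count else 0.0
    if error_rate > max_error_rate:
        violations.append(
            f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    return violations
```

A check like this could run on a schedule over the last few minutes of request logs and page someone whenever the returned list is not empty.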

I will answer both of these questions in much more detail in this post, but let's first look at what we need to do to accommodate change, which is a crucial part of ML based systems.


ML systems behave quite differently from traditional software systems, where change is more often than not the root cause of disturbance in the system. The inputs to the system, which define how your machine learning based application works, change over time. It is in your best interest to keep adjusting the model to the changing world.

In non-ML software systems, you can use property based testing and other advanced testing methods to test your software against possible inputs. Because of that, change carries a risk, and a code freeze aims to reduce change in order to increase reliability. Especially if you have mission critical windows, such as the holiday season in the ecommerce space, a freeze increases the stability of the system, and it is considered a best practice for software systems.

In ML based systems, this is simply not possible, and it is not the best strategy for adapting to ever changing data and consumer needs. As the input data and the behavior you observe change, you need to change the model, the training data, and the other components that respond to those changes.

If you could capture all of the possible input and output relations deterministically in tests, you would not need an ML model in the first place.

The reason you build machine learning based applications is to generalize to inputs that have not been seen before. This is why ML exists, but it is also what makes it hard to test.

Monitoring of MLOps Components

Changing Anything Changes Everything

Google's seminal paper introduced the concept of CACE (Changing Anything Changes Everything). If you are adding a new field to a form where you collect data, be mindful of this principle. If you are using failed requests to a system as negative samples, you may have a hard time when a lot of bad actors try to log in to the system. If you are removing one analytics beacon's data from an existing data pipeline,

Data Pipeline/Preparation

Monitoring of models should start at the data level, as this part has a significant impact on model performance. This may also be where you get the best return on investment, since at this layer the ETL (extract-transform-load) jobs can be monitored and alerted on with basic checks. Because features and model training depend heavily on basic data preparation, you need to make sure that all issues/errors here are properly monitored and alerted on.
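To make "basic checks" concrete, here is a minimal sketch of the kind of validation an ETL step could run on each batch before handing it downstream. The function name, the column names used in the usage note, and the thresholds are hypothetical; which checks matter depends entirely on your pipeline.

```python
def check_batch(rows, required_columns, min_rows=1, max_null_rate=0.05):
    """Run basic checks on a batch of records (a list of dicts).

    Returns a list of problems; an empty list means the batch passed.
    The thresholds are illustrative assumptions.
    """
    problems = []
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} below minimum {min_rows}")
    if not rows:
        return problems  # nothing further to check on an empty batch
    for column in required_columns:
        # A missing key is treated the same as an explicit null.
        nulls = sum(1 for row in rows if row.get(column) is None)
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            problems.append(
                f"{column}: null rate {null_rate:.1%} "
                f"exceeds {max_null_rate:.1%}")
    return problems
```

For example, calling `check_batch(rows, ["user_id", "price"])` on a batch where a quarter of the `price` values are missing would report a single problem for `price`, which can then be turned into an alert.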

Some common things to monitor:

If you are using a streaming based solution to ingest data, you might have better luck with data lineage, as long as you establish a mechanism to ensure that metadata is preserved and communicated to downstream consumers.


If you explicitly use features in the data pipeline/offline model training as well as in the serving layer, this is another area that needs monitoring. If your serving layer and your data processing/pipeline layer do not share a similar technology stack, that is one more reason to monitor this area closely, possibly with alerts.
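One way to watch for disagreement between the two layers is to compute the same features for a sample of entities in both places and compare them. The sketch below assumes each layer can export its features as a dict keyed by entity id; that format, the function name, and the tolerances are assumptions made for illustration.

```python
import math


def feature_mismatches(offline, online, rel_tol=1e-6, abs_tol=1e-9):
    """Compare offline and online feature values for the same entities.

    Both arguments map entity_id -> {feature_name: numeric_value}.
    Returns (entity_id, feature_name) pairs that disagree; the feature
    name is None when the entity is missing from the online side.
    """
    mismatches = []
    for entity_id, offline_features in offline.items():
        online_features = online.get(entity_id)
        if online_features is None:
            mismatches.append((entity_id, None))
            continue
        for name, expected in offline_features.items():
            actual = online_features.get(name)
            if actual is None or not math.isclose(
                    expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((entity_id, name))
    return mismatches
```

Tracking the mismatch rate over time, rather than alerting on every single pair, is usually a gentler starting point, since small differences between stacks (rounding, time zones, late-arriving data) are common.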

Some more areas that need alerting:


For model training, we not only need to make sure that upstream data and features are guarded, but also need to monitor the training itself. Here, very basic things such as the mean, standard deviation, and other statistical properties of the predictions on the evaluation and test sets are useful.
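A minimal sketch of tracking these statistics is shown below. It summarizes the predictions on the evaluation and test sets and flags large gaps between the two. The function names and the gap limits are hypothetical and would need tuning for the scale of your model's outputs.

```python
import statistics


def summarize(predictions):
    """Basic statistics of numeric predictions (needs at least two values)."""
    return {
        "mean": statistics.fmean(predictions),
        "stdev": statistics.stdev(predictions),
        "min": min(predictions),
        "max": max(predictions),
    }


def compare_prediction_stats(eval_predictions, test_predictions,
                             max_mean_gap=0.05, max_stdev_gap=0.05):
    """Summarize both sets and warn when mean or stdev differ too much.

    The default gap limits are assumptions made for this example.
    """
    eval_stats = summarize(eval_predictions)
    test_stats = summarize(test_predictions)
    warnings = []
    for key, limit in (("mean", max_mean_gap), ("stdev", max_stdev_gap)):
        gap = abs(eval_stats[key] - test_stats[key])
        if gap > limit:
            warnings.append(f"{key} differs by {gap:.3f} (limit {limit:.3f})")
    return eval_stats, test_stats, warnings
```

Logging the two summaries for every training run also gives you a history to compare against when a newly trained model starts behaving differently.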

Some common things to monitor here:


Machine learning systems need more monitoring than traditional software systems. Software systems need to answer questions about the SLA, which can be measured through uptime and SLA violations. ML systems, however, must also answer questions about the model itself in the serving layer.

Important Operational Metrics

On the serving layer, one can look at the usual suspects for any system:

Model Quality Metrics

If we cannot measure model quality through a well-defined business/customer metric, we can come up with a different metric that signifies how good the model is in production.

There are many ways to look at model quality metrics:

Note that with these metrics we are not only trying to measure the accuracy of the model, but also the distribution of the input data and the distribution of the model's output. For example, a change in the input distribution (often called data drift or covariate shift) is closely related to the well-known issue of concept drift, where the relationship between inputs and outputs itself changes. Ideally, we should build guards against these types of problems beforehand. Every model suffers from drift once it has been put into production, because the data used to train the model differs from the online data. This is bound to happen: from the moment the model is put into production, it is expected to degrade.
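One common way to quantify how far a production distribution has moved from the training distribution is the Population Stability Index (PSI). This is a technique I am swapping in for illustration, not the only option; the sketch below assumes a single numeric feature (or the model's output score) and uses equal-width bins derived from the training sample. The function name, bin count, and smoothing constant are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample (expected) and a newer sample (actual).

    Bins are equal-width over the range of the baseline sample. Empty
    bins are floored at eps so that the logarithm stays defined.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bin_fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp out-of-range values
            counts[index] += 1
        return [max(count / len(values), eps) for count in counts]

    expected_fractions = bin_fractions(expected)
    actual_fractions = bin_fractions(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_fractions, actual_fractions))
```

A PSI of zero means the binned distributions are identical, and larger values mean a larger shift. The threshold at which to alert is a judgment call that depends on the feature and on how sensitive the model is to it, so it is worth calibrating against historical data before wiring it to a pager.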

Data Growth Strategy

As you build different features, your customer base will grow, and data ingestion and processing will increase as well. Monitor this growth from an infrastructure perspective. Ask the following questions:

User Behavior Change Strategy

If there is a drastic change in user behavior due to exogenous factors, we should have a plan to monitor such events and react to them. Some useful questions are:

Best Practices
