How to Serve Models


May 25, 2020

There are many ways to serve machine learning (ML) models, but these are the three most common patterns I have observed over the years:

Figure: the 3 most common ways of serving models

In this post, I want to go over these different architectures/patterns and outline their advantages and disadvantages as objectively as I can. If you are asking where the rant part is, scroll down to the bottom of the post.

  1. Materialize/compute predictions offline and serve them through a database
  2. Embed the model in the main application, so that model serving/deployment happens with the main application deployment
  3. Serve the model as a separate microservice that you send inputs to and get outputs from

Of these three approaches, while the third might be the most architecturally flexible and the most modern in terms of microservices, it is not a silver bullet: it assumes the infrastructure is ready and able to support the flexibility of the architecture. While I believe it is the best architecture to aim for, I also need to acknowledge that the other methods have their own advantages compared to the third.

Options 1, 2, and 3 can also be seen as an evolution of the architecture. Option 1 is the fastest way to get something out of the door if the input domain is bounded: you can serve the first model right off the bat to prove the customer impact and business value of the model before investing a lot in operations and "making it right". This way you evolve the architecture incrementally by delivering incremental value. If the input domain is not bounded, option 2 can be a good compromise. From a maintainability and scalability perspective, however, option 3 provides the most benefits: it separates the concerns of training and model serving, and it can easily be scaled up or down with traffic. Without further ado, let's look at some of the advantages and disadvantages of each architecture.

1. Materialize/Compute Predictions

This is one of the earliest forms of prediction-serving architecture. A data scientist outputs a SQL table, that table is ingested into production, and the application serves predictions from a database. Simple enough. Even though this architecture does not serve a model directly, it has a number of advantages.
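A minimal sketch of this pattern, using `sqlite3` as a stand-in for the production database and a placeholder scoring function as a stand-in for a trained model (both are assumptions for illustration): an offline batch job scores every known input and materializes the results, and online serving is just a key lookup.

```python
import sqlite3

# Hypothetical "model": in practice this would be a trained estimator.
def predict(user_id: int) -> float:
    return 0.1 * user_id  # placeholder scoring logic

# Offline batch job: score every known input and materialize into a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (user_id INTEGER PRIMARY KEY, score REAL)")
known_user_ids = [1, 2, 3]  # bounded input domain
conn.executemany(
    "INSERT INTO predictions (user_id, score) VALUES (?, ?)",
    [(uid, predict(uid)) for uid in known_user_ids],
)
conn.commit()

# Online serving: the application only does a key lookup; no model is involved.
def serve(user_id: int):
    row = conn.execute(
        "SELECT score FROM predictions WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

Note that the application never touches the model: if a request arrives for an input outside the bounded domain, `serve` simply has nothing to return.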



2. Embedded Model

In this architecture, the model is embedded into the main application. It is a compromise between options 1 and 3: it avoids storing predictions while still letting the application serve predictions in real time. However, it tightly couples the serving layer to the main application, which increases the overhead of maintaining and operating both the model and the application. Since the model and the application are coupled, it does have the benefit of avoiding a network call.
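A sketch of the embedded pattern, with a hypothetical toy classifier standing in for a real trained artifact (the class, the `pickle` round-trip, and `handle_request` are all illustrative assumptions): the model artifact is loaded once at application startup, and every prediction is a plain in-process function call.

```python
import pickle

# Hypothetical stand-in for a trained model; a real application would
# unpickle an artifact that ships inside the application bundle.
class SentimentModel:
    def predict(self, text: str) -> str:
        return "positive" if "good" in text else "negative"

# Serialize/deserialize to mimic loading a bundled artifact at startup.
_artifact = pickle.dumps(SentimentModel())
model = pickle.loads(_artifact)  # loaded once, lives in the app process

def handle_request(text: str) -> dict:
    # No network hop: the prediction is an in-process method call.
    return {"input": text, "label": model.predict(text)}
```

The flip side of the missing network hop is that shipping a new model means redeploying the whole application.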



3. Microservice Model Serving

In this architecture, the model is served as a separate microservice, independent of the main application. Being the most flexible in terms of model deployment, it too has advantages and disadvantages.
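A minimal sketch of this pattern using only the standard library (the `predict` function, the `/predict` path, and the JSON request shape are all assumptions for illustration): the model runs behind its own HTTP endpoint, and the main application sends inputs over the network and gets predictions back.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from urllib.request import Request, urlopen

# Hypothetical model; a real service would load a trained artifact here.
def predict(features: list) -> float:
    return sum(features)

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON request body sent by the main application.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"prediction": predict(body["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

# Run the model service in the background on an ephemeral port.
server = HTTPServer(("127.0.0.1", 0), ModelHandler)
Thread(target=server.serve_forever, daemon=True).start()

# The main application talks to the model service over the network.
req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.0, 3.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.loads(urlopen(req).read())
server.shutdown()
```

The network call is the cost; the payoff is that the service can be deployed, versioned, and scaled independently of the application that calls it.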




Whenever I see the pre-computed, materialized model-serving pattern, I cringe. Even in the best-case scenario, where you completely separate the tables between models, you still need to make sure that every single model iteration outputs a fully fledged table for all the inputs you know about. This is on top of the obvious limitation that you cannot do anything dynamic within the application. It also suggests that the engineering organization is not mature enough to serve a model in a production capacity, unless they opted into this pattern for a very specific reason.

The microservice model is the one everyone should aspire to; even if you do not get it right on the first try, it will be worth it. It not only separates ML concerns from the main application, but also empowers ML engineers to build the system end to end and make necessary adjustments in isolation. If you want to use a different library or framework, go ahead; as long as you comply with the SLA, you should be able to do anything you want. If you want to deploy a new model but do not have model versioning support, use API versioning to distinguish between models. If you want to do canary deployments because you do not know how the model performs in an online setting, you can use the canary deployment mechanism your DevOps team built, no problem. If you cannot do the microservice model, the embedded model is a good compromise.
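The API-versioning trick above can be sketched in a few lines; the two model functions and the version keys are hypothetical stand-ins for two deployed model iterations, with routes like `/v1/predict` and `/v2/predict` mapping onto the version string.

```python
# Hypothetical models standing in for two deployed model versions.
def model_v1(x: float) -> float:
    return 2 * x

def model_v2(x: float) -> float:
    return 2 * x + 1

# Route by API version (e.g. /v1/predict vs /v2/predict): each version
# string is pinned to one model, so old clients keep getting old behavior.
MODELS = {"v1": model_v1, "v2": model_v2}

def predict(version: str, x: float) -> float:
    return MODELS[version](x)
```

Canary deployment then becomes a routing decision on top of the same table: send a small fraction of traffic to the new version key and compare outcomes before flipping the default.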

All Rights Reserved

Copyright, 2020