23 April 2020, 09:14
Model performance has increased dramatically over the last few years due to an abundance of machine learning research. While these improved models open up new possibilities, they only start providing real value once they can be deployed in production applications. This is one of the main challenges the machine learning community is facing today.
Deploying machine learning applications is in general more complex than deploying conventional software applications, as an extra dimension of change is introduced. While typical software applications can change in their code and data, machine learning applications also need to handle model updates. The rate of model updates can even be quite high, as models need to be regularly retrained on the most recent data.
This blog post is a follow-up to the article about a General Pattern for Deploying Embedding-Based Machine Learning Models. Embedding-based models are hard to deploy since all the embeddings need to be recalculated, while ongoing traffic must not be interrupted and must be shifted smoothly to the new model. In this article, we introduce a set of tools and frameworks — Kubernetes, Istio and Kubeflow Pipelines — that allow you to implement this general pattern. It should be noted that this is just one way of doing it: there are plenty of viable practical implementations, and you need to figure out what works best for your team and application.
We will start from the generalized embeddings system introduced in the general pattern blog post. In essence, the generalized embeddings system is a generalization of both a search engine and a recommender system.
The generalized embeddings system has three functional components: the embedding generator, the embedding server and the application.
For this blog post, we will use the search engine system, since it contains an online model and is the most complex system.
The three components will need to run in a reliable and scalable manner, and they will need to communicate with each other. Kubernetes is one of the leading container-orchestration frameworks and a perfect fit for these requirements. Containerizing code is standard practice, and we assume that each of the components can be containerized with Docker.
Kubernetes provides some powerful service abstractions, which allow you to expose workloads either within or outside a cluster. Here, we create internal services for the embedding components, so that the application can reach them from within the cluster. We expose the application itself to the outside world via an API gateway.
The embedding generator component needs to translate a search query into an embedding. As TensorFlow is one of the standard libraries for running machine learning workloads, we use TensorFlow Serving for our model serving. The only requirement for a TensorFlow Serving workload is to have a SavedModel file and to indicate the location of this file when running the tensorflow/serving Docker image. For more implementation details, check out the TensorFlow Serving guidelines for Kubernetes.
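TensorFlow Serving exposes a REST predict endpoint on port 8501. As a minimal sketch, the application could query the embedding generator as follows (the service host embedding-generator and the model name embedding_model are assumptions for illustration):

```python
import json
import urllib.request

def build_predict_request(query: str,
                          host: str = "embedding-generator",
                          model: str = "embedding_model"):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}:8501/v1/models/{model}:predict"
    body = json.dumps({"instances": [{"query": query}]})
    return url, body

def embed(query: str):
    """Send the request; the response holds the embedding under 'predictions'."""
    url, body = build_predict_request(query)
    req = urllib.request.Request(
        url, data=body.encode(), headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"][0]

url, body = build_predict_request("red summer dress")
```

Because TF Serving speaks plain HTTP/JSON, the application component needs no TensorFlow dependency of its own.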
The embedding server can be implemented with Elasticsearch. Since Elasticsearch 7.3, similarity search in vector space is available via predefined script functions. For older versions of Elasticsearch, you will need to install a plug-in to be able to do a similarity search.
Whereas specialized similarity search frameworks such as FAISS might be a bit faster at the similarity search itself, the benefit of using Elasticsearch is that you can combine filtering on content with ranking on similarity score in vector space at the same time. Elasticsearch also supports adding documents to an index and patching documents on the fly; the insert or patch is materialized in near real time.
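Since Elasticsearch 7.3, this combination of filtering and similarity ranking can be expressed as a script_score query over a dense_vector field. A sketch of such a query body follows; the field names (category, embedding) and the filter are assumptions for illustration:

```python
def similarity_query(query_vector, category):
    """Filter on a content field, then rank by cosine similarity
    between the query embedding and the stored document embeddings."""
    return {
        "query": {
            "script_score": {
                # Hard filter on content; only matching docs are scored.
                "query": {"bool": {"filter": [{"term": {"category": category}}]}},
                "script": {
                    # cosineSimilarity ranges over [-1, 1]; adding 1.0
                    # keeps the Elasticsearch score non-negative.
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }
```

The resulting dict can be passed to the Elasticsearch Python client, e.g. es.search(index="embeddings-v1", body=similarity_query(vec, "shoes")), where the index name encodes the model version.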
Our application is implemented with the zalando/connexion framework. Connexion is a Swagger/OpenAPI First framework for Python on top of Flask, with automatic endpoint validation & OAuth2 support.
Up to now, we have shown how the generalized embeddings system can be implemented in a stand-alone manner.
As mentioned in the general pattern blog post, there is a need for two sets of instances of the embedding generator and the embedding server in order to perform a deployment without downtime.
With regards to the embedding generator, which is implemented with TensorFlow Serving, we will use a separate deployment for each model. The Kubernetes labels for the deployments share the same app: embedding-generator label but have a different version label according to the model version. This will be an important parameter for defining the routing logic later on.
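As a sketch, the label sets of the two TensorFlow Serving deployments could look as follows, expressed here as Python dicts of the kind accepted by the Kubernetes Python client (the concrete version values v1 and v2 are assumptions):

```python
def generator_labels(version: str):
    """Labels for an embedding-generator deployment of one model version."""
    return {"app": "embedding-generator", "version": version}

# Two deployments, one per model version, sharing the app label
# and distinguished only by the version label used for routing.
labels_v1 = generator_labels("v1")
labels_v2 = generator_labels("v2")
```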
For the embedding server, the situation is different. In this case we don’t run two instances; instead, we maintain two indexes for the data belonging to the different models.
Up to now, we have shown how the generalized embeddings system can be practically implemented. As we have seen in the general pattern blog post, deploying a new model version requires advanced traffic routing mechanisms to gradually shift traffic from one model version to another.
Let’s rephrase the traffic routing requirements in detail:
It is not feasible to cover these advanced requirements with native Kubernetes features. We need more fine-grained traffic routing, and luckily there is a Kubernetes add-on called Istio that enables exactly that.
Istio makes it easy to create a network of deployed services with load balancing, traffic routing, monitoring, and more, with few or no changes to service code. You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between microservices. The Istio control plane acts as a centralized control unit for these sidecar proxies.
For this use case, we are especially interested in Istio because of the traffic routing features.
The sidecar proxy can intercept incoming requests and perform logic on them, such as setting headers and routing traffic based on those headers. The sidecar is deployed alongside all the components of our generalized embeddings system, including the API gateway, and the following logic is implemented:
Depending on the model version header that is assigned to the request, the downstream request flow will be different:
These advanced routing mechanisms allow us to define the weighted traffic split across the different model versions. In order to actually perform the model deployment, an orchestrator will be required to perform all the deployment steps in a precise and correct order.
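The weighted split itself can be expressed as an Istio VirtualService. A sketch of such a resource, written as a Python dict of the kind the Kubernetes client can apply (the resource name, host and subset names are assumptions; the v1/v2 subsets would be defined in a companion DestinationRule selecting pods on their version label):

```python
def virtual_service(weight_v1: int, weight_v2: int):
    """VirtualService spec routing a weighted share of traffic
    to each embedding-generator model version."""
    assert weight_v1 + weight_v2 == 100, "weights must sum to 100"
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": "embedding-generator"},
        "spec": {
            "hosts": ["embedding-generator"],
            "http": [{
                "route": [
                    {"destination": {"host": "embedding-generator",
                                     "subset": "v1"}, "weight": weight_v1},
                    {"destination": {"host": "embedding-generator",
                                     "subset": "v2"}, "weight": weight_v2},
                ]
            }],
        },
    }
```

Ramping traffic then amounts to re-applying this resource with updated weights, which is exactly what the orchestrator will do.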
We have now identified the tools and frameworks needed to perform a model deployment. Next, we need to define all the steps required to execute a reliable model deployment. From a high-level point of view, the required steps are:
Kubeflow Pipelines is an orchestrator that lets you define a set of operations and their order of execution. Each operation is defined by a Docker container. It is also possible to transfer information from one operation to another, which is extremely helpful when an operation depends on the output of a previous one. This is often the case in machine learning workflows, where, for example, you pass the storage bucket path of the preprocessed data on to your training operation.
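To illustrate how outputs flow from one operation to the next, here is a plain-Python sketch of the idea; in Kubeflow Pipelines each step would be a containerized op, and the step names and the bucket path used here are assumptions for illustration:

```python
def preprocess(raw_path: str) -> str:
    """Preprocess raw data and return the bucket path of the result."""
    return raw_path + "/preprocessed"

def train(data_path: str) -> str:
    """Train on the preprocessed data and return the SavedModel path."""
    return data_path + "/model"

# Chained execution: each step consumes the previous step's output,
# just like output passing between Kubeflow Pipelines operations.
model_path = train(preprocess("gs://bucket/raw"))
```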
The least complicated model deployment gradually ramps traffic from one model version to another in a fully automated fashion, using a step-wise function that shifts X % every Y seconds. The speed at which you shift traffic depends heavily on your traffic volume, as the new model version needs time to scale up to meet the demand it receives.
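Such a step-wise ramp can be sketched in a few lines; apply_weights stands in for whatever mechanism updates the routing weights (e.g. patching the Istio VirtualService) and is an assumption of this sketch:

```python
import time

def ramp(step_pct: int, interval_s: float, apply_weights):
    """Step-wise ramp: shift step_pct of traffic to the new
    model version every interval_s seconds until it has 100%."""
    weight_new = 0
    applied = []
    while weight_new < 100:
        weight_new = min(100, weight_new + step_pct)
        apply_weights(100 - weight_new, weight_new)  # e.g. patch the VirtualService
        applied.append((100 - weight_new, weight_new))
        if weight_new < 100:
            time.sleep(interval_s)
    return applied

# With a 10% step the schedule is (90, 10), (80, 20), ..., (0, 100).
schedule = ramp(10, 0, lambda old, new: None)
```

Choosing step_pct and interval_s is the knob mentioned above: smaller, slower steps give the new version's pods more time to autoscale under growing load.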
If all the deployment steps mentioned above are properly containerized, it is possible to chain this set of operations together in an execution graph so that the deployment can be reliably orchestrated. We can inspect the execution graph:
In the ‘ramp’ step, the target weight percentages are updated so that traffic is gradually ramped to the other model version, as visualized in figure 10.
Steps 4 and 5 are slightly different in the A/B model deployment scenario, as the model version to be promoted is variable. In this scenario, a pause step is introduced during the ramp. Based on the model performance, new target weight percentages are defined and the ramping continues interactively until a model is promoted. Traditionally in A/B testing, the new model version first receives only a small portion of the traffic, and it is checked how the new version responds to real traffic. Depending on these results, the traffic may be gradually increased if the new model appears to be better. For example, this is a scenario where, after a 50/50% traffic split, it was decided to roll back to model version 1:
A simple tool to input these target weight percentages could be a message queue like Google Cloud Pub/Sub. The pipeline simply reads messages from a message queue and deploys the target weight percentages. In this case, the execution graph will have a ‘promote’ and ‘rollback’ branch, depending on the results of the A/B testing.
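A sketch of how the pipeline could interpret such a control message; the JSON schema with a target_weight_new field, and the branch names, are assumptions of this sketch rather than a fixed format:

```python
import json

def interpret_message(payload: str):
    """Parse a control message from the queue and decide
    which pipeline branch to take next."""
    msg = json.loads(payload)
    weight_new = msg["target_weight_new"]
    if weight_new == 100:
        branch = "promote"    # new model takes all traffic
    elif weight_new == 0:
        branch = "rollback"   # revert fully to the old model
    else:
        branch = "ramp"       # continue the interactive ramp
    return branch, (100 - weight_new, weight_new)

branch, weights = interpret_message('{"target_weight_new": 0}')
```

In the 50/50% rollback example above, publishing a message with target_weight_new set to 0 would steer the pipeline into the rollback branch.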
In order to benefit from the wide range of available Machine Learning Models, we need to be able to deploy them in production applications. Embedding-based models are particularly hard to deploy since all embeddings in the system need to be recalculated for each model version and traffic needs to be routed correctly while doing the model update.
We described a concrete set of tools and frameworks that can be used to implement the deployment strategy for these embedding-based models. Kubernetes makes it possible to deploy our embedding-based system, configure scaling behavior and expose services. Istio unlocks fine-grained traffic routing control mechanisms that are required to correctly split and route traffic across the model versions. Kubeflow Pipelines allows us to execute the set of operations, that are required to deploy a new model version, in a reliable and reproducible manner.
Mastering the technique of crafting push-button model deployment pipelines unlocks immense potential. You will be able to test new machine learning models swiftly and painlessly. Your engineering team no longer needs to spend time redoing manual deployments over and over again. Instead, model deployments are automated, reliable, reproducible and can be triggered with a simple push of a button. This practice is often referred to as MLOps, analogous to DevOps.