A Practical Guide for Deploying Embedding-Based Machine Learning Models

23 April 2020, 09:14

Deploying Embedding-Based Machine Learning Models: part 2

Model performance has increased dramatically over the last few years due to an abundance of machine learning research. While these improved models open up new possibilities, they only start providing real value once they can be deployed in production applications. This is one of the main challenges the machine learning community is facing today.

Deploying machine learning applications is in general more complex than deploying conventional software applications, as an extra dimension of change is introduced. While typical software applications can change in their code and data, machine learning applications also need to handle model updates. The rate of model updates can even be quite high, as models need to be regularly retrained on the most recent data.

Figure 1. The 3 axes of change in a Machine Learning application — data, model, and code — and a few reasons for them to change.

This blog post is a follow-up to the article about a General Pattern for Deploying Embedding-Based Machine Learning Models. Embedding-based models are hard to deploy since all the embeddings need to be recalculated, all while ongoing traffic must continue uninterrupted and be shifted smoothly over to the new model. In this article, we introduce a set of tools and frameworks — Kubernetes, Istio and Kubeflow Pipelines — that allow you to implement this general pattern. It should be noted that this is just one way of doing it. There are plenty of viable practical implementations; you just need to figure out what works best for your team and application.

Starting from the Generalized Embeddings System

We will start from the generalized embeddings system, which was introduced in the general pattern blog post. In essence, the generalized embeddings system is a generalization of a search engine and a recommender system.

Figure 2. A search engine (left), recommender system (middle), and generalized embeddings system (right).

The generalized embeddings system has three functional components:

  • An embedding generator, returning embeddings based on its input. In the search engine, this is the model translating a search query into an embedding. In the recommender system, this is the user embedding data store returning a user’s embedding based on its id.
  • An embedding server, which hosts the pre-calculated embeddings for similarity search.
  • An application, which fetches an embedding from the embedding generator and sends it to the embedding server to execute a similarity search.

For this blog post, we will use the search engine system, since it contains an online model and is the most complex system.

Kubernetes as container-orchestration framework

The three components will need to run in a reliable and scalable manner, and they will need to be able to communicate with each other. Kubernetes is one of the leading container-orchestration frameworks and a perfect fit for these requirements. Containerizing code is standard practice, and we assume each of the components can be containerized with Docker.

Kubernetes provides some powerful service abstractions, which allow you to expose workloads either within or outside a cluster. Here, we create internal services for the embedding components, so that the application can reach them from within the cluster. We expose the application itself to the outside world via an API gateway.
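As an illustration, an internal Service for the embedding generator could look like the following sketch (the name and port are assumptions, not taken from the original setup):

```yaml
# Internal (ClusterIP) Service so the application can reach the
# embedding generator from within the cluster only.
apiVersion: v1
kind: Service
metadata:
  name: embedding-generator
spec:
  type: ClusterIP            # not exposed outside the cluster
  selector:
    app: embedding-generator
  ports:
    - port: 8500             # gRPC port of TensorFlow Serving
      targetPort: 8500
```

A similar Service would be created for the embedding server, while the application sits behind the externally exposed API gateway.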

Figure 3. Kubernetes will be used to run the workloads of our generalized embeddings system in a reliable and scalable manner.

Embedding Generator with TensorFlow Serving

The embedding generator component will need to translate a search query into an embedding. As TensorFlow is one of the standard libraries for running Machine Learning workloads, we use TensorFlow Serving for our model serving. The only requirement for a TensorFlow Serving workload is a SavedModel file, whose location you indicate when running the tensorflow/serving Docker image. For more implementation details, check out the TensorFlow Serving guidelines for Kubernetes.
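A minimal Deployment sketch for such a workload could look as follows; the model name and the volume backing are placeholders, since in practice you would mount the SavedModel from a persistent volume or a cloud storage bucket:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-generator
spec:
  replicas: 2
  selector:
    matchLabels:
      app: embedding-generator
  template:
    metadata:
      labels:
        app: embedding-generator
    spec:
      containers:
        - name: tensorflow-serving
          image: tensorflow/serving
          env:
            - name: MODEL_NAME          # tensorflow/serving looks for the
              value: query-embedder     # SavedModel under /models/$MODEL_NAME
          ports:
            - containerPort: 8500       # gRPC
            - containerPort: 8501       # REST
          volumeMounts:
            - name: model-volume
              mountPath: /models/query-embedder
      volumes:
        - name: model-volume
          emptyDir: {}                  # placeholder: use real model storage here
```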

Hosting the Embedding Server on ElasticSearch

The embedding server can be implemented on ElasticSearch. Since ElasticSearch 7.3, it is possible to do a similarity search in vector space as a predefined function. For older versions of ElasticSearch, you will need to install a plug-in to be able to do a similarity search.

Whereas specialized similarity search frameworks such as FAISS might be a bit faster at the similarity search itself, the benefit of using ElasticSearch is that you can combine filtering on content with ranking on similarity score in vector space in a single query. ElasticSearch also supports adding documents to an index and patching documents on the fly; an insert or patch is materialised in near-real-time.
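To make the combination of filtering and similarity ranking concrete, here is a sketch of a query body using ElasticSearch's script_score query with the cosineSimilarity function (available since 7.3). The index field names (embedding, category) are hypothetical:

```python
def build_similarity_query(query_embedding, category):
    """Build an ElasticSearch query that filters on a content field and
    ranks the remaining documents by cosine similarity in vector space.
    Field names ('embedding', 'category') are hypothetical examples."""
    return {
        "query": {
            "script_score": {
                # Filter part: only documents in the requested category
                "query": {
                    "bool": {"filter": [{"term": {"category": category}}]}
                },
                "script": {
                    # +1.0 keeps the score non-negative, as ES requires
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_embedding},
                },
            }
        }
    }
```

The resulting dictionary can be passed as the request body of a search call against the index that holds the pre-calculated embeddings.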

Hosting the Application with Connexion/Flask

Our application is implemented with the zalando/connexion framework. Connexion is a Swagger/OpenAPI First framework for Python on top of Flask, with automatic endpoint validation & OAuth2 support.
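Connexion maps an OpenAPI operationId to a plain Python function, so the application logic stays thin. The sketch below shows what such a handler could look like; the helper functions are hypothetical stand-ins for real HTTP calls to the embedding generator and embedding server:

```python
def fetch_embedding(query):
    # Hypothetical stand-in: would call TensorFlow Serving's REST API.
    return [0.0] * 4

def similarity_search(embedding):
    # Hypothetical stand-in: would query the ElasticSearch index.
    return []

def search(q):
    """Handler for GET /search?q=..., referenced by operationId in the
    OpenAPI spec. Connexion validates the request before calling it."""
    embedding = fetch_embedding(q)
    return {"query": q, "results": similarity_search(embedding)}
```

In the OpenAPI spec, the /search endpoint would simply declare `operationId: app.search`, and Connexion wires the route, validation and response handling around it.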

Figure 4. Embedding Generator with TensorFlow Serving, Embedding Server with ElasticSearch and Application with Connexion/Flask.

Up to now, we have shown how the generalized embeddings system can be implemented in a stand-alone manner.

Two versions of the embedding generator and embedding server

As mentioned in the general pattern blog post, there is a need for two sets of instances of the embedding generator and the embedding server in order to perform a deployment without downtime.

With regards to the embedding generator, which is implemented with TensorFlow Serving, we will use a separate deployment for each model. The Kubernetes labels for the deployments share the same app: embedding-generator label but have a different version label according to the model version. This will be an important parameter for defining the routing logic later on.
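As a sketch, the label scheme for the two deployments could look like this (version names are placeholders):

```yaml
# Excerpt: two Deployments, one per model version. Both share the app
# label; the version label is what the routing logic will key on.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-generator-v1
spec:
  selector:
    matchLabels:
      app: embedding-generator
      version: v1
  template:
    metadata:
      labels:
        app: embedding-generator
        version: v1
    # ... container spec as in the TensorFlow Serving deployment above ...
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-generator-v2
spec:
  selector:
    matchLabels:
      app: embedding-generator
      version: v2
  template:
    metadata:
      labels:
        app: embedding-generator
        version: v2
    # ... container spec as in the TensorFlow Serving deployment above ...
```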

For the embedding server, the situation is different. In this case we don't have two instances; instead, we have two indexes for the data that belongs to the different models.

Figure 5. Practical implementation of the two versions of the embedding generator and embedding server.

Up to now, we have shown how the generalized embeddings system can be practically implemented. As we have seen from the general pattern blog post, in order to deploy a new model version, we will need to have some advanced traffic routing mechanisms in order to gradually split traffic from one model version to another.

Figure 6. The new version of both the embedding generator and server are deployed alongside the old one, so the application can easily switch.

Advanced Traffic Routing with Istio

Let’s rephrase the traffic routing requirements in detail:

  • At the system boundary, a model version is assigned to each request according to the target weight percentages. For example, 80% of the incoming requests need to go to version 1 and the remaining 20% goes to version 2. These target weight percentages must be easily adjustable so that we have fine-grained control over how the traffic is shifted from one model version to another over time.
  • When the request propagates to the embedding components downstream, the request should be handled properly to get consistent results. Once a model version has been assigned to the request, it should get the embedding from that model version’s embedding generator and subsequently perform the embedding lookup in the right data index of our embedding server.

It is not feasible to cover these advanced requirements with native Kubernetes features. We need more advanced traffic routing features, and luckily there is a Kubernetes add-on called Istio that enables these features.


Istio makes it easy to create a network of deployed services with load balancing, traffic routing, monitoring, and more, with little or no change to service code. You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between microservices. The Istio control plane acts as a centralized control unit for these sidecar proxies.

For this use case, we are especially interested in Istio because of the traffic routing features.

Figure 7. Istio architecture.

Header-based Traffic Routing with Istio

The sidecar proxy can intercept incoming requests and perform logic on them, such as setting headers and routing traffic based on headers. The sidecar is deployed to all the components of our generalized embeddings system, including the API gateway, and the following logic is implemented:

  • At the system boundary, the API gateway, we ask the sidecar proxy to assign a model version in the request header based on a configurable split percentage between the models. This header can then be used in downstream activities.
  • We leverage Istio's header-based routing so that the downstream request to get the embedding is routed to the embedding generator that corresponds to the model version specified in the request header. The routing logic is essentially a mapping between the model version header and the Kubernetes version label.
  • For the similarity search on the embedding server, the model version in the request header is used to perform the lookup on the correct index.
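The logic above can be sketched with Istio resources. The host names, subset names and the model-version header are assumptions for illustration; the first VirtualService performs the weighted split and stamps the header at the boundary, while the DestinationRule and second VirtualService map that header onto the version-labeled deployments:

```yaml
# At the system boundary: weighted split that assigns the model
# version header to each incoming request.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: application
spec:
  hosts:
    - application
  http:
    - route:
        - destination:
            host: application
          weight: 80                      # adjustable target weight
          headers:
            request:
              set:
                model-version: v1
        - destination:
            host: application
          weight: 20
          headers:
            request:
              set:
                model-version: v2
---
# Map the Kubernetes version labels onto Istio subsets.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: embedding-generator
spec:
  host: embedding-generator
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
# Downstream: route to the embedding generator subset that matches
# the model-version header set at the boundary.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: embedding-generator
spec:
  hosts:
    - embedding-generator
  http:
    - match:
        - headers:
            model-version:
              exact: v2
      route:
        - destination:
            host: embedding-generator
            subset: v2
    - route:                              # default: everything else to v1
        - destination:
            host: embedding-generator
            subset: v1
```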

Depending on the model version header that is assigned to the request, the downstream request flow will be different:

Figure 8. Different request flows based on the model version header set at the system boundary.

These advanced routing mechanisms allow us to define the weighted traffic split across the different model versions. In order to actually perform the model deployment, an orchestrator will be required to perform all the deployment steps in a precise and correct order.

Orchestrating the model deployment with Kubeflow Pipelines

The tools and frameworks needed to perform a model deployment have now been identified. Next, we need to define all the steps required to execute a reliable model deployment. From a high-level point of view, the required steps are:

  1. Deploy the new model version next to the current one.
  2. Direct streaming updates to both model versions.
  3. Start bulk load in order to backfill the new model version with all the re-calculated historical records.
  4. Once the bulk load has been processed by the new model version, it is ready to receive traffic and we can gradually shift traffic from the old to the new model.
  5. Remove the old model version.

Figure 9. Steps to perform the model deployment.

Kubeflow Pipelines is an orchestrator that lets you define a set of operations and their order of execution. Each operation is defined by a Docker container. It is also possible to transfer information from one operation to another; this is extremely helpful when a dependent operation needs the output of one of the previous operations. This is often the case with Machine Learning related workflows, where, for example, you pass the storage bucket path of the preprocessed data on to your training operation.
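The dependency and output-passing idea behind the five deployment steps can be sketched in plain Python. This is not the actual Kubeflow Pipelines DSL; in Kubeflow each step function below would be a containerized operation, and the orchestrator would transfer the outputs between them:

```python
# Hypothetical placeholder steps; in Kubeflow Pipelines each of these
# would be a container operation.
def deploy_new_version(version):
    return f"deployment/{version}"       # e.g. name of the new deployment

def enable_dual_streaming(deployment):
    return f"streaming->{deployment}"    # direct updates to both versions

def bulk_load(deployment):
    return f"backfilled:{deployment}"    # backfill recalculated records

def ramp_traffic(deployment):
    return f"ramped:{deployment}"        # gradually shift traffic

def remove_old_version(old_version):
    return f"removed:{old_version}"      # clean up the old model

def run_pipeline(new_version, old_version):
    """Execute the five deployment steps in order, passing outputs on."""
    deployment = deploy_new_version(new_version)
    enable_dual_streaming(deployment)
    bulk_load(deployment)
    ramp_traffic(deployment)
    return remove_old_version(old_version)
```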

Ramped Model Deployment

The simplest model deployment gradually ramps traffic from one model version to another in a fully automated fashion, following a step-wise function that shifts X% every Y seconds. The speed at which you shift traffic depends heavily on your traffic volume, as you want to give the new model version time to scale to meet the demand it receives.
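Such a step-wise ramp can be sketched as a small generator of target weight percentages; in the real pipeline, each yielded pair would be applied to the Istio routing configuration, followed by a wait of Y seconds:

```python
def ramp_schedule(step_percent, total=100):
    """Yield successive (old, new) traffic weights, shifting
    step_percent of traffic to the new version at each step."""
    new = 0
    while new < total:
        new = min(new + step_percent, total)
        yield total - new, new
```

For example, a 25% step produces the schedule (75, 25), (50, 50), (25, 75), (0, 100).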

Figure 10. Traffic gradually shifted from version 1 to version 2 in a fully-automated manner.

If all the deployment steps mentioned above are properly containerized, it is possible to chain this set of operations together in an execution graph so that the deployment can be reliably orchestrated. We can inspect the execution graph:

Figure 11: High-level orchestration pipeline to deploy the new model in a ramped manner.

In the ‘ramp’ step, the target weight percentages are updated so that traffic is gradually ramped to the other model version, as visualized in figure 10.

A/B Model Deployment

Steps 4 and 5 are slightly different in the A/B model deployment scenario, as the model version to be promoted is variable. In this scenario, a pause step is introduced during the ramp. Based on the model performance, new target weight percentages are defined and the ramping interactively continues until a model is promoted. Traditionally in A/B testing, the new model version only receives a small portion of the traffic at first, and it is checked how the new model version responds to real traffic. Depending on these results, the traffic may be gradually increased if the new model appears to be better. For example, here is a scenario where, after a 50%/50% traffic split, it was decided to roll back to model version 1:

Figure 12. A/B model deployment scenario where version 1 is rolled back after splitting traffic all the way up to 50%/50%.
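The decision taken after each pause step can be sketched as a small function; the performance signal and step size are hypothetical and would come from your A/B analysis:

```python
def ab_next_action(current_new_weight, new_model_is_better, step=10):
    """Decide the next A/B step: keep ramping toward the new model if
    it performs better, otherwise roll all traffic back to the old one."""
    if not new_model_is_better:
        return ("rollback", 0)            # all traffic back to the old version
    if current_new_weight >= 100:
        return ("promote", 100)           # new model fully live
    return ("ramp", min(current_new_weight + step, 100))
```

In figure 12, the analysis at the 50%/50% split returned a negative signal, so the next action would be ("rollback", 0).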

A simple tool to input these target weight percentages could be a message queue like Google Cloud Pub/Sub. The pipeline simply reads messages from a message queue and deploys the target weight percentages. In this case, the execution graph will have a ‘promote’ and ‘rollback’ branch, depending on the results of the A/B testing.

Figure 13. High-level orchestration pipeline to deploy a new model in A/B mode.


In order to benefit from the wide range of available Machine Learning Models, we need to be able to deploy them in production applications. Embedding-based models are particularly hard to deploy since all embeddings in the system need to be recalculated for each model version and traffic needs to be routed correctly while doing the model update.

We described a concrete set of tools and frameworks that can be used to implement the deployment strategy for these embedding-based models. Kubernetes makes it possible to deploy our embedding-based system, configure scaling behavior and expose services. Istio unlocks the fine-grained traffic routing mechanisms required to correctly split and route traffic across the model versions. Kubeflow Pipelines allows us to execute the set of operations required to deploy a new model version in a reliable and reproducible manner.

Mastering the technique of crafting push-button model deployment pipelines will unlock immense potential. You will be able to test new machine learning models swiftly, without pain. Your engineering team no longer needs to spend time redoing manual deployments over and over again. Instead, model deployments are automated, reliable, reproducible and can be triggered with a simple push of a button. This practice is often referred to as MLOps, analogous to DevOps.
