A General Pattern for Deploying Embedding-Based Machine Learning Models

27 January 2020, 10:05

by Robbe Sneyders

Model performance has increased dramatically over the last few years due to an abundance of machine learning research. While these improved models open up new possibilities, they only start providing real value once they can be deployed in production applications. This is one of the main challenges the machine learning community is facing today.

Deploying machine learning applications is in general more complex than deploying conventional software applications, as an extra dimension of change is introduced. While typical software applications can change in their code and data, machine learning applications also need to handle model updates. The rate of model updates can even be quite high, as models need to be regularly retrained on the most recent data.

This article describes a general deployment pattern for one of the more complex kinds of machine learning systems to deploy: those built around embedding-based models. To understand why these systems are particularly hard to deploy, we’ll first take a look at how embedding-based models work.

Embedding-based models

Figure 1. Embedding space of images generated by

Embedding-based models are emerging across all machine learning domains. They have recently unleashed a revolution in the field of NLP and are at the core of most modern recommendation engines. Google uses embeddings to find the best results for your search query, while Spotify uses them to generate personalized music recommendations.

Simply put, these models project or ‘embed’ their input into a vector representation, or embedding. Vision models embed images, language models embed words or sentences, and recommender systems do the same for users and items.

The generated embeddings are extremely powerful, as they can summarize the structure of a dataset in relatively low dimensionality. In the resulting vector space, similar input records are mapped closely together, while dissimilar items are mapped far apart. This enables the comparison of complex objects, which would be impossible in the original data space.
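To make the idea of closeness concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors and their names are made up for illustration; real embeddings come from a trained model and usually have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-written for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
truck = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)  # similar inputs
sim_far = cosine_similarity(cat, truck)     # dissimilar inputs
```

In this toy space, `sim_close` is much larger than `sim_far`, which is the property a similarity search relies on.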

These embeddings can be shared between models across different data domains, and new models solving new problems can be built on top of them. It’s not hard to imagine how valuable such an integrated system of machine learning models can be.

Embedding-based systems

Unfortunately, a single embedding is not very useful by itself; it only becomes powerful when compared with others. Since it’s infeasible, and often undesirable, to recalculate all embeddings for every comparison, they are usually pre-calculated and kept in a real-time data store.

This is exactly what makes these systems hard to deploy. Every time the model is updated, all embeddings need to be recalculated. For systems with millions of records, this can take a long time, during which the normal operation of the live system must not be compromised. Even in smaller systems, such an update is not instant and can lead to inconsistent results if not managed correctly.

We will take a look at two embedding-based systems, a search engine and a recommender system, and define a general deployment strategy that works for both. While these systems are similar, they differ enough that a strategy covering both generalizes to a wide array of embedding-based systems.

Figure 2. A search engine (left), recommender system (middle), and generalized embeddings system (right).

Search engine

The goal of our search engine is to find the best matching documents for a search query. It consists of three components: an application, a model and an embedding data store. When the application receives a search query, it calls the model to translate the query into an embedding, which it then uses to execute a similarity search across the document embeddings in the data store.

Recommender system

The goal of our recommender system is to suggest the most interesting items to a user. It also consists of three components: an application, a user embedding data store and an item embedding data store. To recommend items to a user, the application first fetches the user embedding from the user data store, and then uses it to execute a similarity search across the item data store.

The biggest difference between the two systems is the presence of an online model in the search engine, while all embeddings are pre-calculated in the recommender system. However, the same three functional components can be recognized in both systems:

  • An embedding generator, returning embeddings based on its input. In the search engine, this is the model translating a search query into an embedding. In the recommender system, this is the user embedding data store returning a user’s embedding based on the user’s ID.
  • An embedding server, which hosts the pre-calculated embeddings for similarity search.
  • An application, which fetches an embedding from the embedding generator and sends it to the embedding server to execute a similarity search.
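The three components listed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not an actual implementation: the class names, the brute-force cosine search, and the toy query model are all hypothetical, and a real embedding server would typically use an approximate nearest-neighbour index.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EmbeddingServer:
    """Hosts pre-calculated embeddings and answers similarity searches."""

    def __init__(self, embeddings: Dict[str, Vector]) -> None:
        self._embeddings = dict(embeddings)

    def search(self, query: Vector, k: int = 1) -> List[str]:
        # Brute-force ranking of all stored records by similarity.
        ranked = sorted(
            self._embeddings,
            key=lambda record_id: cosine(query, self._embeddings[record_id]),
            reverse=True,
        )
        return ranked[:k]


class Application:
    """Fetches an embedding from the generator and searches the server."""

    def __init__(self, generator: Callable[[str], Vector],
                 server: EmbeddingServer) -> None:
        self._generator = generator
        self._server = server

    def handle(self, request: str, k: int = 1) -> List[str]:
        return self._server.search(self._generator(request), k)


# Recommender system: the generator is a lookup in the user embedding store.
user_store = {"alice": [1.0, 0.0]}
item_server = EmbeddingServer({"jazz": [0.9, 0.1], "metal": [0.1, 0.9]})
recommender = Application(user_store.__getitem__, item_server)


# Search engine: the generator is an online model (a toy stand-in here).
def toy_query_model(query: str) -> Vector:
    return [float(query.count("a")), float(query.count("b"))]


document_server = EmbeddingServer({"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]})
search_engine = Application(toy_query_model, document_server)

recommendations = recommender.handle("alice")
results = search_engine.handle("aab")
```

Both systems share the same `Application` code; only the embedding generator differs, which is what allows a single deployment pattern to cover both.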

We’ll demonstrate the deployment pattern using this generalized system.

Deploying a new model without downtime

When retraining or fine-tuning a model, the way the data is represented in the embedding space changes. To get coherent results, the embeddings returned by the embedding generator and those stored in the embedding server should be generated by the same model version.

The first step to prepare for a new model deployment is to recalculate the embeddings for all records in the system with the new model and store them in a new data store. The most straightforward way is to calculate them in batch, separate from the live system. Once all embeddings are recalculated, the new embedding generator and server can be deployed into the live system.

A naive approach might be to try and deploy both the new embedding generator and server at exactly the same time. But even if they can both be switched to their new version perfectly in sync, which can be hard to achieve in practice, this approach is still insufficient to guarantee coherent results. Outdated embeddings from the old embedding generator might already be in-flight and reach the embedding server only after the update, leading to a mismatch in the similarity search.

Figure 3. The new versions of both the embedding generator and server are deployed alongside the old ones, so the application can easily switch.

It becomes clear that to ensure continuity, the update should be atomic from the view of a single application call. When the application fetches an embedding from the generator, it should always execute the similarity search in an embedding server with the matching version. To achieve this, the old and new versions of both components need to be deployed alongside each other at least momentarily, during which time the switch between both versions can happen at the level of the application call. Afterwards, the old version can simply be deleted.

Figure 3 shows how consecutive application calls can be switched to the new version this way without introducing any downtime or inconsistency.
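One way to make the switch atomic per application call is to bundle the generator and server of a version together and resolve that bundle exactly once per call. The sketch below assumes this design; `Deployment`, `switch_to`, `retire`, and the toy generators are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass(frozen=True)
class Deployment:
    """A generator and a server built from the same model version."""
    version: str
    generate: Callable[[str], Vector]
    search: Callable[[Vector], List[str]]


def make_deployment(version: str, marker: float) -> Deployment:
    """Toy deployment whose server rejects embeddings of another version."""

    def generate(request: str) -> Vector:
        return [marker]

    def search(embedding: Vector) -> List[str]:
        if embedding != [marker]:
            raise ValueError("embedding comes from a different model version")
        return [f"result-from-{version}"]

    return Deployment(version, generate, search)


class Application:
    def __init__(self, deployments: Dict[str, Deployment], active: str) -> None:
        self._deployments = dict(deployments)
        self._active = active

    def switch_to(self, version: str) -> None:
        if version not in self._deployments:
            raise KeyError(version)
        self._active = version

    def retire(self, version: str) -> None:
        if version == self._active:
            raise ValueError("cannot retire the active version")
        del self._deployments[version]

    def handle(self, request: str) -> List[str]:
        # Resolve the version once; generator and server then always match,
        # even if switch_to() is called while this request is in flight.
        deployment = self._deployments[self._active]
        embedding = deployment.generate(request)
        return deployment.search(embedding)


app = Application(
    {"v1": make_deployment("v1", 1.0), "v2": make_deployment("v2", 2.0)},
    active="v1",
)
before = app.handle("query")  # served entirely by v1
app.switch_to("v2")
after = app.handle("query")   # served entirely by v2
app.retire("v1")              # the old version can now be deleted
```

Because each call holds on to a single `Deployment`, an in-flight embedding from the old generator can never reach the new server.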

Going streaming

Modern systems are often more complex than the simple ones we initially introduced, since the data they handle needs to be kept up-to-date continuously. New documents need to be added to our search engine, or existing documents might get updated. A new user might subscribe to our recommender system or update their profile, while new items could regularly be added to the catalog.

Some systems might be able to get away with calculating these changes in batch and periodically replacing the old data store with a new one, but doing so would add a significant delay before a new or updated record becomes available in the system. With streaming updates becoming a requirement for more and more systems, a streaming-native deployment strategy is needed. To derive this strategy, we’ll first re-introduce our search engine and recommender system as streaming systems.

Figure 4. Loading streaming updates into the search engine.

Upgrading our search engine for streaming updates is almost trivial, as it already hosts a model for online embedding calculation which can be reused for streaming inference during data loading. A new data loader component is introduced into the system, which orchestrates incoming document updates. It first embeds the incoming documents with the online model, and then writes the generated embeddings to the embedding data store.

Upgrading our recommender system requires a bit more effort, as the streaming updates require an online model, which previously was not a part of the system. After adding both an online model and a data loader component, the data loading flow is equivalent to that of the search engine. The data loader first calls the online model to embed the items and users, and then writes the generated embeddings to the corresponding data store.
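The data loading flow can be sketched as follows. This is a minimal illustration in which the data store is a plain dictionary and `toy_model` stands in for the online model; both are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Vector = List[float]


class DataLoader:
    """Embeds incoming records with the online model and stores the result."""

    def __init__(self, model: Callable[[str], Vector],
                 store: Dict[str, Vector]) -> None:
        self._model = model
        self._store = store

    def load(self, updates: Iterable[Tuple[str, str]]) -> None:
        for record_id, content in updates:
            # Upsert: new records are added, existing ones are overwritten.
            self._store[record_id] = self._model(content)


def toy_model(text: str) -> Vector:
    # Hypothetical stand-in for a real embedding model.
    return [float(len(text)), float(text.count(" ") + 1)]


store: Dict[str, Vector] = {}
loader = DataLoader(toy_model, store)
loader.load([("doc-1", "hello world"), ("doc-2", "hi")])
loader.load([("doc-1", "hello")])  # an update to an existing document
```

For the recommender system, the same loader would be instantiated once for the user store and once for the item store.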

Since both systems are equivalent, we’ll demonstrate the streaming model deployment using the search engine for simplicity.

Streaming model deployment

Figure 5. During a streaming model deployment, a bulk load is performed by the new version, while both versions keep receiving streaming updates.

Instead of separately pre-calculating all the embeddings for a new version in batch, we’ll now integrate this into the streaming system itself. As a first step, the new version of the model and a new data store are deployed alongside the original version. To make sure that no streaming updates are lost during loading time, they are directed to both versions. A bulk load of all records in the system is then started and routed only to the new version of the model and data store.

When the bulk load is complete, both versions contain the same data records, but with embeddings calculated by their respective model. This state is identical to the one we discussed for the batch systems, and just like before the application can now easily switch traffic to the new version. Once the switch is complete, the old version can be deleted.
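The steps above can be sketched as a small simulation. All names here (`VersionedPipeline`, `on_streaming_update`, `bulk_load`) and the toy models are hypothetical. The sketch also makes one assumption that goes beyond the text: the bulk load skips records that a streaming update has already written to the new store, so that a newer update is never overwritten by older bulk data. A production system would need its own ordering guarantee, for example based on timestamps.

```python
from typing import Callable, Dict, List

Vector = List[float]


class VersionedPipeline:
    """A model version plus the data store holding its embeddings."""

    def __init__(self, model: Callable[[str], Vector]) -> None:
        self.model = model
        self.store: Dict[str, Vector] = {}

    def ingest(self, record_id: str, content: str) -> None:
        self.store[record_id] = self.model(content)


def model_v1(text: str) -> Vector:
    return [float(len(text))]


def model_v2(text: str) -> Vector:
    return [float(len(text)), float(text.count("e"))]


# Source of truth for all records, and the old version already serving them.
records = {"doc-1": "embedding", "doc-2": "model"}
old = VersionedPipeline(model_v1)
for record_id, content in records.items():
    old.ingest(record_id, content)

# Step 1: deploy the new model and an empty data store alongside the old one.
new = VersionedPipeline(model_v2)
live_pipelines = [old, new]


def on_streaming_update(record_id: str, content: str) -> None:
    # Step 2: streaming updates are directed to both versions.
    records[record_id] = content
    for pipeline in live_pipelines:
        pipeline.ingest(record_id, content)


def bulk_load(pipeline: VersionedPipeline, snapshot: Dict[str, str]) -> None:
    # Step 3: the bulk load is routed only to the new version.
    for record_id, content in snapshot.items():
        if record_id not in pipeline.store:  # assumption: streaming data wins
            pipeline.ingest(record_id, content)


snapshot = dict(records)                    # bulk load starts from a snapshot
on_streaming_update("doc-1", "embeddings")  # update arrives during the load
on_streaming_update("doc-3", "store")       # new record arrives during the load
bulk_load(new, snapshot)
```

After the bulk load, `old.store` and `new.store` hold the same record IDs, each with embeddings from its own model, and traffic can be switched.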

This approach has the additional advantage that the same components are used for both streaming and bulk load calculation, leading to consistent results at all times. For the search engine, the same model component is used for online search embedding as well, preventing mismatches between online calculated and pre-calculated embeddings.

Model A/B testing

Figure 6. Traffic can be shifted gradually to a new version. By freezing the shift at a certain percentage, the new model can be A/B tested.

For a system deployed in production, there’s a good chance that components of the active version have scaled to handle the incoming load. A hard switch between versions might then put too much load on the new version at once. Since both versions should be available simultaneously anyway, traffic can be shifted to the new version gradually instead, giving it time to scale as needed. This also reduces the impact of any problems possibly occurring with the deployment of a new version, as the shift can be stopped or reversed if needed.

The same mechanism can be used for A/B testing the new model by freezing the shift at a fixed percentage. The new model can then be evaluated and compared to the active model before deciding if the deployment should be completed or rolled back. Since loading the full data set can be expensive, automated tests can use the same mechanism while loading is still in progress to benchmark the new model on a limited production dataset.
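A gradual shift can be sketched as a splitter that sends a configurable share of calls to the new version. The sketch below assumes that routing is keyed on a stable identifier such as a user ID, so that the same caller stays on the same version while the share is frozen for an A/B test. That keying choice, and the names `bucket` and `TrafficSplitter`, are illustrative assumptions.

```python
import hashlib


def bucket(key: str) -> float:
    """Map a key to a stable value in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and the result stays below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class TrafficSplitter:
    def __init__(self, old: str, new: str, new_share: float = 0.0) -> None:
        self.old = old
        self.new = new
        self.new_share = 0.0
        self.set_new_share(new_share)

    def set_new_share(self, share: float) -> None:
        if not 0.0 <= share <= 1.0:
            raise ValueError("share must be between 0 and 1")
        self.new_share = share

    def route(self, key: str) -> str:
        # Keys whose bucket falls below the share go to the new version.
        return self.new if bucket(key) < self.new_share else self.old


splitter = TrafficSplitter(old="v1", new="v2")
keys = [f"user-{i}" for i in range(10_000)]

splitter.set_new_share(0.2)  # freeze here to A/B test the new model
on_new_at_20 = {k for k in keys if splitter.route(k) == "v2"}

splitter.set_new_share(0.5)  # continue the shift
on_new_at_50 = {k for k in keys if splitter.route(k) == "v2"}

splitter.set_new_share(1.0)  # shift complete; the old version can be deleted
on_new_at_100 = {k for k in keys if splitter.route(k) == "v2"}
```

Raising the share only ever moves additional keys to the new version, and setting it back to a lower value reverses the shift.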


To start benefiting from the model improvements brought on by machine learning research, we need to be able to deploy improved models in production applications. Embedding-based models are opening up new possibilities across domains, but are particularly hard to deploy since all embeddings in the system need to be recalculated for each model version.

We described a general deployment strategy for these embedding-based models, which deploys a new version of both the model and embedding store alongside the previous version, so the system can easily switch. We extended the strategy for streaming systems and showed how it natively supports A/B testing.

With this deployment strategy and A/B testing in place, we can rapidly iterate on new model improvements and accelerate further research.

Interested in how we apply this strategy at ML6? Keep an eye out for our follow-up blog post where we’ll discuss tools & frameworks.