04 June 2019, 15:09
Serving your machine learning model so that it can be used in a real production-ready setting is more than simply “hosting” your model, especially given the broad range of options to choose from. The right choice also depends on the skills you have in-house and how much of the infrastructure you want to manage yourself. In this post I will show you some options using Google Cloud as the infrastructure provider. The number of machine learning frameworks can be overwhelming as well; when you want to prepare for serving, I would suggest Tensorflow, because it has some great built-in libraries for serving on almost any device.
. . .
Before we continue, let me briefly explain the importance of robust serving methods. So what is serving? Serving is opening up your trained model for predictions to other applications. First we should define which requirements our serving setup has to adhere to; consider the following:
What you see above is what I call the scale of ML Serving. On the left we have the self-managed solution, which still relies on the basic infrastructure of Google: your Tensorflow model is stored on top of Compute Engine, and with a Flask wrapper you can expose your model to the internet. In its basic form, however, this setup cannot guarantee any of the requirements, so you are solely responsible for all of them.
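To make the Flask-wrapper idea concrete, here is a minimal sketch. The route name and port are illustrative, and a plain Python function stands in for the Tensorflow model so the sketch stays self-contained:

```python
# Minimal sketch of the Flask-wrapper approach on Compute Engine; the
# route name and port are illustrative choices, and a plain Python
# function stands in for the Tensorflow model.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(instances):
    # In a real setup this would call a loaded Tensorflow SavedModel.
    return [sum(features) for features in instances]

@app.route("/predict", methods=["POST"])
def serve():
    payload = request.get_json(force=True)
    return jsonify({"predictions": predict(payload["instances"])})

# To expose the model on the VM's network interface:
# app.run(host="0.0.0.0", port=8080)
```

Anything beyond this (retries, health checks, scaling) you would have to build and operate yourself, which is exactly why this end of the scale is so fragile.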
When we move to the right, we opt for more scalable solutions: by relying on the power of the containerized solutions of Kubernetes we can make the system scalable, using Tensorflow Serving (which is included in Tensorflow Extended (TFX)).
In summary, Tensorflow Serving is a Docker image, available through Docker Hub, which you can simply add to your YAML file to deploy a container. Instead of Tensorflow Serving one could also opt for Seldon Serving, a new player which markets itself as being ML-framework and cloud-provider agnostic.
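Once such a container is running, clients talk to Tensorflow Serving over its REST API (exposed on port 8501 by default): a POST to `/v1/models/<name>:predict` with an `{"instances": [...]}` body. A small sketch that builds such a request, with placeholder host and model names and without assuming a live server:

```python
# Building a request for Tensorflow Serving's REST API; the host and
# model name are placeholders. The serving container exposes the REST
# endpoint on port 8501 by default.
import json
import urllib.request

def make_predict_request(host, model_name, instances):
    url = f"http://{host}:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = make_predict_request("localhost", "my_model", [[1.0, 2.0]])
# urllib.request.urlopen(req) would return a JSON body of the form
# {"predictions": [...]} once a tensorflow/serving container is
# reachable at that address.
```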
The last option is to outsource all the infrastructure of your ML deployment to Google and make use of the Cloud Machine Learning Engine (ML Engine), which is part of the AI Platform announced at Google Next ’19. With this platform you can train and deploy your model without much infrastructure overhead.
Typically, we would never recommend using Compute Engine running a Flask API, so the choice is really between TFX on Kubernetes and ML Engine. To make the trade-off we will consider the following aspects:
Unfortunately it’s not trivial to compare the cost of both options without proper context: the application that is using the model and the specifics of your organization. The only way to compare the tools is to look at the Total Cost of Ownership (TCO). We will simplify the TCO into two categories:
Performance can be seen as a combination of throughput and latency.
Both options allow for horizontal scaling and should thereby allow for near-unlimited throughput, with either manual scaling or autoscaling. To get started, the autoscaling of ML Engine is very easy, as it is enabled by default. However, if you are not happy with the autoscaling offered by ML Engine, you don’t have many options to tune it, whereas the TFX/Kubernetes option gives you full control. On the other hand, ML Engine does allow for batch predictions, which TFX does not support out of the box, and this enables very efficient scaling when you need high throughput on batch data.
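Submitting such a batch job comes down to posting a job spec to the ML Engine REST API (`projects.jobs.create`). A sketch of the request body, with placeholder bucket, project, and model names (the field names follow my reading of the API reference):

```python
# Sketch of a batch-prediction job body for the ML Engine REST API
# (projects.jobs.create); project, bucket, and model names are placeholders.
def batch_prediction_job(job_id, model, input_paths, output_path):
    return {
        "jobId": job_id,
        "predictionInput": {
            "dataFormat": "JSON",        # newline-delimited JSON instances
            "inputPaths": input_paths,
            "outputPath": output_path,
            "modelName": model,          # e.g. projects/my-proj/models/my_model
            "region": "europe-west1",
        },
    }

job = batch_prediction_job(
    "predict_20190604",
    "projects/my-proj/models/my_model",
    ["gs://my-bucket/inputs/*"],
    "gs://my-bucket/outputs/",
)
```

Because ML Engine spins workers up and down per job, you only pay while the batch is being processed.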
Regarding latency we have the same story: ML Engine has very nice and simple defaults, but doesn’t allow for very specific customizations (or a very wide range of machine types). This means that you will get fair performance even if you are not an expert in optimizing your system, but if you are an expert, you will not be able to optimize as much as with the TFX/Kubernetes combo (for an example, check out this nice blogpost).
As mentioned above, TFX allows for many more customizations via the Kubernetes ecosystem regarding scaling and supported machine types. ML Engine also enforces some restrictions on model size (soft limit: 250 MB) and request size (hard limit: 4 MB).
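Since the 4 MB request-size limit is a hard one, it can be worth guarding against it on the client side before calling the online-prediction endpoint. A small sketch (the helper name is my own, not part of any Google SDK):

```python
# Client-side guard against ML Engine's 4 MB request-size hard limit
# for online prediction (the helper name is illustrative, not part of
# any Google SDK).
import json

MAX_REQUEST_BYTES = 4 * 1024 * 1024  # ML Engine online-prediction hard limit

def fits_online_prediction(instances):
    payload = json.dumps({"instances": instances}).encode("utf-8")
    return len(payload) <= MAX_REQUEST_BYTES

# Small payloads pass; oversized batches should be split up or routed
# to batch prediction instead.
```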
On the other hand, ML Engine is more flexible with respect to frameworks, supporting XGBoost and Scikit-learn among others, which gives you the flexibility to go beyond Tensorflow. This is especially interesting if you are starting a project and are not yet sure which frameworks you want to use. Furthermore, ML Engine now also supports custom prediction routines (documentation). This means that you can include custom Python code in the model-serving service, which allows extreme flexibility in the format of the data that can be sent to your model and thereby makes it much easier for other services to interact with your model.
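A custom prediction routine is, per the documentation, a class exposing a `from_path` class method for loading and a `predict` method for inference. A sketch of that interface, where the preprocessing logic and the stand-in model are purely illustrative:

```python
# Sketch of an AI Platform custom prediction routine: a class with
# `from_path` and `predict`, following the documented interface. The
# preprocessing step and the stand-in model are illustrative only.
class MyPredictor:
    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # Custom Python runs around the model call, so clients can send
        # e.g. raw comma-separated strings instead of pre-built tensors.
        preprocessed = [[float(v) for v in row.split(",")] for row in instances]
        return [sum(row) for row in preprocessed]  # stand-in for model inference

    @classmethod
    def from_path(cls, model_dir):
        # Normally this loads a saved model from model_dir; here it just
        # returns a dummy predictor.
        return cls(model=None)

predictor = MyPredictor.from_path("gs://my-bucket/model/")  # hypothetical path
```

The preprocessing living server-side is what buys you the flexibility: calling services no longer need to know how to vectorize inputs for your model.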
If you are not familiar with the Kubernetes ecosystem and/or you are still exploring multiple ML frameworks, ML Engine is the perfect companion to easily serve ML models. Also, if you don’t expect a lot of load and don’t have exotic requirements, ML Engine can potentially be the cheaper option.
However, if you already have a lot of Kubernetes expertise and have some spare resources available on a cluster, or if you expect to hit some “exotic” requirements not covered by ML Engine (model size, latency, request size), you will be better off using TFX on Kubernetes. Be aware, though, of the added developer cost of this option.