
A Guide to GPU Sharing on Top of Kubernetes

09 July 2020, 08:19

Currently, standard Kubernetes does not support sharing GPUs across pods.

Kubernetes documentation on scheduling GPUs

However, by extending the Kubernetes scheduler we confirmed that it is possible to virtualise GPU memory and thereby unlock flexibility in sharing and utilising the available GPU capacity.

In this blog post, we aim to guide you through the set-up steps to share Nvidia GPUs on top of Kubernetes, a container-orchestration platform. Being able to share GPUs allows you to stretch multiple workloads onto a single GPU. For example, during training you might utilise the full GPU capacity with one training job, but during inference you will most likely want to spread multiple instances across the available GPU(s).

When we talk about “GPU sharing” in this post, we mean sharing the GPU on top of Kubernetes. Sharing a GPU within a virtual machine or on bare metal is also feasible but not sufficient if we want to unlock the GPU capacity in a container-orchestration framework.

We assume that you start from a Kubernetes cluster with control plane access. In a follow-up post we will provide sample code to deploy a self-managed Kubernetes cluster on Google Cloud.

GPU sharing importance

You might wonder why GPU sharing is important, given that nowadays we can spin up a wide range of GPUs on demand at multiple cloud providers. The problem is more visible and profound in an on-premise setting.

Let’s consider an on-premise setting with a fixed number of bare metal machines available, equipped with some GPU cards. If we deploy Kubernetes on top of these machines, we have basically virtualised the compute and memory capacity of these machines into a cluster. Kubernetes implements Device Plugins to let Pods access specialised hardware features such as GPUs (Kubernetes docs). But as you can see from the docs (as mentioned in the introduction):

  • Containers (and Pods) do not share GPUs. There’s no overcommitting of GPUs.
  • Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.

The consequence is that a single container claims the entire GPU. This is especially worrisome when your workload only utilises a small fraction of the GPU's power. It would be beneficial if multiple containers could share the GPU, so that several workloads can be packed onto a single GPU and the GPU is used to its full extent.

GPU sharing possibilities

In theory, a GPU can be partitioned at the hardware layer. In that case the separation happens at the physical level and each workload is isolated to its own slice of the hardware. Not many GPUs support this kind of hardware partitioning; it is likely a costly feature that will only ship with very large GPUs.

Another way of separating and sharing the GPU is at the software layer. In this case the separation is done in software and every workload can, in principle, reach the entire GPU capacity. The software layer therefore has to handle the isolation of workloads, meaning that you should limit the GPU usage somewhere in your code. In the context of machine learning, TensorFlow lets you limit the GPU memory a process may claim, as per the instructions here.
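
As a minimal sketch of what such a software-level limit can look like (an illustration on our side, assuming TensorFlow 2.x; train.py is a hypothetical training script and not part of the set-up below), the TensorFlow GPU guide describes enabling memory growth so that a process only allocates the GPU memory it actually needs, for example via an environment variable:

# Example only: keep a TensorFlow 2.x process from grabbing the whole GPU up front
$ export TF_FORCE_GPU_ALLOW_GROWTH=true
$ python train.py  # train.py is a hypothetical training script

A hard per-process memory cap can also be set through tf.config inside the training code itself; see the linked TensorFlow instructions for the details.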

In this blog post, we exploit GPU sharing at the software layer within the Kubernetes container-orchestration framework. We will ensure that the scheduler takes GPU type and GPU memory resource requirements into account. Based on these requirements, the scheduler places the container on a worker node and exposes the GPU hardware to the container.

Starting from a Kubernetes cluster with control plane access

In order to install the GPU sharing extension, we will need to make some adjustments to the Kubernetes master node(s). Make sure you have sudo access rights on the Kubernetes master node(s).
Our cluster contains 3 master nodes and 4 worker nodes:


Kubernetes nodes

In our set-up, all the nodes are equipped with Ubuntu 18.04 LTS, and one of the worker nodes has a Tesla K80 GPU attached.

Preparing your node(s)

Preparing the nodes involves a couple of steps, per node. We will showcase the commands for our Ubuntu machines but we point to the relevant documentation where you can find installation instructions for other operating systems.

First, we will install the GPU driver on the specific node. By installing the GPU driver, the operating system knows how to talk to the GPU. The installation process on Ubuntu takes just a couple of commands:

# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP

# Verify ubuntu driver
$ sudo apt install ubuntu-drivers-common
$ ubuntu-drivers devices

# Install the recommended driver
$ sudo ubuntu-drivers autoinstall

# Reboot the machine
$ sudo reboot

# After the reboot, test if the driver is installed correctly
$ nvidia-smi

Next, we need to install nvidia-docker2 so that containers can also access the GPU through the Nvidia container runtime. For our Ubuntu machine, we can install nvidia-docker2 by executing the following commands:

# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

Finally, we need to set the Nvidia runtime as the default container runtime. You can do this by editing the Docker daemon config file that is normally present at /etc/docker/daemon.json. The following command simply overwrites the existing Docker daemon config with the required config:

# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP
$ sudo tee /etc/docker/daemon.json <<EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
$ sudo pkill -SIGHUP docker
$ sudo reboot

By setting the Nvidia runtime as the default container runtime, we ensure that containers get mounted in the Nvidia runtime so they can access the GPU, if applicable.
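
To sanity-check the runtime switch before moving on (the CUDA image tag below is just an example; pick any CUDA base image that matches your driver), you can run nvidia-smi from inside a container:

# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP

# With the Nvidia runtime as default, the container should print the same GPU table as the host
$ sudo docker run --rm nvidia/cuda:10.2-base nvidia-smi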

Installing the GPU sharing extension

Now that our nodes are properly set up, we will install the GPU sharing scheduling extension. We will closely follow the remaining steps outlined in the installation guide.

On the control plane, we will extend the scheduler so it can take into account GPU memory:

# SSH into master node(s)
$ ssh USERNAME@EXTERNAL_IP

$ cd /etc/kubernetes/
$ sudo curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json

$ sudo cp /etc/kubernetes/manifests/kube-scheduler.yaml /tmp
# edit the file as explained in the installation guide (add the --policy-config-file flag and mount the policy file)
$ sudo vim /tmp/kube-scheduler.yaml
$ sudo cp /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml

# verify that the file is now extended with the extra config
$ sudo vim /etc/kubernetes/manifests/kube-scheduler.yaml
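
For reference, the edit boils down to pointing kube-scheduler at the downloaded policy file and mounting that file into the static pod. A rough sketch of the relevant parts, following the installation guide (exact fields may differ per Kubernetes version, so treat this as an illustration rather than a drop-in file):

# /tmp/kube-scheduler.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-scheduler
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
    # keep the existing flags as they are
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler-policy-config.json
      name: scheduler-policy-config
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: scheduler-policy-config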

The kube-scheduler will automatically restart once it detects the manifest has been altered.
Next, from your local workstation that has access to the Kubernetes API, we will deploy the GPU sharing scheduler extender and the device plugin:

# From your local machine that has access to the Kubernetes API
$ curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
$ kubectl create -f gpushare-schd-extender.yaml

$ wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
$ kubectl create -f device-plugin-rbac.yaml

$ wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
# update the local file so the first line is 'apiVersion: apps/v1'

$ kubectl create -f device-plugin-ds.yaml

The device plugin is a daemonset that will only run on nodes matching a label selector. In this case, we need to add the label "gpushare=true" to all nodes where we want to install the device plugin:

# From your local machine that has access to the Kubernetes API
$ kubectl label node worker-gpu-0 gpushare=true
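
Once the node is labelled, it is worth checking that both the scheduler extender and the device plugin pods are running (the component names follow the upstream manifests; adjust the filter if yours differ):

# From your local machine that has access to the Kubernetes API
$ kubectl get pods -n kube-system | grep gpushare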

The device plugin will expose the GPU memory capacity and keep track of the GPU memory allocation:


GPU worker node Capacity and Allocation details
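
You can retrieve the same information straight from the node object; look for the aliyun.com/gpu-mem extended resource (the same resource name we will request in the pod manifests below) under Capacity and Allocatable:

# From your local machine that has access to the Kubernetes API
$ kubectl describe node worker-gpu-0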

The kubectl gpushare extension is currently only available for Linux, so you will have to install kubectl and the extension on a Linux machine:

# SSH into linux machine
$ ssh USERNAME@EXTERNAL_IP
$ mkdir .kube
$ exit

# copy over the kubeconfig from your local workstation to the linux machine (if your local workstation is not linux)
$ scp kubeconfig.conf USERNAME@EXTERNAL_IP:/home/USERNAME/.kube/config

# SSH back into linux machine
$ ssh USERNAME@EXTERNAL_IP

# kubectl installation
$ curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.12.1/bin/linux/amd64/kubectl
$ chmod +x ./kubectl
$ sudo mv ./kubectl /usr/bin/kubectl

# kubectl gpushare extension installation
$ cd /usr/bin/
$ sudo wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
$ sudo chmod u+x /usr/bin/kubectl-inspect-gpushare

You can verify the kubectl gpushare client by running kubectl inspect gpushare:


Kubectl GPU sharing client — No GPU Memory allocation yet.

Currently, no GPU-enabled workload is running on the cluster, and therefore the GPU memory allocation is 0%.

Smoke Test

Let’s have a look at how we can request GPU Memory for a specific workload:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-pod1
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-share-pod1
    image: "cheyang/gpu-player:v2"
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    resources:
      limits:
        aliyun.com/gpu-mem: 3

As you can see, requesting GPU memory is similar to requesting CPU or memory. After applying this manifest, you should see that the pod is scheduled onto the GPU worker node:
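
Assuming you saved the manifest above as gpu-share-pod1.yaml (the file name is up to you), applying it and checking the placement boils down to:

# From your local machine that has access to the Kubernetes API
$ kubectl create -f gpu-share-pod1.yaml
$ kubectl get pod gpu-share-pod1 -o wide   # the NODE column should show the GPU worker
$ kubectl inspect gpushare                 # shows the per-node GPU memory allocation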

When inspecting the GPU sharing metrics, we can see that the pod has been 'virtually' allocated 3 gigabytes of VRAM.


Kubectl GPU sharing client — 3Gig GPU Memory allocated

When we verify the logs of the pod itself, we can see that TensorFlow successfully found the GPU device:

Logs of pod “gpu-share-pod1” — TensorFlow finds GPU device

We can spin up another workload to ensure that different pods can reach the same underlying GPU:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-pod2
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-share-pod2
    image: "cheyang/gpu-player:v2"
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    resources:
      limits:
        aliyun.com/gpu-mem: 5

After applying the manifest, we can see that the GPU memory allocation was successfully updated:

Kubectl GPU sharing client — 3+5 GPU Memory allocated

When we again consult the logs of the pod, we can see that TensorFlow is able to see the GPU device, and note that the reported freeMemory is now 8Gig instead of 11Gig:

Logs of pod “gpu-share-pod2” — TensorFlow finds GPU device with up-to-date freeMemory

As a final test, we can SSH into the GPU worker node and check the GPU processes:


Nvidia-smi output showing two processes are attached to the GPU

Conclusion

In the context of Machine Learning, more and more workloads benefit from using a Graphics Processing Unit (GPU). However, these GPU devices are costly so it is extremely important that they can be utilised to their full extent. Especially in an on-premise setting, where you are limited to a fixed amount of machines and GPUs, it’s important to be able to share the GPU capacity across different workloads. For example, during training you might occupy the full GPU capacity with one training job, but during inference you will most likely spread multiple instances across the available GPUs.
Currently, standard Kubernetes does not support sharing GPUs across pods. However, by extending the Kubernetes scheduler we confirmed that it is possible to virtualise GPU memory, similarly to how Kubernetes virtualises the compute and memory of your worker nodes. It is important to note that this sharing mechanism happens at the software layer, and GPU isolation should also be taken care of in the workloads that you run on the GPU (e.g. the TensorFlow memory limit instructions).