Unlock the Power of GPUs in Kubernetes for AI Workloads
Here’s a question. Where do we run AI models? Everyone knows the answer to that one. We run them on servers with GPUs. GPUs are much more efficient at running AI models or, to be more precise, at inference.
Here’s another question. How do we manage models across those servers? The answer to that question is… Kubernetes.
Kubernetes is the de-facto standard for managing any type of workload, be it stateless apps, stateful apps, jobs, AI models, or anything else. We just need to tell Kubernetes how much memory and CPU our workloads need and it will figure out where to put them. However, with AI our primary requirement is not the amount of memory and CPU but, rather, the amount of GPU we need, whether a GPU is dedicated to a single process exclusively or shared across processes, and a few other things. In other words, the way we manage GPU-based workloads is different from more traditional workloads, yet somehow the same.
Today we’ll skip the explanation of why AI works better with GPUs than CPUs and focus on running AI models and managing GPUs in Kubernetes. We’ll explore not only how to run AI models in Kubernetes but also how to do it in a way that does not result in bankruptcy.
Setup
git clone https://github.com/vfarcic/kubernetes-gpu-demo
cd kubernetes-gpu-demo
Watch https://youtu.be/WiFLtcBvGMU if you are not familiar with Devbox. Alternatively, you can skip Devbox and install all the tools listed in devbox.json yourself.
devbox shell
The setup currently works only with Google Cloud. Some modifications to both the setup and the steps that follow might be required if you prefer using a different provider.
chmod +x setup.nu
./setup.nu
source .env
Using GPUs for AI in Kubernetes
I already have a Kubernetes cluster running. It’s a “normal” cluster that happens to be GKE in Google Cloud. Nevertheless, the logic behind the rest of the post should be the same no matter where your cluster is.
That cluster does not have any nodes with GPUs so there are at least three tasks we need to perform to make it “AI-ready”.
To begin with, we need to figure out how to create nodes or node groups with GPUs. Now, those nodes would not be of much use by themselves. We would also need to install device plugins that will let Pods access specialized hardware features which, in this case, are GPUs. Finally, we need to instruct Pods to use GPUs.
Let’s start from the beginning.
Here are the nodes of my cluster.
kubectl get nodes
The output is as follows (truncated for brevity).
NAME STATUS ROLES AGE VERSION
gke-dot-default-pool-... Ready <none> 112s v1.29.7-gke.1008000
gke-dot-default-pool-... Ready <none> 38s v1.29.7-gke.1008000
As I already mentioned, none of the nodes currently used by that cluster have GPUs. Hence, we should add a node group or a node pool with GPUs to the cluster. But, before we do that, we need to figure out what is available from a given provider. In the case of Google Cloud, we can list all the accelerator types currently available.
gcloud compute accelerator-types list --project $PROJECT_ID
The output is as follows (truncated for brevity).
NAME ZONE DESCRIPTION
...
nvidia-l4 us-east1-b NVIDIA L4
nvidia-l4-vws us-east1-b NVIDIA L4 Virtual Workstation
nvidia-tesla-a100 us-east1-b NVIDIA A100 40GB
nvidia-tesla-p100 us-east1-b NVIDIA Tesla P100
nvidia-tesla-p100-vws us-east1-b NVIDIA Tesla P100 Virtual Workstation
...
Those are the accelerators we can pick from. We also need to select a machine type just as we normally would, except that machine types with GPUs attached tend to be named differently.
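If you’re not sure which machine types come with a given accelerator, you can list them as well. Here’s a sketch that assumes the us-east1-b zone used throughout this post and filters for the a2 family that comes with A100 GPUs.
gcloud compute machine-types list --project $PROJECT_ID \
    --zones us-east1-b --filter "name:a2-*"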
Anyways… We’ll create a new node pool that will be attached to the dot cluster. Since I’m cheap, we’ll select the smallest machine type that comes with that GPU (a2-highgpu-1g), only 1 node and, most importantly, choose nvidia-tesla-a100 as the GPU.
gcloud container node-pools create dot-gpu \
--project $PROJECT_ID --cluster dot --zone us-east1-b \
--machine-type a2-highgpu-1g --num-nodes 1 \
--no-enable-autoupgrade \
--accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=default
The output is as follows.
...
Creating node pool dot-gpu...done.
Created [https://container.googleapis.com/v1/projects/dot-20240819000228/zones/us-east1-b/clusters/dot/nodePools/dot-gpu].
NAME MACHINE_TYPE DISK_SIZE_GB NODE_VERSION
dot-gpu a2-highgpu-1g 100 1.29.7-gke.1008000
Here are the nodes of the cluster.
kubectl get nodes
The output is as follows.
NAME STATUS ROLES AGE VERSION
gke-dot-default-pool-... Ready <none> 10m v1.29.7-gke.1008000
gke-dot-default-pool-... Ready <none> 8m47s v1.29.7-gke.1008000
gke-dot-dot-gpu-... Ready <none> 47s v1.29.7-gke.1008000
That’s it. That’s all it takes to add nodes with GPUs to the cluster, and with that single action we not only added nodes with GPUs (gke-dot-dot-gpu-*) to the cluster but also made sure that those nodes have the necessary device plugins installed.
Now, to be clear, not all providers are as easy as Google Cloud, and the instructions for your favorite provider might vary. What matters is that we completed two out of the three steps needed to run AI models in the cluster, and the rest is going to be the same no matter which provider you prefer using.
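If your provider does not install the device plugin for you, that step usually means deploying NVIDIA's device plugin yourself. Here's a minimal sketch using the official Helm chart from the NVIDIA/k8s-device-plugin project; check that project's documentation for the currently recommended version and values, and keep in mind that the GPU drivers themselves still need to be present on the nodes.
helm upgrade --install nvidia-device-plugin nvidia-device-plugin \
    --repo https://nvidia.github.io/k8s-device-plugin \
    --namespace nvidia-device-plugin --create-namespace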
We’re almost ready to use those new GPU nodes. The only thing missing is to double-check the taints on those nodes.
kubectl get node \
--selector cloud.google.com/gke-nodepool=dot-gpu \
--output jsonpath="{.items[0].spec.taints[0]}" \
| jq .
The output is as follows.
{
"effect": "NoSchedule",
"key": "nvidia.com/gpu",
"value": "present"
}
The important thing to note is that those nodes have the NoSchedule taint effect, meaning that no Pods will be scheduled on them unless we explicitly specify the matching toleration.
Now we should be able to run our own or third-party apps that contain models requiring GPUs. We’ll opt for the latter since it’s easier and, by exploring the outcome, it will give us a better understanding of what is required. We’ll run Ollama, which already has a Helm chart available, so all we have to do is specify the type of GPU it should use.
I will assume that you are familiar with Ollama and AI models. If that’s not the case and you would like me to explore them in more detail, all you have to do is let me know in the comments of this video.
Here’s an example Helm values file we’ll use.
cat ollama-values.yaml
The output is as follows.
ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    - llama2
ingress:
  enabled: true
  className: traefik
  hosts:
    - host: ollama.35.229.75.73.nip.io
      paths:
        - path: /
          pathType: Prefix
Over there, we’re specifying that gpu usage should be enabled, that we are using nvidia for that, and that the application should use only 1 GPU. Further on, since Ollama allows us to use quite a few models, we’re specifying that today we’re interested only in llama2. The rest are ingress values that are not relevant for today’s subject.
Now, to be clear, those values by themselves will not help you understand how to use GPU with your own or with any other third-party models. However, later on, we’ll inspect what Ollama deployed and that will enable us to deduce what we should specify if we’d like to use GPU with virtually any application or any model since the logic is always the same.
Let’s apply the chart.
helm upgrade --install ollama ollama \
--repo https://otwld.github.io/ollama-helm \
--values ollama-values.yaml \
--namespace ollama --create-namespace --wait
The output is as follows.
Release "ollama" does not exist. Installing it now.
NAME: ollama
LAST DEPLOYED: Mon Aug 19 00:23:35 2024
NAMESPACE: ollama
STATUS: deployed
REVISION: 1
NOTES:
1. Get the application URL by running these commands:
http://ollama.35.229.75.73.nip.io/
Now comes the important part. Since all the changes needed to make containers in Pods use a GPU are in the manifests of those Pods, we can inspect the Deployment Ollama created. Since that Deployment creates a ReplicaSet which creates the Pods, it should give us a clue as to what we should do to use GPUs.
kubectl --namespace ollama get deployment ollama \
--output yaml | yq .
The output is as follows (truncated for brevity).
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
  name: ollama
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            ...
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            ...
          resources:
            limits:
              nvidia.com/gpu: "1"
          ...
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      ...
The first thing you’ll notice are the environment variables NVIDIA_DRIVER_CAPABILITIES and NVIDIA_VISIBLE_DEVICES. The first one specifies that the GPU should accelerate compute and utility. We could have specified that it should accelerate video as well but, since the Llama2 model does not work with video, there is no need for that one. We could set it to all if we’re too lazy. That’s the value used for NVIDIA_VISIBLE_DEVICES, which specifies which devices will be injected by the NVIDIA Container Runtime.
Those environment variables are not that important. What is important is that limits in resources is set to a maximum of 1 nvidia.com/gpu. That should be self-explanatory. I can’t afford more than one GPU, so that’s what I’m using.
Finally, we have tolerations which, in general, allow Pods to be scheduled on nodes with matching taints. In this case, we’re telling the scheduler that the Pods with the model tolerate nodes tainted with the key nvidia.com/gpu. That’s the same key that was set on the nodes of the GPU node pool we created earlier. Combined with the nvidia.com/gpu resource limit, which only GPU nodes can satisfy, it is clear to Kubernetes that the Pod with the Llama2 model should run only on specific nodes, those with GPUs.
Since those instructions were set in the Deployment’s Pod template, the end result is that we have a Pod running on the only node with a GPU, using that GPU to accelerate calculations.
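If you’d like to apply the same logic to your own application, the pattern boils down to requesting the nvidia.com/gpu resource and tolerating the GPU taint. Here’s a minimal sketch of such a Deployment; the name and image are hypothetical placeholders for whatever serves your model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      containers:
        - name: my-model
          image: ghcr.io/example/my-model # hypothetical image serving a model
          resources:
            limits:
              nvidia.com/gpu: "1" # one GPU (or one GPU partition)
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule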
Now we can use the local ollama client to communicate with the model running in the cluster. To do that, first we’ll define the environment variable OLLAMA_HOST to point to the Ingress endpoint,…
export OLLAMA_HOST="http://ollama.$INGRESS_HOST.nip.io"
…and list all available models.
ollama list
The output is as follows.
NAME ID SIZE MODIFIED
llama2:latest 78e26419b446 3.8 GB 13 minutes ago
There is only one model since that’s what we specified. The important part is that we proved that the model is running in the cluster and that we can access it.
Let’s use it by executing ollama run llama2 and passing it a question.
ollama run llama2 "How to run GPU in Kubernetes?"
The output is as follows (truncated for brevity).
Running a GPU (Graphics Processing Unit) in a Kubernetes cluster can be challenging due to the lack
of native support for GPUs in containers. However, there are several ways to run a GPU in a
Kubernetes cluster:
1. Using NVIDIA's GPU Pod Dispatcher: NVIDIA provides a tool called GPU Pod Dispatcher that allows
you to run GPU-intensive workloads on Kubernetes clusters. The GPU Pod Dispatcher manages the
allocation of GPU resources to containers and schedules them on the appropriate nodes in the cluster.
...
The first thing you’ll notice is that it was lightning fast. GPU accelerated the calculation and we got an answer almost instantly.
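By the way, the ollama CLI is not the only way to talk to the model. The same Ingress endpoint exposes Ollama's HTTP API, so a plain HTTP request should work as well. Here's a sketch using Ollama's documented /api/generate endpoint.
curl "http://ollama.$INGRESS_HOST.nip.io/api/generate" \
    --data '{"model": "llama2", "prompt": "How to run GPU in Kubernetes?", "stream": false}'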
Now, let’s say that we would like to run a second model. Normally, that would be something completely different but, for the sake of simplicity, we’ll apply the same one again but with a different name.
helm upgrade --install ollama2 ollama \
--repo https://otwld.github.io/ollama-helm \
--values ollama-values.yaml \
--namespace ollama --create-namespace
What do you think happened? Is the second model running?
Let’s check it out.
kubectl --namespace ollama get pods
The output is as follows (truncated for brevity).
NAME READY STATUS RESTARTS AGE
ollama-... 1/1 Running 0 18m
ollama2-... 0/1 Pending 0 2m51s
The second model is in the Pending state. Kubernetes cannot place it, and we can try to figure out why that is by describing that Pod.
kubectl --namespace ollama describe pod \
--selector app.kubernetes.io/instance=ollama2
The output is as follows (truncated for brevity).
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 4m1s cluster-autoscaler pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu
Warning FailedScheduling 4m (x2 over 4m1s) default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
We can see that there are no available GPUs. None of the nodes can accommodate that Pod. The explanation for that behavior is simple. We have only a single node with a single GPU. We tried to run two models, each requesting a single GPU. The first one got it, while the second is left wondering why there are no GPUs available for it.
Now, there are two solutions to this problem. The obvious one would be to create another node with a GPU so that it can be assigned to the second model. If our models were running at full capacity, that could be the solution. However, that’s not the case today. The first model is hardly used, and it would be a waste of money to get an additional GPU for a second model that also might not be used all the time. A better solution could be to have those two models share the same GPU. After all, GPUs are very expensive and we might not have an infinite number of them.
Fortunately, we can tell Kubernetes nodes to partition GPUs. How to accomplish that might differ from one provider to another. Today, we’ll explore how to do it in Google Cloud, and you should be able to adapt it to whichever provider you might be using.
We’ll start by destroying the node pool we created,…
gcloud container node-pools delete dot-gpu --project $PROJECT_ID \
--cluster dot --zone us-east1-b --quiet
…and create a new one, just as we did before but, this time, with an additional accelerator instruction to partition the GPU into 1g.5gb sizes.
gcloud container node-pools create dot-gpu \
--project $PROJECT_ID --cluster dot --zone us-east1-b \
--machine-type a2-highgpu-1g --num-nodes 1 \
--no-enable-autoupgrade \
--accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=default,gpu-partition-size=1g.5gb
The gpu-partition-size of 1g.5gb refers to a GPU instance with one compute unit, or 1/7th of the streaming multiprocessors on the GPU, and one memory unit of 5 GB.
Now, to be clear, there’s a bit of dark magic involved so you might not be able to set partition sizes without checking the documentation. Don’t try to use intuition since it won’t get you far. Consult the documentation. Don’t try to find logic. Just do what it says.
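Once the new node pool is up, we can double-check that the partitioning worked by looking at how many nvidia.com/gpu resources the node advertises. The command below is a sketch; with the 1g.5gb partition size it should print 7.
kubectl get node \
    --selector cloud.google.com/gke-nodepool=dot-gpu \
    --output jsonpath="{.items[0].status.allocatable.nvidia\.com/gpu}"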
Anyways… Now that we replaced the old node with a new one that partitions the GPU into up to 7 units, we can take another look at the Pods with the Ollama models.
kubectl --namespace ollama get pods
The output is as follows (truncated for brevity).
NAME READY STATUS RESTARTS AGE
ollama-... 1/1 Running 0 7m50s
ollama2-... 1/1 Running 0 12m
We can see that, this time, both Pods are Running. Even though each of them requested one GPU, the node itself now allows up to seven Pods to share that same GPU. Hence, both Pods are now running and I won’t go bankrupt.
That’s it. That’s all you should know to run models in containers inside a Kubernetes cluster. Nevertheless, before we part ways, I have a suggestion that might improve your resource utilization even more.
If the models are not used heavily and constantly, we might be able to optimize the setup even more by using Knative. Instead of having long-running Pods, with Knative we can have a system that scales unused Pods to zero or scales them up to whichever number is required to meet the demand. Some might call that serverless computing.
Now, even though Knative is typically used with “normal”, often stateless applications, I would argue that it could be a perfect match for running models that require GPUs. Please let me know in the comments if you’d like me to explore such a solution.
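For a rough idea of what that could look like, here's a minimal sketch of a Knative Service requesting a GPU. It assumes Knative Serving is already installed in the cluster and that the kubernetes.podspec-tolerations feature flag is enabled so that the toleration can be set; treat it as a starting point rather than a tested setup.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-serverless # hypothetical name
spec:
  template:
    spec:
      containers:
        - image: ollama/ollama # upstream Ollama image
          ports:
            - containerPort: 11434 # Ollama's default port
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations: # requires the kubernetes.podspec-tolerations feature flag
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule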
Another alternative could be to use KubeVirt to run models inside virtual machines managed by Kubernetes. I can explore that one as well. Just let me know in the comments.
Thank you for watching. See you in the next one. Cheers.
Destroy
chmod +x destroy.nu
./destroy.nu
exit