- Overview
- Why This Project?
- Key Features
- Setup Guide
- CUDA Time Slicing vs. MPS
- Teardown Guide
- Future Enhancements
Kubernetes has become the de facto standard for managing all types of workloads, providing scalability, automation, and efficient resource management.
AI workloads perform significantly better on GPUs compared to CPUs, as GPUs are optimized for parallel processing, which is essential for deep learning and inference tasks.
To effectively utilize GPUs in Kubernetes, we need to perform three key tasks:
- 1️⃣ Provision GPU nodes: Create nodes or node groups with GPU support in our K8s cluster.
- 2️⃣ Enable GPU access: Install device plugins that allow pods to use specialized hardware features like GPUs.
- 3️⃣ Configure GPU usage in pods: Ensure that workloads explicitly request and leverage GPU resources.
This project demonstrates cost-effective ways to run GPU workloads, using AWS in this case, but these methods can be applied to any cloud provider. By leveraging the GPU-sharing features of the NVIDIA device plugin for Kubernetes, we can efficiently share GPU resources across multiple workloads while minimising expenses.
High-performance GPUs like NVIDIA A100 or H100 can be prohibitively expensive when running AI workloads. This project shows how to:
- ✅ Share a single GPU between multiple AI models with CUDA (Compute Unified Device Architecture).
- ✅ Compare CUDA MPS and CUDA Time Slicing for shared GPU usage.
- ✅ Use Spot Instances to save up to 90% on GPU costs.
- Deploy GPU workloads on AWS EKS using Spot Instances.
- Use NVIDIA CUDA to run multiple AI models on a single GPU dynamically.
- Monitor GPU usage in real time with `nvidia-smi`.
- Compare CUDA MPS vs. Time Slicing and observe performance differences.
Figure 1: Two Ollama AI models conversing with each other on the same GPU.
- 🔗 See the zellij config for the dashboard used in this console.
Before deploying the cluster, set the following environment variables:
export AWS_ACCOUNT_ID="<aws-account-id>"
export SUBNET_IDS="subnet-0bcd6d51,subnet-59f5923f"
export SECURITY_GROUP_IDS="sg-0643b1246dd531666"
export AWS_REGION="eu-west-1"
- SUBNET_IDS: Specifies the AWS subnets where the EKS cluster will be deployed. Ensure that these subnets are in a VPC with internet connectivity or the necessary private network access.
- SECURITY_GROUP_IDS: Defines the security groups that control inbound and outbound traffic to the EKS cluster. These should allow necessary Kubernetes communication and node access.
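If these IDs are not to hand, they can be looked up with the AWS CLI; the `<vpc-id>` below is a placeholder for your own VPC:

```bash
# List subnet IDs and security groups in the target VPC.
aws ec2 describe-subnets --filters "Name=vpc-id,Values=<vpc-id>" \
  --query "Subnets[].SubnetId" --output text --region ${AWS_REGION}
aws ec2 describe-security-groups --filters "Name=vpc-id,Values=<vpc-id>" \
  --query "SecurityGroups[].[GroupId,GroupName]" --output table --region ${AWS_REGION}
```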
Create the EKS cluster that will host GPU workloads:
aws eks create-cluster --name ollama-cluster \
--role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/EKSClusterRole \
--resources-vpc-config subnetIds=${SUBNET_IDS},securityGroupIds=${SECURITY_GROUP_IDS} \
--kubernetes-version 1.32 \
--region ${AWS_REGION}
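Cluster creation takes several minutes; one way to wait for it and point `kubectl` at the new cluster:

```bash
# Block until the control plane is ACTIVE, then update the local kubeconfig.
aws eks wait cluster-active --name ollama-cluster --region ${AWS_REGION}
aws eks update-kubeconfig --name ollama-cluster --region ${AWS_REGION}
```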
aws eks create-nodegroup \
--cluster-name ollama-cluster \
--nodegroup-name cpu-system-nodes \
--capacity-type ON_DEMAND \
--instance-types t3.medium \
--ami-type AL2_x86_64 \
--scaling-config minSize=1,maxSize=3,desiredSize=1 \
--node-role arn:aws:iam::${AWS_ACCOUNT_ID}:role/EKSNodeRole \
--subnets ${SUBNET_IDS//,/ } \
--region ${AWS_REGION} \
--labels node-type=cpu,system=true
- `--labels node-type=cpu,system=true`: Labels are metadata tags that help Kubernetes schedule workloads effectively. In this case, `node-type=cpu` identifies the node group as CPU-based, and `system=true` marks nodes dedicated to system-level workloads, such as cluster add-ons and background services.
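To confirm the CPU node group has joined the cluster with the expected labels:

```bash
# Wait for the managed node group to become ACTIVE, then show the label columns.
aws eks wait nodegroup-active --cluster-name ollama-cluster \
  --nodegroup-name cpu-system-nodes --region ${AWS_REGION}
kubectl get nodes -L node-type,system
```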
aws eks create-nodegroup \
--cluster-name ollama-cluster \
--nodegroup-name gpu-spot-nodes \
--capacity-type SPOT \
--instance-types g4dn.xlarge \
--ami-type AL2_x86_64_GPU \
--scaling-config minSize=0,maxSize=1,desiredSize=1 \
--node-role arn:aws:iam::${AWS_ACCOUNT_ID}:role/EKSNodeRole \
--subnets ${SUBNET_IDS//,/ } \
--region ${AWS_REGION} \
--labels node-type=gpu \
--taints key=nvidia.com/gpu,value=present,effect=NO_SCHEDULE
- Spot Instances (`--capacity-type SPOT`): AWS Spot Instances allow you to run workloads at a significantly reduced cost compared to on-demand pricing. However, they can be interrupted if AWS reclaims the capacity.
- GPU Instance Type (`--instance-types g4dn.xlarge`): The `g4dn.xlarge` instance provides a single NVIDIA T4 GPU, making it cost-effective for AI inference and smaller training workloads.
- AMI Type (`--ami-type AL2_x86_64_GPU`): Specifies an Amazon Linux 2 AMI that comes with NVIDIA drivers pre-installed.
- Labels (`--labels node-type=gpu`): Helps Kubernetes identify that this node group is GPU-based, so it can be scheduled appropriately.
- Taints (`--taints key=nvidia.com/gpu,value=present,effect=NO_SCHEDULE`): Ensures that only workloads requesting GPUs (and tolerating this taint) are scheduled on these nodes.
✅ Spot Instances reduce GPU costs significantly but may not be suitable for workloads requiring guaranteed availability.
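Once the GPU node group is active, the label and taint can be verified directly from the node object:

```bash
# Wait for the GPU node group, then list GPU nodes with their taints.
aws eks wait nodegroup-active --cluster-name ollama-cluster \
  --nodegroup-name gpu-spot-nodes --region ${AWS_REGION}
kubectl get nodes -l node-type=gpu \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```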
To enable GPU access in Kubernetes, install the NVIDIA K8s Device Plugin.
- The NVIDIA GPU device plugin allows Kubernetes to detect and allocate GPUs to workloads.
- Without this plugin, Kubernetes won’t recognize GPUs, even if a node has an NVIDIA GPU.
- Required for both GPU sharing features.
- There are two mutually exclusive modes of GPU sharing: Time-Slicing and Multi-Process Service (MPS).
CUDA Time Slicing allows multiple workloads to share a single GPU by allocating usage time slots.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version 0.17.0 \
--set gfd.enabled=true \
--values cuda/cuda-time-slicing-values.yaml
The configuration file cuda/cuda-time-slicing-values.yaml enables GPU sharing by defining how many workloads can run simultaneously on the same GPU.
config:
  map:
    default: |-
      {
        "version": "v1",
        "sharing": {
          "timeSlicing": {
            "resources": [
              {
                "name": "nvidia.com/gpu",
                "replicas": 4
              }
            ]
          }
        }
      }
  default: "default"
Explanation
"name": "nvidia.com/gpu"
→ Defines the GPU resource type recognized by Kubernetes."replicas": 4
→ Allows up to 4 workloads to share the same physical GPU by assigning time slices."timeSlicing"
→ Enables GPU time-sharing rather than exclusive access per workload.
✅ Use Case: Suitable for large independent workloads that don't require concurrent GPU execution.
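After the plugin pods restart with this configuration, the GPU node should advertise four allocatable `nvidia.com/gpu` resources (one per time slice). A quick way to confirm:

```bash
# Capacity/Allocatable should now show nvidia.com/gpu: 4 on the GPU node.
kubectl describe nodes -l node-type=gpu | grep "nvidia.com/gpu"
```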
CUDA MPS allows multiple workloads to share a GPU concurrently, optimizing memory and compute utilization.
Before enabling MPS, access a GPU node using AWS Systems Manager (SSM):
aws ssm start-session --target $(aws ec2 describe-instances --region ${AWS_REGION} \
  --filters "Name=instance-type,Values=g4dn.xlarge" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].InstanceId" --output text) --region ${AWS_REGION}
Verify the current GPU compute mode:
nvidia-smi -q | grep "Compute Mode"
If it returns `Default`, switch to `Exclusive Process` mode for MPS:
By default, most GPUs operate in Default compute mode, which allows multiple processes to use the GPU, but only by context-switching between them rather than running truly concurrently. Exclusive Process mode ensures that only one process can own the GPU directly; under MPS that process is the MPS daemon, through which all CUDA applications connect so they can share the GPU efficiently.
- Why is this needed?
- Default Mode does not allow efficient GPU sharing under MPS.
- Exclusive Process Mode enables multiple processes to share GPU resources dynamically without blocking each other.
- MPS reduces context switching overhead and improves overall performance when multiple workloads run simultaneously.
sudo nvidia-smi -c EXCLUSIVE_PROCESS
The MPS Daemon (Multi-Process Service Daemon) is a background process that enables multiple CUDA applications to share a GPU concurrently. It helps optimize GPU utilization by allowing multiple workloads to execute in parallel instead of time-slicing between them.
Why is this needed?
- Without the MPS daemon, CUDA workloads execute sequentially when running on a shared GPU.
- MPS enables lower-latency, parallel execution of multiple workloads, improving GPU efficiency.
- It allows AI models and inference tasks to share GPU memory and compute resources dynamically.
sudo nvidia-cuda-mps-control -d
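To confirm the MPS control daemon is up, two quick checks (assuming it was started as root, as above):

```bash
# The MPS control daemon should appear in the process list.
ps -ef | grep nvidia-cuda-mps | grep -v grep
# Query the control interface; it lists the PIDs of any running MPS servers.
echo get_server_list | sudo nvidia-cuda-mps-control
```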
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version 0.17.0 \
--set gfd.enabled=true \
--values cuda/cuda-mps-values.yaml
The configuration file cuda/cuda-mps-values.yaml enables multi-process service, allowing concurrent execution of multiple workloads on a single GPU.
config:
  map:
    default: |-
      {
        "version": "v1",
        "sharing": {
          "mps": {
            "resources": [
              {
                "name": "nvidia.com/gpu",
                "replicas": 4
              }
            ]
          }
        }
      }
  default: "default"
Explanation
"mps"
→ Enables CUDA Multi-Process Service (MPS)."replicas": 4
→ Allows four workloads to run concurrently on the same GPU.- More efficient memory utilization compared to time-slicing.
✅ Use Case: Ideal for AI inference and workloads that benefit from concurrent execution.
In Kubernetes, pods request GPUs using resource requests and limits in their configuration. The NVIDIA device plugin registers GPUs as extended resources, which allows pods to specify GPU requirements explicitly.
- Resource Requests (`requests`): Defines the minimum GPU resources a pod requires. Kubernetes guarantees this allocation.
- Resource Limits (`limits`): Specifies the maximum GPU resources a pod can consume.
- Runtime Class (`runtimeClassName`): Ensures the container runtime supports GPU acceleration.
When a pod requests a GPU, Kubernetes schedules it onto a node with an available GPU, based on the NVIDIA device plugin's registration.
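As a minimal sketch of this pattern (the pod name and container image below are illustrative, and the toleration matches the taint applied to the GPU node group earlier), a one-off pod requesting a single GPU could look like this:

```bash
# Hypothetical one-shot pod that requests one GPU slice and prints nvidia-smi output.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
EOF
```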
Configuration Example: GPU Requests in Ollama
The configuration file olama/ollama-1-values.yaml:
ollama:
runtimeClassName: "nvidia" # Ensures GPU-enabled runtime
gpu:
enabled: true # Enables GPU usage
type: 'nvidia' # Specifies NVIDIA GPU type
number: 1 # Required by chart, set to same value as resources.limits.nvidia.com/gpu below
resources:
limits:
nvidia.com/gpu: 1 # Maximum GPU resources assigned
requests:
nvidia.com/gpu: 1 # Minimum GPU resources required
Install two Ollama AI instances that will run on the GPU nodes:
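The commands below assume the `ollama-helm` chart repository has already been registered; if not, it can be added first (the URL is the community otwld/ollama-helm repository, which is an assumption about the chart being used here):

```bash
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
```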
helm upgrade -i ollama-1 ollama-helm/ollama --namespace ollama --create-namespace --values olama/ollama-1-values.yaml
helm upgrade -i ollama-2 ollama-helm/ollama --namespace ollama --create-namespace --values olama/ollama-2-values.yaml
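Once both releases are installed, it is worth confirming that the two pods landed on the GPU node and that each was allocated a GPU resource:

```bash
# Both Ollama pods should be Running and scheduled on the GPU node.
kubectl get pods -n ollama -o wide
# The GPU node's allocated nvidia.com/gpu count should match the number of Ollama pods.
kubectl describe nodes -l node-type=gpu | grep -A8 "Allocated resources"
```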
runtimeClassName: "nvidia"
→ Ensures the container runs in an NVIDIA GPU-enabled runtime.gpu.enabled: true
→ Enables GPU acceleration for the container.resources.requests.nvidia.com/gpu: 1
→ Ensures at least one GPU is allocated for the pod.resources.limits.nvidia.com/gpu: 1
→ Ensures the pod cannot exceed one GPU.
✅ This ensures each Ollama AI instance is scheduled on a node with an available GPU, utilizing Kubernetes’ GPU scheduling features.
Figure 2: Two Ollama AI models running on the same GPU, with the necessary NVIDIA device plugin and system pods
The following animated GIFs illustrate the differences between Time Slicing and MPS when running two Ollama models alongside `nvidia-smi`.
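The GPU view in these recordings can be reproduced on the GPU node (connected via SSM as above); the commands below are one straightforward way to do it, not necessarily the exact setup used in the recordings:

```bash
# Refresh nvidia-smi every second to watch utilization and per-process memory.
watch -n 1 nvidia-smi
# Alternatively, stream utilization and memory metrics in compact tabular form.
nvidia-smi dmon -s um
```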
- Time Slicing assigns the entire GPU to a single process at a time, leading to full GPU utilization per workload. However, it introduces latency when switching between workloads and does not efficiently share memory.
- MPS (Multi-Process Service) allows multiple workloads to share GPU resources concurrently, leading to better memory utilization and more stable GPU usage but at the cost of slower individual responses since workloads must share compute power.
Below is a detailed comparison of the two approaches based on real test results:
| Feature | Time Slicing 🚀 | MPS ⚡ |
|---|---|---|
| Process Execution | Alternates workloads (one at a time) | Runs multiple workloads in parallel |
| GPU Utilization | Spikes to 90%+ when a chatbot is active | Fluctuates between 15% and 40% |
| Total Utilization | Near 100%, but fluctuates | Stable at lower utilization (~40%) |
| Latency (Response Time) | Faster 🚀 (full GPU per chatbot) | Slower 🐢 (shared GPU resources) |
| Best For | Low-latency, burst workloads (single-task inference) | Running multiple AI tasks together |
| Memory Sharing | ❌ No (each process gets its own memory) | ✅ Yes (evenly distributed) |
| Memory Efficiency | 🟡 Medium (unused memory stays allocated) | 🟢 High (memory dynamically shared) |
| Total Memory Usage | Chatbot 1: 8GB+, Chatbot 2: 5GB+ (14GB used) | Each AI gets ~3.2GB, well-distributed |
| Risk of Starvation | ❌ No (full access when running) | |

✅ Time Slicing is ideal for workloads that require low-latency responses, but it can be inefficient with memory and GPU allocation.
✅ MPS provides better overall efficiency, but individual workloads receive less GPU power, leading to slower responses.
✅ If your goal is to run two or more Ollama models efficiently, MPS is the better choice as it ensures steady GPU utilisation and maximises parallel execution.
🔗 See cleanup.md for the full teardown steps.
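As a rough sketch of the teardown order (cleanup.md remains the authoritative reference; the resource names match those used above):

```bash
# Remove the workloads and the device plugin first.
helm uninstall ollama-1 ollama-2 --namespace ollama
helm uninstall nvidia-device-plugin --namespace nvidia-device-plugin
# Delete both node groups, wait for them to disappear, then delete the cluster.
aws eks delete-nodegroup --cluster-name ollama-cluster --nodegroup-name gpu-spot-nodes --region ${AWS_REGION}
aws eks delete-nodegroup --cluster-name ollama-cluster --nodegroup-name cpu-system-nodes --region ${AWS_REGION}
aws eks wait nodegroup-deleted --cluster-name ollama-cluster --nodegroup-name gpu-spot-nodes --region ${AWS_REGION}
aws eks wait nodegroup-deleted --cluster-name ollama-cluster --nodegroup-name cpu-system-nodes --region ${AWS_REGION}
aws eks delete-cluster --name ollama-cluster --region ${AWS_REGION}
```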
- ✅ Use Spot GPU nodes.
- 🔜 Introduce MIG on A100 GPUs for strict isolation.
  - What is MIG (Multi-Instance GPU)? MIG is a hardware-based GPU partitioning feature available on NVIDIA A100 and newer GPUs. It allows a single GPU to be split into multiple isolated instances, each behaving like a separate GPU.
  - CUDA vs. MIG: CUDA-based solutions like MPS and Time Slicing enable software-level sharing, whereas MIG provides true hardware-level isolation.
  - Availability: MIG is not available on smaller AWS GPUs like the T4 (used in `g4dn.xlarge` instances). It is available on A100 and newer GPUs (AWS `p4` and `p5` instances), making it suitable for multi-tenant GPU workloads with guaranteed performance.
- 🔜 Enhance auto-scaling with Knative.
  - GPU node auto-scaling to zero: Knative allows on-demand scaling, meaning GPU nodes can be scaled down to zero when not in use.
  - Potential cost savings: Since GPU instances are expensive, shutting them down when idle can reduce costs significantly, especially for workloads with sporadic GPU usage, such as inference services.
- 🔜 Compare additional GPU-sharing strategies.
- 🔜 Integrate cost monitoring tools.