Red Hat AI Inference Server - RHAIIS and GuideLLM on OpenShift

This repository guides you through deploying Red Hat AI Inference Server (RHAIIS) and GuideLLM on OpenShift.

Before deploying the components, we will first create two persistent volumes: one for the model weights used by RHAIIS, and another to store the tokenizer used by GuideLLM and the Whisper Benchmark image.

These instructions use a model located in local-models/llama, downloaded from RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16.
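
If you have not downloaded the weights yet, a minimal sketch using the Hugging Face CLI is shown below; it assumes the huggingface_hub package and a machine with internet access, so run it before moving into the disconnected environment:

pip install -U huggingface_hub
# Download the quantized model into the local path used by these instructions
huggingface-cli download RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
--local-dir ./local-models/llama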

Set up a Persistent Volume for model weights

This persistent volume will store the model weights. Follow these instructions to create the PVC and copy the model weights from your local machine:

# Create persistent volume claim
oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-31-8b-instruct-w4a16
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
EOF
# Create temporary pod to copy model
oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: model-copy-pod
spec:
  containers:
  - name: model-copy
    image: registry.access.redhat.com/ubi9/ubi:latest
    command: ["sleep", "3600"]
    volumeMounts:
    - name: models
      mountPath: /mnt/models
  volumes:
  - name: models
    persistentVolumeClaim:
      claimName: llama-31-8b-instruct-w4a16
EOF
oc wait --for=condition=Ready pod/model-copy-pod --timeout=120s
# Copy model from local machine
oc rsync ./local-models/llama/ model-copy-pod:/mnt/models/
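# (Optional) Verify the weights landed on the volume before cleaning up;
# the exact listing depends on your local model directory
oc exec model-copy-pod -- ls -lh /mnt/models/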
# Clean up
oc delete pod model-copy-pod

Tokenizer PVC

This persistent volume will store the tokenizer used by GuideLLM. Follow these instructions to create the PVC and copy the tokenizer files from your local machine:

# Create persistent volume claim
oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-31-8b-instruct-w4a16-tokenizer
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
# Create temporary pod to copy tokenizer
oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: tokenizer-copy-pod
spec:
  containers:
  - name: tokenizer-copy
    image: registry.access.redhat.com/ubi9/ubi:latest
    command: ["sleep", "3600"]
    volumeMounts:
    - name: tokenizer
      mountPath: /mnt/tokenizer
  volumes:
  - name: tokenizer
    persistentVolumeClaim:
      claimName: llama-31-8b-instruct-w4a16-tokenizer
EOF
oc wait --for=condition=Ready pod/tokenizer-copy-pod --timeout=120s
# Copy tokenizer files from local machine
oc cp ./local-models/llama/tokenizer.json tokenizer-copy-pod:/mnt/tokenizer/
oc cp ./local-models/llama/config.json tokenizer-copy-pod:/mnt/tokenizer/
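# (Optional) Verify both tokenizer files are on the volume before cleaning up
oc exec tokenizer-copy-pod -- ls -lh /mnt/tokenizer/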
# Clean up
oc delete pod tokenizer-copy-pod

Check Resource Limits

Before deploying RHAIIS, check if there are any ResourceQuotas or LimitRanges in your namespace that might restrict resource allocation for the model serving pod:

# Check ResourceQuotas
oc get resourcequotas -o yaml

# Check LimitRanges  
oc get limitranges -o yaml

# Check current resource usage
oc describe resourcequotas
oc describe limitranges

If you see restrictive limits on CPU, memory, or GPU resources, you may need to:

  • Request quota increases from your cluster administrator
  • Modify the limits in your namespace
  • Override resource requests/limits in your Helm values (see the sketch below)
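
If you take the Helm route described below, a hypothetical override at install time might look like the following; resources.gpu is a documented chart value, while the request keys are assumptions to check against the chart's values.yaml:

# Hypothetical sketch: override resource settings at install time
# (the resources.requests.* keys are assumed; check the chart's values.yaml)
helm install llama-31-8b-instruct-w4a16 ./rhaiis/helm \
--set model.pvcName=llama-31-8b-instruct-w4a16 \
--set resources.gpu=1 \
--set resources.requests.memory=32Gi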

Deploy RHAIIS

There are two options to deploy RHAIIS: apply the objects in the rhaiis/openshift folder, or use the Helm chart. The Helm chart option is more flexible because values such as the model and PVC names can be changed.

To deploy RHAIIS configured for the RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 model, run:

oc apply -f rhaiis/openshift

To deploy using the Helm chart, run:

helm install llama-31-8b-instruct-w4a16 ./rhaiis/helm \
--set model.pvcName=llama-31-8b-instruct-w4a16 
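
Once the pod is running, a quick smoke test of the OpenAI-compatible endpoint could look like this sketch; it assumes the service name and port that the GuideLLM example below targets:

# Query the model list endpoint from inside the cluster
# (assumes service llama-31-8b-instruct-w4a16-rhaiis on port 8000)
oc run smoke-test --rm -it --restart=Never \
--image=registry.access.redhat.com/ubi9/ubi:latest -- \
curl -s http://llama-31-8b-instruct-w4a16-rhaiis:8000/v1/models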

Any of the Helm chart values can be overridden, for example:

  • model.pvcName
  • model.servedModelName
  • resources.gpu

Configuring these values allows serving different models, with different GPU requirements and model weight persistent volumes.
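
For example, a hypothetical install serving a different model from its own PVC (the release, PVC, and model names here are placeholders) might look like:

# Hypothetical sketch: serve another model on two GPUs
helm install my-model ./rhaiis/helm \
--set model.pvcName=my-model-weights \
--set model.servedModelName=my-model \
--set resources.gpu=2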

Uninstall RHAIIS

To uninstall the plain OpenShift deployment, run:

oc delete -f rhaiis/openshift

To uninstall using Helm, run:

helm uninstall llama-31-8b-instruct-w4a16 

Deploy GuideLLM

There are two options to deploy GuideLLM: apply the objects in the guidellm/openshift folder, or use the Helm chart. The Helm chart option is more flexible because values such as the model and PVC names can be changed.

To deploy using OpenShift objects run:

oc apply -f guidellm/openshift

To deploy using the Helm chart, run:

helm install guidellm ./guidellm/helm \
--set benchmark.target=http://llama-31-8b-instruct-w4a16-rhaiis:8000 \
--set benchmark.model=meta-llama-3.1-8B-instruct-quantized.w4a16 \
--set storage.tokenizerPvcName=llama-31-8b-instruct-w4a16-tokenizer 

Any of the Helm chart values can be overridden, for example:

  • benchmark.maxRequests
  • benchmark.rateType

Configuring these values allows benchmarking different models and adjusting the benchmark workload.
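
For example, a hypothetical shorter run at a fixed rate might look like the following; the benchmark.maxRequests and benchmark.rateType values are illustrative:

# Hypothetical sketch: cap the run at 100 requests with a constant rate type
helm install guidellm ./guidellm/helm \
--set benchmark.target=http://llama-31-8b-instruct-w4a16-rhaiis:8000 \
--set benchmark.model=meta-llama-3.1-8B-instruct-quantized.w4a16 \
--set storage.tokenizerPvcName=llama-31-8b-instruct-w4a16-tokenizer \
--set benchmark.maxRequests=100 \
--set benchmark.rateType=constant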

To uninstall the plain OpenShift deployment, run:

oc delete -f guidellm/openshift

To uninstall using Helm, run:

helm uninstall guidellm 

Deploy Whisper Benchmark

There are two options to deploy the Whisper benchmark: apply the objects in the whisper-benchmark/openshift folder, or use the Helm chart. The Helm chart option is more flexible because configuration values can be changed.

To deploy using OpenShift objects run:

oc apply -f whisper-benchmark/openshift

To deploy using the Helm chart, run:

helm install whisper-benchmark ./whisper-benchmark/helm \
--set benchmark.openaiApiBaseUrl=http://whisper-large-v2-rhaiis:8000 \
--set benchmark.targetModelName=openai/whisper-large-v2 \
--set storage.modelPvcName=whisper-tokenizer

Any of the Helm chart values can be overridden, for example (see the sketch after this list):

  • benchmark.concurrentRequests
  • benchmark.language
  • resources.requests.memory
  • resources.limits.memory
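
A hypothetical install overriding these values might look like (the concurrency and memory figures are illustrative):

# Hypothetical sketch: raise concurrency and the memory limit
helm install whisper-benchmark ./whisper-benchmark/helm \
--set benchmark.openaiApiBaseUrl=http://whisper-large-v2-rhaiis:8000 \
--set benchmark.targetModelName=openai/whisper-large-v2 \
--set storage.modelPvcName=whisper-tokenizer \
--set benchmark.concurrentRequests=8 \
--set resources.limits.memory=8Gi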

To uninstall the plain OpenShift deployment, run:

oc delete -f whisper-benchmark/openshift

To uninstall using Helm, run:

helm uninstall whisper-benchmark 

Results

Once the GuideLLM tests have completed, you will see results like the following:

================================================================================
Benchmark      | Per Second | Concurrency | Out Tok/sec | Tot Tok/sec | Req Latency (sec)    | TTFT (ms)              | ITL (ms)
               |            |             | mean        | mean        | mean | median | p99  | mean  | median | p99   | mean | median | p99
---------------|------------|-------------|-------------|-------------|------|--------|------|-------|--------|-------|------|--------|------
synchronous    | 0.85       | 1.00        | 108.4       | 325.4       | 1.18 | 1.18   | 1.18 | 29.7  | 29.4   | 37.0  | 9.1  | 9.1    | 9.1
throughput     | 19.31      | 92.45       | 2495.8      | 7489.3      | 4.74 | 4.73   | 4.91 | 988.2 | 950.5  | 1879.2| 29.5 | 29.2   | 37.2
constant@3.15  | 3.13       | 4.04        | 400.4       | 1201.5      | 1.29 | 1.29   | 1.36 | 37.2  | 35.7   | 95.3  | 9.9  | 9.9    | 10.1
constant@5.46  | 5.40       | 7.54        | 690.5       | 2072.0      | 1.40 | 1.39   | 1.64 | 45.3  | 34.2   | 235.4 | 10.6 | 10.7   | 11.0
constant@7.77  | 7.56       | 11.41       | 967.4       | 2902.8      | 1.51 | 1.50   | 1.76 | 45.3  | 36.1   | 197.5 | 11.5 | 11.5   | 12.5
constant@10.08 | 9.76       | 17.89       | 1249.7      | 3749.8      | 1.83 | 1.84   | 2.13 | 61.0  | 39.4   | 260.1 | 13.9 | 14.1   | 15.5
constant@12.38 | 11.79      | 23.21       | 1508.7      | 4526.9      | 1.97 | 1.97   | 2.33 | 62.0  | 40.9   | 275.2 | 15.0 | 15.2   | 17.4
constant@14.69 | 13.85      | 31.94       | 1772.6      | 5319.2      | 2.31 | 2.37   | 2.77 | 86.3  | 46.2   | 342.8 | 17.5 | 18.3   | 20.8
constant@17.00 | 15.41      | 40.82       | 1972.2      | 5917.7      | 2.65 | 2.80   | 3.37 | 119.8 | 51.4   | 490.9 | 19.9 | 21.6   | 24.5
constant@19.31 | 17.01      | 55.03       | 2177.3      | 6533.4      | 3.23 | 3.24   | 4.26 | 166.4 | 61.5   | 569.8 | 24.2 | 25.1   | 31.2
================================================================================
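
The results are printed by the benchmark pod, so one way to retrieve them is from its logs; the pod selection below is a sketch, since the actual pod name depends on the chart:

# Sketch: stream output from the first pod whose name contains "guidellm"
oc logs -f "$(oc get pods -o name | grep guidellm | head -1)"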

Once the Whisper benchmark completes, you will see results such as:

WER: 100.0
Total Requests: 479
Mean TTFT: 142.9556276326482
95th Percentile TTFT: 309.7570 ms
Mean ITL: 142.9556276326482
95th Percentile ITL: 309.7570 ms
Average Latency: 0.1430 seconds
Estimated req_Throughput: 216.48 requests/s
Estimated Throughput: 2164.80 tok/s
Benchmark finished. Final WER: 100.00%
