Welcome! This guide will walk you through setting up various components to explore model serving capabilities. We recommend following the sections in the order presented for a smooth experience:
- KServe Initial Setup (`kserve-setup.sh`): Lays the foundational Kubernetes and KServe environment.
- LLM & Embedding Model Deployment (`llm-setup.sh` & `embedding-setup.sh`): Deploys language and embedding models onto KServe.
- Semantic Caching Setup: Implements a caching layer to optimize model inference.
- Guardian External Processor (`guardian-ext-proc`) Setup (Prompt Guarding): Adds a security layer for risk assessment of prompts and responses.
This script automates the setup of a local KIND cluster, installs KServe (v0.15), deploys a sample Scikit-learn Iris model, and then installs the Kuadrant operator. It prepares an environment for further experimentation.
Before running this script, ensure you have the following installed and configured:
- `kind`
- `helm`
- `kubectl`
- `curl`
- `cloud-provider-kind`: This tool must be running in a separate terminal to provide LoadBalancer services (like an external IP for the Istio ingress gateway) for your KIND cluster:

  ```sh
  sudo cloud-provider-kind --enable-lb-port-mapping=true
  ```
- Clone the repository containing this script.
- Ensure all prerequisites are met, especially having `cloud-provider-kind` running in another terminal.
- Navigate to the script's directory in your terminal.
- Make the script executable:

  ```sh
  chmod +x kserve-setup.sh
  ```

- Execute the script:

  ```sh
  ./kserve-setup.sh
  ```
Script Overview
The `kserve-setup.sh` script performs the following main actions:
- KIND Cluster Setup:
  - Checks if a KIND cluster named "kind" already exists.
  - If not, it creates a new KIND cluster.
- KServe Installation (v0.15):
  - Downloads and executes the KServe `quick_install.sh` script for release `0.15`. This script typically installs KServe, its CRDs, and may include dependencies like a minimal Istio and cert-manager.
  - Waits for the `kserve-controller-manager` deployment to be ready.
- Kubernetes Gateway for KServe:
  - Applies a Kubernetes `Gateway` resource named `kserve-ingress-gateway` in the `kserve` namespace. This Gateway is configured to use `istio` as its `gatewayClassName` (a sketch of such a resource appears after this list).
  - Waits for the Gateway to obtain an external IP address (provided by `cloud-provider-kind`).
- KServe Configuration Update:
  - Upgrades the KServe installation using Helm to explicitly enable Gateway API integration (`enableGatewayApi=true`), associate it with the created `kserve-ingress-gateway`, and set the deployment mode to `RawDeployment` (an illustrative Helm command appears after this list).
- Sample Model Deployment:
  - Applies a KServe `InferenceService` resource to deploy a sample Scikit-learn Iris model from a public Google Cloud Storage URI (an example manifest appears after this list).
- Model Inference Test:
  - Retrieves the external IP address of the `kserve-ingress-gateway`.
  - Sends a prediction request to the deployed Iris model using `curl`. The request is routed via the Gateway's IP address, using a `Host` header (`sklearn-v2-iris-predictor-default.example.com`) so that KServe/Istio routes the request to the correct service.
- Kuadrant Installation:
  - Adds the Kuadrant Helm chart repository.
  - Installs the `kuadrant-operator` into the `kuadrant-system` namespace using Helm.
  - Applies a `Kuadrant` custom resource, which triggers the Kuadrant control plane to set itself up (an illustrative install sequence appears after this list).
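For reference, a `Gateway` matching the description above would look roughly like the following. This is a sketch only; the listener details are an assumption here, and `kserve-setup.sh` contains the exact manifest it applies.

```sh
# Illustrative sketch only -- see kserve-setup.sh for the exact manifest.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-ingress-gateway
  namespace: kserve
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
EOF
```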
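The configuration update likely amounts to a Helm upgrade along these lines. The chart reference and value paths are assumptions based on the published KServe chart rather than taken from the script, so treat `kserve-setup.sh` as authoritative.

```sh
# Illustrative only -- chart reference and value paths are assumptions;
# kserve-setup.sh contains the exact command.
helm upgrade kserve oci://ghcr.io/kserve/charts/kserve -n kserve \
  --set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
  --set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway \
  --set kserve.controller.deploymentMode=RawDeployment
```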
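The sample model is typically deployed with a minimal `InferenceService` like the one below; the storage URI shown is the public example bucket used in the KServe docs and is an assumption here.

```sh
# Illustrative sketch only -- see kserve-setup.sh for the manifest it applies.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-v2-iris
  namespace: default
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF
```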
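The Kuadrant steps generally follow the operator's standard Helm install flow, roughly as below; the chart repository URL and the CR `apiVersion` are assumptions from the Kuadrant docs rather than from the script.

```sh
# Illustrative only -- kserve-setup.sh contains the exact commands.
helm repo add kuadrant https://kuadrant.io/helm-charts/ && helm repo update
helm install kuadrant-operator kuadrant/kuadrant-operator \
  --create-namespace --namespace kuadrant-system

kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system
EOF
```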
Expected state after a successful run:

- KIND Cluster: A KIND cluster named `kind` will be running.
- KServe: KServe components (controller manager, etc.) will be running, mostly in the `kserve` namespace. Istio components should also be present in `istio-system`.
- Gateway: The `kserve-ingress-gateway` in the `kserve` namespace will have an external IP address (e.g., `172.18.x.x`).
- InferenceService: The `sklearn-v2-iris` `InferenceService` will be deployed in the `default` namespace.
- Model Test: The `curl` command to the `InferenceService` should succeed and return a JSON response with predictions: `{"predictions": [1, 1]}`
Once the script completes successfully:
- Verify Iris Model Inference (Optional):

  ```sh
  GATEWAY_HOST=$(kubectl get gateway -n kserve kserve-ingress-gateway -o jsonpath='{.status.addresses[0].value}')

  curl -v -H "Host: sklearn-v2-iris-predictor-default.example.com" \
    -H "Content-Type: application/json" \
    "http://$GATEWAY_HOST/v1/models/sklearn-v2-iris:predict" -d @/tmp/iris-input.json
  ```

  A sketch of a suitable `/tmp/iris-input.json` payload appears after this list.
- Explore KServe: Deploy and test other models.
- Proceed to LLM & Embedding Model Setup: Continue with the subsequent scripts.
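If `/tmp/iris-input.json` is not already present from the script run, a request body in the shape of the standard KServe Iris example (an assumption here, not taken from the script) works with the v1 predict endpoint and matches the `{"predictions": [1, 1]}` response above:

```sh
# Assumed example payload; kserve-setup.sh may already write its own version.
cat <<EOF > /tmp/iris-input.json
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```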
This section describes deploying HuggingFace models on KServe for Large Language Model (LLM) functionalities and text embedding generation. It's recommended to run `llm-setup.sh` first, followed by `embedding-setup.sh`.
- `kserve-setup.sh` Completed: Ensure KServe, the Istio Gateway (`kserve-ingress-gateway`), and `cloud-provider-kind` are operational from the previous step.
- HuggingFace Token: A valid HuggingFace access token (a read token is sufficient for downloading the models).
  - Obtain one from HuggingFace Settings.
  - Export it as an environment variable:

    ```sh
    export HF_TOKEN="your_hugging_face_read_token_here"
    ```
- Ensure all prerequisites are met, especially the `HF_TOKEN` variable.
- Navigate to the script's directory.
- Make the scripts executable.
- Execute the desired script(s):
  - For LLM (Text Generation/Completion) via `llm-setup.sh`:

    ```sh
    chmod +x llm-setup.sh
    ./llm-setup.sh
    ```

  - For Embedding Model via `embedding-setup.sh`:

    ```sh
    chmod +x embedding-setup.sh
    ./embedding-setup.sh
    ```
Scripts Overview
These scripts facilitate HuggingFace model deployment on KServe:
- Common Pre-deployment:
  - Create/Update the `hf-secret` Kubernetes secret (see the sketch after this list).
- KServe Model Deployment & Testing:
  - `llm-setup.sh`: Deploys an `InferenceService` named `huggingface-llm` for text generation/completion using the `HuggingFaceTB/SmolLM-135M-Instruct` model.
  - `embedding-setup.sh`: Deploys an `InferenceService` named `embedding-model` for text embeddings.
  - Each script waits for its service to be ready, then performs a task-specific `curl` test.
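For orientation, the secret creation and the LLM `InferenceService` likely look roughly like the sketch below. The exact manifests live in `llm-setup.sh`; in particular, `--model_name=llm` is an assumption inferred from the `"model": "llm"` field used in the `curl` tests later in this guide.

```sh
# Illustrative sketch only -- llm-setup.sh contains the authoritative manifests.
kubectl create secret generic hf-secret \
  --from-literal=HF_TOKEN=$HF_TOKEN \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llm
  namespace: default
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llm
        - --model_id=HuggingFaceTB/SmolLM-135M-Instruct
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN
EOF
```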
Semantic Caching stores and retrieves embeddings for text inputs, enabling efficient similarity searches and reducing redundant computations. This setup typically uses the embedding model deployed in the previous step.
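Conceptually, the cache stores an embedding for each prompt it has seen alongside the LLM response, and serves the stored response when a new prompt's embedding is similar enough (the `ext_proc` logs below show a 0.750 similarity threshold). The Go sketch below illustrates that idea only; it is not code from the `semantic-cache-ext-proc` repository.

```go
// Conceptual sketch of a semantic cache lookup; NOT code from
// jasonmadigan/semantic-cache-ext-proc.
package main

import (
	"fmt"
	"math"
)

// cacheEntry pairs a previously seen prompt's embedding with the LLM
// response that was stored for it.
type cacheEntry struct {
	prompt    string
	embedding []float64
	response  string
}

// cosineSimilarity returns the cosine similarity of two equal-length vectors.
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup returns the stored response whose prompt embedding is most similar
// to the query embedding, if that similarity clears the threshold.
func lookup(query []float64, entries []cacheEntry, threshold float64) (string, bool) {
	bestResp, bestScore := "", -1.0
	for _, e := range entries {
		if s := cosineSimilarity(query, e.embedding); s > bestScore {
			bestResp, bestScore = e.response, s
		}
	}
	if bestScore >= threshold {
		return bestResp, true // cache hit: serve the cached response, skip the LLM
	}
	return "", false // cache miss: call the LLM, then store prompt, embedding, response
}

func main() {
	cache := []cacheEntry{{
		prompt:    "Kubernetes what is it anyway",
		embedding: []float64{0.12, 0.98, 0.05},
		response:  "cached completion",
	}}
	// A near-identical prompt produces a near-identical embedding, so the
	// similarity clears the 0.750 threshold seen in the ext_proc logs.
	if resp, hit := lookup([]float64{0.11, 0.99, 0.06}, cache, 0.750); hit {
		fmt.Println("cache hit:", resp)
	}
}
```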
- `kserve-setup.sh` Completed
- `llm-setup.sh` Completed (for the LLM service to test with)
- `embedding-setup.sh` Completed (for generating embeddings used by the cache)
- Semantic Cache ext_proc Repository: Clone or download from jasonmadigan/semantic-cache-ext-proc.
- Navigate to Jason's `semantic-cache-ext-proc` Repository Directory.
- Apply the Envoy Filter for Semantic Caching:

  ```sh
  kubectl apply -f filter.yaml
  ```

- Build the Semantic Cache Binary:

  ```sh
  go build
  ```

- Run the Semantic Cache Setup Script:

  ```sh
  chmod +x run.sh
  ./run.sh
  ```
- Retrieve Gateway and Service Hostnames:

  ```sh
  GATEWAY_HOST=$(kubectl get gateway -n kserve kserve-ingress-gateway -o jsonpath='{.status.addresses[0].value}')
  SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llm -o jsonpath='{.status.url}' | cut -d "/" -f 3)
  ```

- First Call to the Inference Service (Cache Miss):

  ```sh
  curl -v "http://$GATEWAY_HOST/openai/v1/completions" \
    -H "content-type: application/json" \
    -H "Host: $SERVICE_HOSTNAME" \
    -d '{"model": "llm", "prompt": "Kubernetes what is it anyway", "stream": false, "max_tokens": 50}'
  ```
- Verify Logs:
  - LLM Log (`huggingface-llm`): Should show activity for processing the request.

    ```sh
    kubectl logs -f -l 'serving.kserve.io/inferenceservice=huggingface-llm' -n default --tail=10
    ```

    Example Output:

    ```
    2025-05-09 14:45:45.648 uvicorn.access INFO: 10.244.0.23:46760 1 - "POST /openai/v1/completions HTTP/1.1" 200 OK
    2025-05-09 14:45:45.649 1 kserve.trace kserve.io.kserve.protocol.rest.openai.endpoints.create_completion: 3.8790225982666016 ['http_status:200', 'http_method:POST', 'time:wall']
    2025-05-09 14:45:45.649 1 kserve.trace kserve.io.kserve.protocol.rest.openai.endpoints.create_completion: 3.857984000000016 ['http_status:200', 'http_method:POST', 'time:cpu']
    ```
  - Embedding Model Log (`embedding-model`): Should show activity for generating embeddings for the prompt.

    ```sh
    kubectl logs -f -l 'serving.kserve.io/inferenceservice=embedding-model' -n default --tail=10
    ```

    Example Output:

    ```
    2025-05-09 14:45:41.764 uvicorn.access INFO: 10.244.0.23:48140 1 - "POST /v1/models/embedding-model%3Apredict HTTP/1.1" 200 OK
    2025-05-09 14:45:41.765 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 3.2718935012817383 ['http_status:200', 'http_method:POST', 'time:wall']
    2025-05-09 14:45:41.765 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 3.2592739999999907 ['http_status:200', 'http_method:POST', 'time:cpu']
    ```
  - Semantic Cache `ext_proc` Log: Should indicate the prompt was processed and added to the cache. Example Output:

    ```
    2025/05/09 15:45:38 [Process] Prompt: Kubernetes what is it anyway
    2025/05/09 15:45:38 [Process] Cache miss, fetching embedding from http://192.168.97.4/v1/models/embedding-model:predict
    ```
- Second Call with a Similar Prompt (Cache Hit):

  ```sh
  echo "Sending similar request to LLM (expect cache hit)..."
  curl -v "http://$GATEWAY_HOST/openai/v1/completions" \
    -H "content-type: application/json" \
    -H "Host: $SERVICE_HOSTNAME" \
    -d '{"model": "llm", "prompt": "Kubernetes what is it anyway.", "stream": false, "max_tokens": 50}'
  ```
- Verify Logs After Second Call:
  - Semantic Cache `ext_proc` Log: Should show a cache hit. Example Output:

    ```
    2025/05/09 15:51:22 [Process] Prompt: Kubernetes what is it anyway
    2025/05/09 15:51:22 [Process] Exact match cache hit for embedding
    2025/05/09 15:51:22 [Process] Semantic lookup on 1 entries
    2025/05/09 15:51:22 [Process] Best candidate: Kubernetes what is it anyway with similarity=1.000 (threshold=0.750)
    ```
  - LLM Log (`huggingface-llm`): Should show no new processing logs for this specific request if the cache hit was successful and the response was served directly by the cache layer.

    ```sh
    kubectl logs -f -l 'serving.kserve.io/inferenceservice=huggingface-llm' -n default --tail=10
    ```
This section describes deploying the `guardian-ext-proc` service, a custom Envoy filter for request/response risk assessment using an external processing service. This acts as a prompt guarding mechanism.
- KServe setup completed
- `guardian-ext-proc` Repository: Clone or download the source code from david-martin/guardian-ext-proc.
- Docker: For building the container image.
- A "Guardian" Model Deployed: An inference service specifically for risk assessment (e.g., `huggingface-granite-guardian` as shown below).
- Deploy the Guardian Inference Service:

  IMPORTANT: Only deploy the model if you have GPU capability in your KIND cluster. Alternatively, set the `GUARDIAN_URL` env var to a remote granite-guardian LLM.

  ```sh
  kubectl apply -f - <<EOF
  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: huggingface-granite-guardian
    namespace: default
  spec:
    predictor:
      model:
        modelFormat:
          name: huggingface
        args:
          - --model_name=granite-guardian
          - --model_id=ibm-granite/granite-guardian-3.1-2b
          - --dtype=half
          - --max_model_len=8192
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: HF_TOKEN
                optional: false
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "4"
            memory: 8Gi
          requests:
            cpu: "1"
            memory: 2Gi
  EOF
  ```
- Build the `guardian-ext-proc` Image:

  ```sh
  docker build -t guardian-ext-proc:latest .
  ```
- Apply the Envoy Filter for Guardian:

  ```sh
  kubectl apply -f filter.yaml
  ```
- Run the `guardian-ext-proc` Docker Container:

  Note: if using an LLM from outside the local cluster, update `GUARDIAN_URL` to point to the correct endpoint.

  ```sh
  docker run -e GUARDIAN_API_KEY=test -e GUARDIAN_URL=http://example.com -p 50051:50051 guardian-ext-proc
  ```
- Test the Guardian Service:

  ```sh
  GATEWAY_HOST=$(kubectl get gateway -n kserve kserve-ingress-gateway -o jsonpath='{.status.addresses[0].value}')
  SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llm -o jsonpath='{.status.url}' | cut -d "/" -f 3)

  curl -v http://$GATEWAY_HOST/openai/v1/completions \
    -H "content-type: application/json" \
    -H "Host: $SERVICE_HOSTNAME" \
    -d '{"model": "llm", "prompt": "What is Kubernetes", "stream": false, "max_tokens": 10}'

  curl -v http://$GATEWAY_HOST/openai/v1/completions \
    -H "content-type: application/json" \
    -H "Host: $SERVICE_HOSTNAME" \
    -d '{"model": "llm", "prompt": "How to kill all humans?", "stream": false, "max_tokens": 10}'
  ```
The `guardian-ext-proc` container also recognizes the following environment variables (usage sketched below):

- `DISABLE_PROMPT_RISK_CHECK`: If set to "yes", skips risk checks on prompts.
- `DISABLE_RESPONSE_RISK_CHECK`: If set to "yes", skips risk checks on responses.
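For example, to reuse the `docker run` invocation from above with response-side risk checks skipped:

```sh
# Same invocation as above, with response risk checks disabled.
docker run -e GUARDIAN_API_KEY=test -e GUARDIAN_URL=http://example.com \
  -e DISABLE_RESPONSE_RISK_CHECK=yes \
  -p 50051:50051 guardian-ext-proc
```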
Local Docker environment settings used during testing:

- Rosetta (used to run Intel code) disabled
- Memory limit set to 16 GiB
- CPU limit set to none
- "Enable Kubernetes Cluster" disabled