Commit 3cfd28c

fix: NIM Pattern Enhancement 20240716 (#588)
1 parent c9b2208 commit 3cfd28c

File tree

3 files changed

+47
-16
lines changed


ai-ml/nvidia-triton-server/nvidia-nim.tf

Lines changed: 2 additions & 1 deletion
@@ -124,6 +124,7 @@ resource "helm_release" "nim_llm" {
   ]
 
   depends_on = [
-    null_resource.download_nim_deploy
+    null_resource.download_nim_deploy,
+    module.eks_blueprints_addons.ingress_nginx
   ]
 }

gen-ai/inference/nvidia-nim/nim-client/client.py

Lines changed: 4 additions & 1 deletion
@@ -55,7 +55,10 @@ async def main(FLAGS):
         "top_k": 20,
         "max_tokens": 512,
     }
-    client = openai.AsyncOpenAI(base_url=FLAGS.url)
+    client = openai.AsyncOpenAI(
+        base_url=FLAGS.url,
+        api_key="not_used_for_self_host",  # To avoid report OPENAI_API_KEY missing
+    )
     with open(FLAGS.input_prompts, "r") as file:
         print(f"Loading inputs from `{FLAGS.input_prompts}`...")
         prompts = file.readlines()
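The client.py change above works because the OpenAI Python SDK refuses to construct a client when no API key is available (it reports `OPENAI_API_KEY` missing), even though a self-hosted NIM endpoint never validates the key. As a stdlib-only sketch of the OpenAI-compatible request such a client ends up sending — the `build_chat_request` helper is hypothetical and only for illustration, not part of the commit:

```python
import json

# Placeholder key mirroring the commit: the self-hosted endpoint ignores the
# Authorization header, but the OpenAI SDK requires *some* api_key value.
PLACEHOLDER_KEY = "not_used_for_self_host"

def build_chat_request(base_url, model, prompt, api_key=PLACEHOLDER_KEY):
    """Assemble an OpenAI-compatible /chat/completions request.

    Hypothetical helper for illustration; the real client delegates all of
    this to openai.AsyncOpenAI.
    """
    url = f"{base_url.rstrip('/')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # never checked by a self-hosted NIM
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return url, headers, json.dumps(payload)

# Example against the in-cluster service name used later in this commit's docs
# (the /v1 suffix is an assumption about the serving path):
url, headers, body = build_chat_request(
    "http://nim-llm.nim:8000/v1", "meta/llama3-8b-instruct", "Hello!"
)
print(url)
```

Passing any non-empty `api_key` satisfies the SDK's startup check, which is all the commit needs for a self-hosted endpoint.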

website/docs/gen-ai/inference/nvidia-nim-llama3.md

Lines changed: 41 additions & 14 deletions
@@ -30,6 +30,8 @@ NIMs are packaged as container images on a per model/model family basis. Each NI
 
 ![NIM Architecture](img/nim-architecture.png)
 
+Source: https://docs.nvidia.com/nim/large-language-models/latest/introduction.html#architecture
+
 ## Overview of this deployment pattern on Amazon EKS
 
 This pattern combines the capabilities of NVIDIA NIM, Amazon Elastic Kubernetes Service (EKS), and various AWS services to deliver a high-performance and cost-optimized model serving infrastructure.
@@ -52,6 +54,9 @@ By combining these components, our proposed solution delivers a powerful and cos
 
 Before getting started with NVIDIA NIM, ensure you have the following:
 
+<details>
+<summary>Click to expand the NVIDIA NIM account setup details</summary>
+
 **NVIDIA AI Enterprise Account**
 
 - Register for an NVIDIA AI Enterprise account. If you don't have one, you can sign up for a trial account using this [link](https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=NVAIEnterprise).
@@ -87,6 +92,7 @@ echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-s
 docker pull nvcr.io/nim/meta/llama3-8b-instruct:latest
 ```
 You do not have to wait for it to complete, just to make sure the API key is valid to pull the image.
+</details>
 
 The following are required to run this tutorial
 - An active AWS account with admin equivalent permissions
@@ -319,11 +325,14 @@ kubectl apply -f genaiperf-deploy.yaml
 ```
 
 Once the pod is ready with running status `1/1`, can execute into the pod.
+
 ```bash
 export POD_NAME=$(kubectl get po -l app=tritonserver -ojsonpath='{.items[0].metadata.name}')
 kubectl exec -it $POD_NAME -- bash
 ```
+
 Run the testing to the deployed NIM Llama3 model
+
 ```bash
 genai-perf \
   -m meta/llama3-8b-instruct \
@@ -342,6 +351,7 @@ genai-perf \
   --profile-export-file my_profile_export.json \
   --url nim-llm.nim:8000
 ```
+
 You should see similar output like the following
 
 ```bash
@@ -362,20 +372,19 @@ You should be able to see the [metrics](https://docs.nvidia.com/deeplearning/tri
 
 To understand the command line options, please refer to [this documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html#command-line-options).
 
-
 ## Observability
+
 As part of this blueprint, we have also deployed the Kube Prometheus stack, which provides Prometheus server and Grafana deployments for monitoring and observability.
 
 First, let's verify the services deployed by the Kube Prometheus stack:
 
 ```bash
-kubectl get svc -n kube-prometheus-stack
+kubectl get svc -n monitoring
 ```
 
 You should see output similar to this:
 
 ```text
-kubectl get svc -n kube-prometheus-stack
 NAME                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
 kube-prometheus-stack-grafana              ClusterIP   172.20.225.77    <none>        80/TCP     10m
 kube-prometheus-stack-kube-state-metrics   ClusterIP   172.20.237.248   <none>        8080/TCP   10m
@@ -394,7 +403,9 @@ kubectl port-forward -n nim svc/nim-llm 8000
 curl localhost:8000/metrics # run this in another terminal
 ```
 
-We also provided a pre-configured Grafana dashboard. In the Grafana dashboard below, it contains several important metrics:
+### Grafana Dashboard
+
+We provides a pre-configured Grafana dashboard to better visualize NIM status. In the Grafana dashboard below, it contains several important metrics:
 
 - **Time to First Token (TTFT)**: The latency between the initial inference request to the model and the return of the first token.
 - **Inter-Token Latency (ITL)**: The latency between each token after the first.
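The TTFT and ITL definitions in the hunk above can be made concrete with a short sketch; the function and the timestamps are illustrative only, not part of the commit or of NIM's metrics code:

```python
def ttft_and_itl(request_time_ms, token_times_ms):
    """Compute Time to First Token and mean Inter-Token Latency.

    request_time_ms: when the inference request was sent (milliseconds).
    token_times_ms:  arrival time of each streamed token (milliseconds).
    """
    if not token_times_ms:
        raise ValueError("no tokens received")
    # TTFT: delay until the very first token comes back.
    ttft = token_times_ms[0] - request_time_ms
    # ITL: average gap between consecutive tokens after the first.
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Example: request at t=0, first token after 500 ms, then one token every 100 ms.
ttft, itl = ttft_and_itl(0, [500, 600, 700, 800])
print(ttft, itl)  # 500 100.0
```

This is the same decomposition the dashboard panels use: TTFT captures prefill latency, while ITL captures steady-state decode speed.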
@@ -404,34 +415,50 @@ You can find more metrics description from this [document](https://docs.nvidia.c
 
 ![NVIDIA LLM Server](img/nim-dashboard.png)
 
-You can visualize these metrics using the Grafana. To view the Grafana dashboard to monitor these metrics, follow the steps below:
+To view the Grafana dashboard to monitor these metrics, follow the steps below:
 
-```bash
-- Port-forward Grafana service:
-kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
+<details>
+<summary>Click to expand details</summary>
+
+**1. Retrieve the Grafana password.**
 
-- Grafana Admin user
-admin
+The password is saved in the AWS Secret Manager. Below Terraform command will show you the secret name.
 
-- Get secret name from Terraform output
+```bash
 terraform output grafana_secret_name
+```
 
-- Get admin user password
+Then use the output secret name to run below command,
+
+```bash
 aws secretsmanager get-secret-value --secret-id <grafana_secret_name_output> --region $AWS_REGION --query "SecretString" --output text
 ```
 
-**Login to Grafana:**
+**2. Expose the Grafana Service**
+
+Use port-forward to expose the Grafana service.
+
+```bash
+kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
+```
+
+**3. Login to Grafana:**
 
 - Open your web browser and navigate to [http://localhost:3000](http://localhost:3000).
 - Login with the username `admin` and the password retrieved from AWS Secrets Manager.
 
-**Open the NIM Monitoring Dashboard:**
+**4. Open the NIM Monitoring Dashboard:**
 
 - Once logged in, click "Dashboards" on the left sidebar and search "nim"
 - You can find the Dashboard `NVIDIA NIM Monitoring` from the list
 - Click and entering to the dashboard.
 
 You should now see the metrics displayed on the Grafana dashboard, allowing you to monitor the performance your NVIDIA NIM service deployment.
+</details>
+
+:::info
+As of writing this guide, NVIDIA also provides an example Grafana dashboard. You can check it from [here](https://docs.nvidia.com/nim/large-language-models/latest/observability.html#grafana).
+:::
 
 ## Cleanup
 
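The `curl localhost:8000/metrics` step in the documentation above returns Prometheus text exposition format. A stdlib-only sketch of pulling values out of such output follows; the parser is illustrative and handles only the common cases, and the sample metric names are placeholders rather than NIM's actual metric names (those are listed in the NVIDIA docs linked in the guide):

```python
def parse_prometheus_text(text):
    """Parse simple Prometheus text-format lines into {metric: value}.

    Handles the common `name value` and `name{labels} value` forms;
    skips comment lines (# HELP / # TYPE) and blank lines. Lines with
    trailing timestamps or spaces inside label values are not handled.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space: everything before it is the metric name
        # (possibly with a {labels} block), everything after is the value.
        name_part, _, value_part = line.rpartition(" ")
        metrics[name_part] = float(value_part)
    return metrics

# Illustrative sample only; real NIM metric names differ.
sample = """\
# HELP num_requests_running Number of requests currently running
# TYPE num_requests_running gauge
num_requests_running 2
gpu_cache_usage_perc{gpu="0"} 0.35
"""
metrics = parse_prometheus_text(sample)
print(metrics["num_requests_running"])  # 2.0
```

In practice you would scrape the endpoint with Prometheus itself (as the Kube Prometheus stack deployed by this blueprint does) rather than parsing by hand; the sketch is only meant to show what the `/metrics` payload contains.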