This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
Provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.
```bash
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh
```
Stand up an llm-d stack (default deployment method is llm-d-modelservice, serving llama-1b), run a harness (default vllm-benchmark) with a load profile (default simple-random), and tear down the stack:

```bash
./e2e.sh
```
To run a specific harness and load profile:

```bash
./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>
```
The benchmarking system drives synthetic or trace-based traffic into an llm-d-powered inference stack, orchestrated via Kubernetes. Requests are routed through a scalable load generator, with results collected and visualized for latency, throughput, and cache effectiveness.
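As a sketch of how a run can be observed on the cluster (the namespace and label selector below are placeholders, not values fixed by llm-d-benchmark):

```bash
# Watch the inference stack and load-generator pods while a benchmark runs.
# Replace the namespace and label selector with the ones from your deployment.
kubectl get pods -n <llm-d-namespace> -w

# Tail the load generator's output to follow request progress.
kubectl logs -n <llm-d-namespace> -l <load-generator-label> -f
```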
Each benchmark run collects enough information to enable re-execution on different clusters/environments with minimal setup effort.
Multiple load generators and load profiles are available, in a pluggable architecture that allows expansion.
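For example, the invocation pattern shown above can select a different harness and load profile simply by changing the `--harness` and `--workload` flags (the profile name below is a placeholder):

```bash
# Same stack, different load generator and load profile.
# guidellm is one of the supported harnesses; substitute any supported profile name.
./run.sh --harness guidellm --workload <workload_profile> --methods <a string that matches an inference service or pod>
```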
Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.
For a discussion of candidate relevant metrics, please consult this document
Category | Metric | Unit |
---|---|---|
Throughput | Output tokens / second | tokens / second |
Throughput | Input tokens / second | tokens / second |
Throughput | Requests / second | qps |
Latency | Time per output token (TPOT) | ms per output token |
Latency | Time to first token (TTFT) | ms |
Latency | Time per request (TTFT + TPOT * output length) | seconds per request |
Latency | Normalized time per output token (TTFT/output length + TPOT), aka NTPOT | ms per output token |
Latency | Inter Token Latency (ITL) - Time between decode tokens within a request | ms per output token |
Correctness | Failure rate | queries |
Experiment | Benchmark duration | seconds |
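To make the composite latency rows concrete, here is a small sketch that plugs assumed numbers (250 ms TTFT, 18 ms TPOT, 512 output tokens) into the formulas from the table:

```bash
# Illustrative numbers only; these are not measurements.
TTFT_MS=250      # time to first token (ms)
TPOT_MS=18       # time per output token (ms)
OUTPUT_LEN=512   # output tokens in the response

# Time per request = TTFT + TPOT * output length  (converted to seconds)
echo "scale=2; ($TTFT_MS + $TPOT_MS * $OUTPUT_LEN) / 1000" | bc   # -> 9.46 seconds

# NTPOT = TTFT / output length + TPOT  (ms per output token)
echo "scale=2; $TTFT_MS / $OUTPUT_LEN + $TPOT_MS" | bc            # -> 18.48 ms/token
```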
Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.
For a discussion of relevant workloads, please consult this document
Workload | Use Case | ISL | ISV | OSL | OSV | OSP | Latency |
---|---|---|---|---|---|---|---|
Interactive Chat | Chat agent | Medium | High | Medium | Medium | Medium | Per token |
Classification of text | Sentiment analysis | Medium | | Short | Low | High | Request |
Classification of images | Nudity filter | Long | Low | Short | Low | High | Request |
Summarization / Information Retrieval | Q&A from docs, RAG | Long | High | Short | Medium | Medium | Per token |
Text generation | | Short | High | Long | Medium | Low | Per token |
Translation | | Medium | High | Medium | Medium | High | Per token |
Code completion | Type ahead | Long | High | Short | Medium | Medium | Request |
Code generation | Adding a feature | Long | High | Medium | High | Medium | Request |
llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).
Pieces of information identifying a particular cluster. This information includes, but is not limited to, GPU model, LLM model, and llm-d parameters (an environment file, and optionally a values.yaml file for modelservice Helm charts).
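As an illustration only, a scenario's environment file might pin down values such as the ones below; the variable names here are hypothetical and not the exact ones consumed by llm-d-benchmark:

```bash
# Hypothetical scenario environment file (variable names are illustrative,
# not the exact ones consumed by llm-d-benchmark).
export SCENARIO_GPU_MODEL="<gpu model>"        # accelerator the results were gathered on
export SCENARIO_LLM_MODEL="<llm model>"        # model served by the llm-d stack
export SCENARIO_VALUES_FILE="values.yaml"      # optional modelservice helm chart overrides
```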
Load Generator (Python code) which drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, and the benchmarks found in the benchmarks folder of vLLM. There are ongoing efforts to consolidate and provide an easier way to support different load generators.
Workload is the actual benchmark load specification, which includes the LLM use case to benchmark, traffic pattern, input/output distribution, and dataset. Supported workload profiles can be found under workload/profiles.
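A quick way to see what is available and exercise one of the profiles (the profile name below is a placeholder):

```bash
# List the workload profiles shipped with the repository.
ls workload/profiles/

# Run the chosen profile through one of the supported harnesses.
./run.sh --harness inference-perf --workload <profile name> --methods <a string that matches an inference service or pod>
```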
Important
The triple `<scenario>`, `<harness>`, `<workload>`, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow an experiment to be reproduced.
- Instructions on how to contribute, including details on our development process and governance.
- We use Slack to discuss development across organizations. Please join: Slack. There is a `sig-benchmarking` channel there.
- We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d Google Groups will grant you access.
This project is licensed under Apache License 2.0. See the LICENSE file for details.