
llm-d-benchmark

This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.

Goal

Provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.

📦 Repository Setup

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh
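
A quick, optional sanity check after installing dependencies (this assumes kubectl, helm, and Python are expected on the PATH, since the stack is orchestrated via Kubernetes and the harnesses are Python code):

kubectl version --client
helm version
python3 --version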

Quickstart

Stand up an llm-d stack (default deployment method is llm-d-modelservice, serving llama-1b), run a harness (default vllm-benchmark) with a load profile (default simple-random), and tear down the stack:

./e2e.sh

Run the harness inference-perf with the load profile chatbot_synthetic against a pre-deployed stack:

./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>
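
For example, if the pre-deployed stack exposes an inference service whose name contains the string llama-1b (a hypothetical name used purely for illustration), the invocation would be:

./run.sh --harness inference-perf --workload chatbot_synthetic --methods llama-1b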

Architecture

The benchmarking system drives synthetic or trace-based traffic into an llm-d-powered inference stack, orchestrated via Kubernetes. Requests are routed through a scalable load generator, with results collected and visualized for latency, throughput, and cache effectiveness.
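
Before driving load against a pre-deployed stack, it can help to confirm that the inference pods and services are actually up. A minimal sketch, assuming the stack lives in a namespace named llm-d (a hypothetical name; substitute your own):

kubectl get pods -n llm-d   # hypothetical namespace; use the one your stack was deployed into
kubectl get svc -n llm-d    # these service/pod names are what run.sh --methods matches against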

Goals

Reproducibility

Each benchmark run collects enough information to allow the experiment to be re-executed on different clusters/environments with minimal setup effort.

Flexibility

Multiple load generators and multiple load profiles are available, in a pluggable architecture that allows expansion.

Well-defined set of Metrics

Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.

For a discussion of candidate relevant metrics, please consult this document

Category    | Metric                                                                     | Unit
------------|----------------------------------------------------------------------------|---------------------
Throughput  | Output tokens / second                                                     | tokens / second
Throughput  | Input tokens / second                                                      | tokens / second
Throughput  | Requests / second                                                          | qps
Latency     | Time per output token (TPOT)                                               | ms per output token
Latency     | Time to first token (TTFT)                                                 | ms
Latency     | Time per request (TTFT + TPOT * output length)                             | seconds per request
Latency     | Normalized time per output token (TTFT / output length + TPOT), aka NTPOT  | ms per output token
Latency     | Inter-token latency (ITL): time between decode tokens within a request    | ms per output token
Correctness | Failure rate                                                               | queries
Experiment  | Benchmark duration                                                         | seconds
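
As a quick illustration of how the derived latency metrics in the table relate to one another, the snippet below plugs in example numbers (TTFT = 120 ms, TPOT = 25 ms/token, output length = 200 tokens; values chosen purely for illustration):

awk 'BEGIN {
  ttft = 120; tpot = 25; out_len = 200               # example values: ms, ms per token, tokens
  printf "time per request: %.2f s\n", (ttft + tpot * out_len) / 1000
  printf "NTPOT:            %.2f ms/token\n", ttft / out_len + tpot
}'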

Relevant collection of Workloads

Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.

For a discussion of relevant workloads, please consult this document

Workload                               | Use Case           | ISL    | ISV  | OSL    | OSV    | OSP    | Latency
---------------------------------------|--------------------|--------|------|--------|--------|--------|----------
Interactive Chat                       | Chat agent         | Medium | High | Medium | Medium | Medium | Per token
Classification of text                 | Sentiment analysis | Medium |      | Short  | Low    | High   | Request
Classification of images               | Nudity filter      | Long   | Low  | Short  | Low    | High   | Request
Summarization / Information Retrieval  | Q&A from docs, RAG | Long   | High | Short  | Medium | Medium | Per token
Text generation                        |                    | Short  | High | Long   | Medium | Low    | Per token
Translation                            |                    | Medium | High | Medium | Medium | High   | Per token
Code completion                        | Type ahead         | Long   | High | Short  | Medium | Medium | Request
Code generation                        | Adding a feature   | Long   | High | Medium | High   | Medium | Request

Design and Roadmap

llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).

Main concepts (identified by specific directories)

Scenarios

Pieces of information identifying a particular cluster. This includes, but is not limited to, the GPU model, the LLM model, and llm-d parameters (an environment file, and optionally a values.yaml file for the modelservice helm charts).
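
As a purely illustrative sketch of what such an environment file could capture (the variable names below are hypothetical and are not the project's actual keys), a scenario might look like:

# scenario.env -- hypothetical example; variable names are illustrative only
export GPU_MODEL="H100"            # accelerator the experiment assumes
export LLM_MODEL="llama-1b"        # model served by the llm-d stack
export VALUES_FILE="values.yaml"   # optional overrides for the modelservice helm charts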

Harness

Load generator (Python code) that drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, and the benchmarks found in the benchmarks folder of vLLM. There are ongoing efforts to consolidate and provide an easier way to support different load generators.

Workload

A workload is the actual benchmark load specification, which includes the LLM use case to benchmark, the traffic pattern, the input/output distribution, and the dataset. Supported workload profiles can be found under workload/profiles.
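
To see which profiles ship with the repository, listing that directory (path taken from the statement above) is enough; the exact layout of subdirectories may vary per harness:

ls -R workload/profiles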

Important

The triple <scenario>, <harness>, <workload>, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow an experiment to be reproduced.

Dependencies

Topics

Contribute

  • Instructions on how to contribute, including details on our development process and governance.
  • We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
  • We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d Google group will grant you access.

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.
