A benchmarking tool for measuring and comparing the performance of LLM inference runtimes.
Turtlenekko is a benchmarking tool designed to measure the performance of Large Language Models (LLMs) through OpenAI API compatible chat completion endpoints. It provides reports on token processing speeds.
Many LLM benchmarking tools assume specific inference engines and only measure hardware performance (often for a short list of preselected models). While this is good for apples-to-apples comparisons of the most popular software stacks on assorted hardware, it falls short when it comes to less popular software/hardware combinations.
The main goal of Turtlenekko is benchmarking arbitrary inference runtimes as long as they expose an OpenAI API compatible(-ish) chat completions API. This enables users to evaluate less popular inference runtimes, especially those that employ non-mainstream hardware, such as custom accelerators.
Another Turtlenekko goal is exploring the space of different configurations. Nothing is fixed, and you can run different models against the same runtime. Why not measure how adding more threads improves performance? Does enabling some optimization (like flash attention) actually help? Turtlenekko can do all of the above (and anything else you can parameterize about the runtime) in a single run.
Data like this can help select the optimal configuration for your deployment, guide inference engine optimization efforts, or even act as a performance regression checker for your inference provider.
Key capabilities:
- Inference runtime agnostic - works with any LLM server that exposes an OpenAI API compatible chat completion endpoint
- Supports both local and remote LLM deployments
- Conducts parametrized benchmarks across multiple dimensions (models, CPU allocations, batch sizes, etc.)
- Tests all combinations of parameters through a flexible parameter matrix
- Generates detailed performance reports in multiple formats (JSON, CSV, text)
- Originally developed to benchmark NekkoAPI, but works with any compatible LLM server
- Repeats benchmark samples several times until the result is reliable enough
- Measures KV cache reuse (making a call with the same prompt is expected to result in negligible prompt processing times)
Limitations:
- Doesn't support the text completion API
- Doesn't measure performance of concurrent requests (thus ignoring benefits of continuous batching)
- Doesn't work correctly with model architectures that use dynamic attention, where inference speed depends on the content of the prompt; this limitation is caused by the prompts being pure gibberish.
- Slow. Like really slow. Turtlenekko itself is not slow, but multiple samples per run with multiple runs per matrix can take quite some time.
TODO:
- Measure performance of concurrent requests
- Use client-side token counter when tokenizer is known (should increase precision and lower benchmarking duration)
- Add support for text completions endpoint
- Add `exclude` and other filtering constructs to the parameter matrix
- Implement a web API to run benchmarks with precise control
Install Turtlenekko with:
go install github.com/aifoundry-org/turtlenekko@latest
Turtlenekko uses a configuration file to define benchmark parameters. You can create a default configuration file with:
turtlenekko init
Run a benchmark with:
turtlenekko benchmark --config config.yaml --format json
Available output formats:
- `json`: Structured JSON output for programmatic consumption and integration with other tools
- `text`: Human-readable text output for quick analysis
- `csv`: CSV format for spreadsheet analysis and data visualization
The JSON output provides detailed benchmark results in a structured format:
[
{
"params": {
"model": "llama3-7b",
"threads": "8"
},
"short_context_prompt_tokens_per_sec": 2380.95,
"short_context_cached_prompt_tokens_per_sec": 12500.00,
"short_context_completion_tokens_per_sec": 7.96,
"short_context_r_squared": 0.99,
"long_context_prompt_tokens_per_sec": 1123.60,
"long_context_cached_prompt_tokens_per_sec": 8333.33,
"long_context_completion_tokens_per_sec": 5.34,
"long_context_r_squared": 0.99,
"localscore_estimate": 20.95
},
{
"params": {
"model": "mistral-7b",
"threads": "4"
},
"short_context_prompt_tokens_per_sec": 1960.78,
"short_context_cached_prompt_tokens_per_sec": 10000.00,
"short_context_completion_tokens_per_sec": 10.17,
"short_context_r_squared": 0.99,
"long_context_prompt_tokens_per_sec": 952.38,
"long_context_cached_prompt_tokens_per_sec": 7142.86,
"long_context_completion_tokens_per_sec": 6.89,
"long_context_r_squared": 0.99,
"localscore_estimate": 21.88
}
]
Each object in the array represents one benchmark run with:
- `params`: The parameters used for this run (only those with `output: true`)
- Short context metrics (few hundred tokens):
  - `short_context_prompt_tokens_per_sec`: Prompt tokens processed per second
  - `short_context_cached_prompt_tokens_per_sec`: Cached prompt tokens processed per second (KV cache reuse)
  - `short_context_completion_tokens_per_sec`: Completion tokens generated per second
  - `short_context_r_squared`: Statistical measure of how well the model fits the data (0-1)
- Long context metrics (around 3000 tokens):
  - `long_context_prompt_tokens_per_sec`: Prompt tokens processed per second
  - `long_context_cached_prompt_tokens_per_sec`: Cached prompt tokens processed per second (KV cache reuse)
  - `long_context_completion_tokens_per_sec`: Completion tokens generated per second
  - `long_context_r_squared`: Statistical measure of how well the model fits the data (0-1)
- `localscore_estimate`: Estimated LocalScore - a composite performance score based on average prompt speed, generation speed, and responsiveness across both contexts
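Because the JSON format is meant for programmatic consumption, it can be loaded directly into a small struct. The following is a minimal sketch (an illustration, not an official Turtlenekko API); the `results.json` filename is an assumption (a file you saved the benchmark output to), and the field names mirror the example above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Result mirrors one entry of the JSON report shown above.
type Result struct {
	Params                               map[string]string `json:"params"`
	ShortContextPromptTokensPerSec       float64           `json:"short_context_prompt_tokens_per_sec"`
	ShortContextCachedPromptTokensPerSec float64           `json:"short_context_cached_prompt_tokens_per_sec"`
	ShortContextCompletionTokensPerSec   float64           `json:"short_context_completion_tokens_per_sec"`
	ShortContextRSquared                 float64           `json:"short_context_r_squared"`
	LongContextPromptTokensPerSec        float64           `json:"long_context_prompt_tokens_per_sec"`
	LongContextCachedPromptTokensPerSec  float64           `json:"long_context_cached_prompt_tokens_per_sec"`
	LongContextCompletionTokensPerSec    float64           `json:"long_context_completion_tokens_per_sec"`
	LongContextRSquared                  float64           `json:"long_context_r_squared"`
	LocalscoreEstimate                   float64           `json:"localscore_estimate"`
}

func main() {
	// Assumes the benchmark output was saved, e.g.:
	// turtlenekko benchmark --config config.yaml --format json > results.json
	data, err := os.ReadFile("results.json")
	if err != nil {
		panic(err)
	}
	var results []Result
	if err := json.Unmarshal(data, &results); err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Printf("%v: %.2f completion tokens/sec (short context)\n",
			r.Params, r.ShortContextCompletionTokensPerSec)
	}
}
```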
The CSV output is ideal for importing into spreadsheet applications:
model,threads,short_context_prompt_tokens_per_sec,short_context_cached_prompt_tokens_per_sec,short_context_completion_tokens_per_sec,short_context_r_squared,long_context_prompt_tokens_per_sec,long_context_cached_prompt_tokens_per_sec,long_context_completion_tokens_per_sec,long_context_r_squared,localscore_estimate
llama3-7b,8,2380.95,12500.00,7.96,0.99,1123.60,8333.33,5.34,0.99,20.95
mistral-7b,4,1960.78,10000.00,10.17,0.99,952.38,7142.86,6.89,0.99,21.88
The CSV includes:
- All parameters marked with `output: true` in the configuration
- All performance metrics in a tabular format
- Headers for easy identification of columns
The text output provides a human-readable summary of each benchmark run:
=== Matrix Combination 1 ===
Parameters:
model: llama3-7b
threads: 8
Short Context Results:
Prompt processing: 2380.95 tokens/sec
Cached prompt processing: 12500.00 tokens/sec
Completion generation: 7.96 tokens/sec
Model fit quality (R²): 0.99
Long Context Results:
Prompt processing: 1123.60 tokens/sec
Cached prompt processing: 8333.33 tokens/sec
Completion generation: 5.34 tokens/sec
Model fit quality (R²): 0.99
Localscore Estimate: 20.95
Turtlenekko supports different drivers to manage the LLM runtime environment:
The dummy driver doesn't set up any environment and simply connects to an already running LLM server.
Configuration Example:
driver: "dummy"
matrix:
url:
values: ["http://localhost:8000/v1/chat/completions"]
output: true
model:
values: ["llama3"]
output: true
Parameters:
- `url`: The endpoint URL of the LLM server (required)
- `model`: The model name to use (required)
The local_cmd driver executes shell commands to start and stop the LLM server before and after benchmarking. This is useful for testing different server configurations or when you need to manage the server lifecycle.
Configuration Example:
driver: "local_cmd"
matrix:
url:
values: ["http://localhost:8000/v1/chat/completions"]
output: false
model:
values: ["/models/llama3-7b.gguf", "/models/mistral-7b.gguf"]
output: true
threads:
values: ["4", "8"]
output: true
setup_cmd:
values: ["docker run -d --rm -p 8000:8000 -v ~/models:/models -e THREADS={{.threads}} -e MODEL_PATH={{.model}} llm-server:latest"]
output: false
teardown_cmd:
values: ["docker stop $(docker ps -q --filter ancestor=llm-server:latest)"]
output: false
Parameters:
- `url`: The endpoint URL of the LLM server (required)
- `model`: The model name or path to use (required)
- `setup_cmd`: Command to run before benchmarking (supports Go templates for parameter interpolation)
- `teardown_cmd`: Command to run after benchmarking (supports Go templates)
- Any additional parameters you want to test in your matrix
Template Variables:
The `setup_cmd` and `teardown_cmd` support Go template variables that are replaced with the current parameter values. For example, `{{.threads}}` will be replaced with the current value of the `threads` parameter.
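For illustration, here is a minimal, hypothetical sketch of how this kind of Go template interpolation behaves in general (the parameter values and command string below are made up, not taken from a real run):

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	// Current matrix combination (hypothetical values).
	params := map[string]string{
		"model":   "/models/llama3-7b.gguf",
		"threads": "8",
	}
	setupCmd := "docker run -d --rm -e THREADS={{.threads}} -e MODEL_PATH={{.model}} llm-server:latest"
	tmpl := template.Must(template.New("setup_cmd").Parse(setupCmd))
	// Prints the command with {{.threads}} and {{.model}} substituted.
	if err := tmpl.Execute(os.Stdout, params); err != nil {
		panic(err)
	}
}
```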
The `matrix` section defines parameters to test in all possible combinations:
matrix:
  parameter1:
    values: ["value1", "value2"]
    output: true # Include in results output
  parameter2:
    values: ["valueA", "valueB"]
    output: false # Don't include in results output
Each parameter can be specified as:
- A simple array: `param: ["value1", "value2"]`
- An object with values and an output flag: `param: {values: ["value1", "value2"], output: true}`

The `output` flag controls whether the parameter appears in the benchmark results.
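To illustrate what "all possible combinations" means, here is a minimal sketch (not Turtlenekko's actual implementation) that expands a parameter matrix into its Cartesian product; two models times two thread counts yield four benchmark runs:

```go
package main

import "fmt"

// combinations returns every combination of the parameter values.
func combinations(matrix map[string][]string) []map[string]string {
	result := []map[string]string{{}}
	for name, values := range matrix {
		var next []map[string]string
		for _, combo := range result {
			for _, v := range values {
				c := map[string]string{name: v}
				for k, val := range combo {
					c[k] = val
				}
				next = append(next, c)
			}
		}
		result = next
	}
	return result
}

func main() {
	// Hypothetical matrix with two parameters of two values each.
	matrix := map[string][]string{
		"model":   {"/models/llama3-7b.gguf", "/models/mistral-7b.gguf"},
		"threads": {"4", "8"},
	}
	// Prints 4 combinations: each model with each thread count.
	for _, c := range combinations(matrix) {
		fmt.Println(c)
	}
}
```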
Turtlenekko uses a statistical approach to measure LLM performance metrics that cannot be directly controlled through the OpenAI API interface:
When benchmarking LLMs through a chat completion API, several challenges exist:
- We cannot precisely control the exact number of tokens processed
- The API returns total response time, but doesn't break down processing stages
- We have to actively prevent the runtime from using KV cache for subsequent requests to get representative results
To overcome these limitations, Turtlenekko:
- Samples Multiple Data Points: Runs benchmarks with varying prompt lengths and completion token limits
- Measures at Different Context Lengths: Takes measurements at both short context (few hundred tokens) and long context (around 10,000 tokens) to capture performance degradation as context grows
- Randomizes Prompts: Generates random prompts to prevent KV cache reuse
- Collects Measurements: For each run, records:
  - Prompt token count (as reported by the API)
  - Completion token count (as reported by the API)
  - Total response time
- Fits Linear Regression Models: Uses the equation

  response_time = prompt_rate * prompt_tokens + cached_prompt_rate * cached_prompt_tokens + completion_rate * completion_tokens

  Separate models are fitted for short and long contexts; a minimal fitting sketch follows this list.
- Calculates Key Metrics:
  - Prompt Processing Rate: Time per prompt token (milliseconds) for both short and long contexts
  - Cached Prompt Processing Rate: Time per cached prompt token (milliseconds) when KV cache is reused
  - Completion Generation Rate: Time per completion token (milliseconds) for both short and long contexts
  - R-squared value: Indicates how well each model fits the data (0-1)
This approach allows Turtlenekko to:
- Separate the time spent on processing the input prompt from the time spent generating the completion
- Measure how performance degrades as context length increases
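The following is a minimal sketch of the fitting step under the stated model, assuming ordinary least squares without an intercept and using made-up sample values; the fitted coefficients are seconds per token, and their reciprocals correspond to the tokens-per-second figures in the reports. It is an illustration, not Turtlenekko's actual code:

```go
package main

import "fmt"

// fitRates solves the normal equations (X'X) b = X'y for the three
// per-token time coefficients: prompt, cached prompt, completion.
func fitRates(X [][3]float64, y []float64) [3]float64 {
	var xtx [3][3]float64
	var xty [3]float64
	for i, row := range X {
		for j := 0; j < 3; j++ {
			xty[j] += row[j] * y[i]
			for k := 0; k < 3; k++ {
				xtx[j][k] += row[j] * row[k]
			}
		}
	}
	// Gauss-Jordan elimination on the 3x3 system (no pivoting, for brevity).
	for i := 0; i < 3; i++ {
		p := xtx[i][i]
		for j := i; j < 3; j++ {
			xtx[i][j] /= p
		}
		xty[i] /= p
		for r := 0; r < 3; r++ {
			if r == i {
				continue
			}
			f := xtx[r][i]
			for j := i; j < 3; j++ {
				xtx[r][j] -= f * xtx[i][j]
			}
			xty[r] -= f * xty[i]
		}
	}
	return xty
}

func main() {
	// Made-up samples: columns are prompt, cached prompt and completion
	// token counts reported by the API; y is total response time in seconds.
	X := [][3]float64{{200, 0, 16}, {200, 200, 32}, {400, 0, 64}, {400, 400, 8}}
	y := []float64{2.7, 4.2, 8.5, 2.9}
	b := fitRates(X, y)
	fmt.Printf("seconds per prompt/cached/completion token: %.4f %.4f %.4f\n", b[0], b[1], b[2])
}
```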
Turtlenekko calculates a composite performance metric called estimated LocalScore, inspired by the scoring system used in the LocalScore benchmark. The formula is:
LocalScore = (prompt_tps * gen_tps * (1000/ttft_ms))^(1/3) * 10
Where:
- `prompt_tps`: Average prompt tokens processed per second across both short and long contexts
- `gen_tps`: Average completion tokens generated per second across both short and long contexts
- `ttft_ms`: Time to first token in milliseconds (calculated as average prompt tokens / prompt_tps * 1000)
The LocalScore is the geometric mean of these three metrics (with TTFT inverted since lower is better), multiplied by 10 for readability. This provides a single number that balances:
- Prompt processing speed
- Generation speed
- Responsiveness (via TTFT)
A higher estimated LocalScore indicates better overall performance. By averaging across both short and long contexts, the score reflects the model's performance across the entire context window range.
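As an illustration of the formula above, here is a minimal sketch in Go; the input values are hypothetical, and the average prompt token count is an assumption (Turtlenekko derives it from its own benchmark samples):

```go
package main

import (
	"fmt"
	"math"
)

// localScore applies the estimated LocalScore formula described above.
func localScore(promptTPS, genTPS, avgPromptTokens float64) float64 {
	ttftMS := avgPromptTokens / promptTPS * 1000 // time to first token, ms
	return math.Cbrt(promptTPS*genTPS*(1000/ttftMS)) * 10
}

func main() {
	// Hypothetical averages across short and long contexts.
	promptTPS := 1200.0       // prompt tokens per second
	genTPS := 6.5             // completion tokens per second
	avgPromptTokens := 1500.0 // assumed average prompt length
	fmt.Printf("estimated LocalScore: %.2f\n", localScore(promptTPS, genTPS, avgPromptTokens))
}
```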
Turtlenekko can calculate LocalScore estimates for you (`--localscore` command line argument, default: true).
If you want numbers that are somewhat comparable to the official LocalScore scores,
there are premade configurations in the `examples/localscore` folder to benchmark the NekkoAPI runtime
against the models used by the LocalScore tool. Just run:
# To run localscore benchmarks you have to clone Turtlenekko repository first:
git clone https://github.com/aifoundry-org/turtlenekko.git
cd turtlenekko
# Run the benchmarks:
make localscore-tiny
# or
make localscore-small
# or
make localscore-medium
These will download the corresponding models from Hugging Face and run the benchmarks.
Note: models are downloaded without authentication, so rate limits may apply.