Need help understanding llama-bench.exe outputs #9124
Hello, I am trying to understand the llama-bench.exe JSON outputs. Take this output for example: llamacpp_output.json. What are these metrics, and how are they calculated? I am trying to find/calculate:
Also, I'm quite confused that the JSON outputs don't match the `llama_print_timing` metrics printed with the `--verbose` flag: llamacpp_verbose_output.json. I'm not sure which tok/s or latency numbers to use, as I was using the `llama_print_timing` metrics before llama-bench was added.
You can do this with a … Note that the tests are performed from an empty context, but in practice the performance decreases as the size of the context increases, so the results cannot be extrapolated to every situation.
The throughput in `avg_ts` includes the prompt tokens, i.e. it's calculated as `200 tokens / 1814609460 ns`. If you prefer to count only the generated tokens, you can calculate that yourself, which would give you the 55 t/s.
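If it helps, here's a minimal sketch of that calculation done against the JSON file directly. It assumes the output is an array of result objects with `n_prompt`, `n_gen`, and `avg_ns` fields (names taken from the llama-bench source; they may differ between versions) and uses the file name from the question:

```python
import json

# Assumption: the file is llama-bench's JSON output (produced with -o json),
# an array of test-result objects. File name is the one from the question.
with open("llamacpp_output.json") as f:
    results = json.load(f)

NS_PER_S = 1e9

for r in results:
    seconds = r["avg_ns"] / NS_PER_S       # average wall time per run
    total = r["n_prompt"] + r["n_gen"]     # prompt + generated tokens

    # avg_ts counts all tokens over the total time...
    total_ts = total / seconds
    # ...whereas the generation-only rate divides just the generated tokens
    # by the same total time, so it comes out lower
    # (e.g. 100 gen tokens / ~1.81 s ≈ 55 t/s in the output above).
    gen_ts = r["n_gen"] / seconds

    print(f"avg_ts ≈ {total_ts:.2f} t/s, generation-only ≈ {gen_ts:.2f} t/s")
```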