Replies: 2 comments 1 reply
-
The tokens returned also contain the input prompt, so I don't think computing throughput from the output alone will solve your problem. See vllm/vllm/entrypoints/api_server.py, lines 64 to 67 in c6dfc3c. A rough sketch of separating the two is below.
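A minimal sketch of counting only the generated tokens under that constraint, assuming the response string starts with the echoed prompt (the tokenizer name and the helper function are placeholders, not part of the API server):

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; use the tokenizer of the model the server is running.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

def count_generated_tokens(prompt: str, returned_text: str) -> int:
    """Count only the newly generated tokens, stripping the echoed prompt."""
    if returned_text.startswith(prompt):
        returned_text = returned_text[len(prompt):]
    return len(tokenizer(returned_text, add_special_tokens=False).input_ids)
```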
0 replies
-
I believe the main reason is that prompt tokens also take compute. There are different ways to measure throughput for LLMs, but I believe the trends between systems should be similar under the different metrics; a rough comparison of two of them is sketched below. Moving this issue to Discussions for future questions.
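As an illustration, a minimal sketch comparing two common metrics, total (prompt + output) tokens per second versus output tokens per second, from the same timed run (all names here are hypothetical, not taken from the benchmark script):

```python
import time

def measure_throughput(prompt_token_counts, generate_fn):
    """Return (total_tok_per_s, output_tok_per_s) for one timed run.

    `generate_fn` is a placeholder that runs the engine and returns the
    number of tokens actually generated for each request.
    """
    start = time.perf_counter()
    output_token_counts = generate_fn()
    elapsed = time.perf_counter() - start

    prompt_tokens = sum(prompt_token_counts)
    output_tokens = sum(output_token_counts)
    # Prompt tokens still consume compute (prefill), so both metrics are reported.
    return (prompt_tokens + output_tokens) / elapsed, output_tokens / elapsed
```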
1 reply
-
vllm/benchmarks/benchmark_throughput.py
Lines 177 to 180 in c894836
You can see that the script just uses the output length requested for each request in the dataset, not the real number of tokens the model generates during evaluation; one way to measure the latter is sketched below.
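A minimal sketch of counting what the model actually generated via vLLM's offline LLM API, rather than the output length taken from the dataset (model name and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

# Count the tokens the model actually produced, not the dataset's output_len.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
prompt_tokens = sum(len(o.prompt_token_ids) for o in outputs)
print(f"prompt: {prompt_tokens} tokens, generated: {generated_tokens} tokens")
```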