When I used vLLM to accelerate inference for Llama3 70b, I observed an interesting phenomenon: when each input request consists of 1 token, there is a latency jump every 256 requests. Does anyone know why this happens?

Reply: because the default scheduler batch size is 256. vLLM's scheduler admits at most `max_num_seqs` sequences into a batch per scheduling step, and `max_num_seqs` defaults to 256, so the 257th concurrent request has to wait for the next batch, which shows up as a latency jump at that boundary.
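
For illustration, here is a minimal sketch of adjusting that limit with vLLM's offline Python API. The model id, `tensor_parallel_size`, and prompt count are placeholders, not from the thread:

```python
from vllm import LLM, SamplingParams

# max_num_seqs bounds how many sequences the scheduler runs in one
# batch; it defaults to 256, which matches the observed latency jump
# every 256 requests. Raising it lets more requests share a batch,
# at the cost of more KV-cache and activation memory.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model id
    tensor_parallel_size=4,                        # placeholder; size to your GPUs
    max_num_seqs=512,
)

# 300 one-token prompts: with the default max_num_seqs=256 the last 44
# would spill into a second scheduling step; with 512 they all fit in one.
prompts = ["Hi"] * 300
outputs = llm.generate(prompts, SamplingParams(max_tokens=8))
print(len(outputs))
```

When serving via the OpenAI-compatible server, the same knob is exposed as the `--max-num-seqs` engine argument.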
