Help! llama.cpp server stream freezes the current request and continues after processing the new request #9367

Answered by ngxson
AnonymousVibrate asked this question in Q&A

You can also try lowering the batch size, e.g. -b 32. Be careful: a lower batch size has a big impact on performance.

Also, it seems you're running on CPU, so the default batch size of 2048 takes significantly long to process.
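A minimal sketch of the suggested change, assuming a locally built llama-server binary and a hypothetical model path; -b / --batch-size is the flag being discussed:

```shell
# Hypothetical model path; adjust to your setup.
# With a smaller batch, prompt processing is split into more, shorter chunks,
# so on CPU the server reaches a point where it can handle a new request
# sooner — at the cost of slower overall prompt processing.
./llama-server -m ./models/model.gguf -c 4096 -b 32
```

This trades prompt-processing throughput for responsiveness, which matches the warning above about the performance impact.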

Replies: 1 comment, 7 replies (ngxson and steampunque, Sep 9, 2024). Answer selected by AnonymousVibrate.