
Question about server.cpp: loading prompt tokens, batch_view shrink, sampled token #6449

Answered by ggerganov
TD-Sky asked this question in Q&A

However, it won't call llama_sampling_sample if the last token whose logits flag is set to true is not in the batch_view:

I don't think this can ever happen since we always process the entire batch (in chunks/views of n_batch):

https://github.com/ggerganov/llama.cpp/blob/4399f13fb9462cd06f3f154d0aee738425000fea/examples/server/server.cpp#L2033-L2037

...

https://github.com/ggerganov/llama.cpp/blob/4399f13fb9462cd06f3f154d0aee738425000fea/examples/server/server.cpp#L2066-L2079

Each token with logits == true should fall in one of the batch views and will be processed.
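
For context, the chunked decode loop referenced above looks roughly like the sketch below. This is a simplified paraphrase of the linked server.cpp revision, not the exact code; the llama_batch field order and the error handling (server.cpp retries with a smaller n_batch on failure) are assumptions here.

```cpp
#include <algorithm>
#include "llama.h"

// Sketch of the chunked decode loop: the whole batch is walked in windows of
// at most n_batch tokens, and each window is decoded as a "view" that simply
// offsets the pointers of the original llama_batch. Because the loop covers
// every index from 0 to batch.n_tokens, any token whose logits flag is true
// lands in exactly one view and is processed.
static bool decode_in_views(llama_context * ctx, const llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,               // no embedding input
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,    // per-token logits flags travel with the view
            0, 0, 0,               // legacy fields, unused here
        };

        if (llama_decode(ctx, batch_view) != 0) {
            // server.cpp retries with a smaller n_batch; this sketch just bails
            return false;
        }
    }
    return true;
}
```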
