-
Hello, I'm trying to better understand the /metrics output from llama-server, specifically how the prompt token counts are computed. Another question I had: which model parameters can we change to control the length of the generated output? Thank you!
-
Try using
Thanks for the answer @dspasyuk. It pointed me in the right direction!
I found that the function `update_slots` handles this. These lines start processing the prompts from the slots within the server, and those tokens are counted as the initial prompt tokens:
https://github.com/ggerganov/llama.cpp/blob/0fff7fd79818980763a601660f25b01a0cf4b87a/examples/server/server.cpp#L1874-L1880
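To make sure I'm reading it right, here is a toy sketch of the idea (my own simplified illustration, not the code at that commit; `server_slot_sketch`, `tokenize_sketch`, and `start_prompt_processing` are names I made up, and the real server calls `llama_tokenize` instead of the placeholder tokenizer):

```cpp
#include <string>
#include <vector>

// Toy illustration: when a slot starts working on a request, the prompt is
// tokenized and its length is recorded as the slot's initial prompt-token
// count, which is what the prompt-token metrics are built from.
struct server_slot_sketch {
    std::vector<int> prompt_tokens;   // token ids of the request's prompt
    int n_prompt_tokens = 0;          // "initial prompt tokens" for this request
};

// stand-in for the real tokenizer (llama.cpp would call llama_tokenize here)
std::vector<int> tokenize_sketch(const std::string & prompt) {
    return std::vector<int>(prompt.size(), 0); // placeholder: one token per char
}

void start_prompt_processing(server_slot_sketch & slot, const std::string & prompt) {
    slot.prompt_tokens   = tokenize_sketch(prompt);
    slot.n_prompt_tokens = (int) slot.prompt_tokens.size();
}
```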
Then these lines check whether the prompt exceeds the context size (`slot.n_ctx`). If so, the input is truncated to fit within `n_ctx`:
https://github.com/ggerganov/llama.cpp/blob/0fff7fd79818980763a601660f25b01a0cf4b87a/examples/server/server.cpp#L1936-L1958
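As I understand it, the effect of that step is roughly the following (a simplified sketch of the behaviour, not the actual implementation; the real code works in half-context blocks and also honours `n_keep < 0`, which I've left out, and `truncate_prompt_sketch` is a hypothetical name):

```cpp
#include <algorithm>
#include <vector>

// Simplified sketch: if the prompt is longer than the slot's context size
// n_ctx, keep the first n_keep tokens plus the most recent tokens so the
// result fits in the context window; the middle of the prompt is dropped.
std::vector<int> truncate_prompt_sketch(const std::vector<int> & prompt_tokens,
                                        int n_ctx, int n_keep) {
    const int n_prompt = (int) prompt_tokens.size();
    if (n_prompt < n_ctx) {
        return prompt_tokens; // fits, nothing to do
    }

    n_keep = std::min(n_keep, n_ctx - 4);      // leave a little headroom
    const int n_tail = (n_ctx - n_keep) / 2;   // how much of the recent prompt to keep

    std::vector<int> truncated(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    truncated.insert(truncated.end(), prompt_tokens.end() - n_tail, prompt_tokens.end());
    return truncated;                           // guaranteed to be shorter than n_ctx
}
```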
Here, the function manages the cache, which it reuses for efficiency, like…
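My rough mental model of that reuse (again a sketch of the idea, not the server's code; `common_prefix_length_sketch` is a name I invented) is that the new prompt is compared against the tokens already sitting in the slot's KV cache, and only the part that differs is evaluated, with the shared prefix treated as already processed:

```cpp
#include <vector>

// Sketch: count how many leading tokens the cached sequence and the new
// prompt share; those tokens can be kept in the KV cache instead of being
// evaluated again.
int common_prefix_length_sketch(const std::vector<int> & cache_tokens,
                                const std::vector<int> & prompt_tokens) {
    size_t n = 0;
    while (n < cache_tokens.size() && n < prompt_tokens.size() &&
           cache_tokens[n] == prompt_tokens[n]) {
        n++;
    }
    return (int) n; // number of tokens reusable from the cache
}
```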