-
Hello, I'm trying to better understand the /metrics output from llama-server, specifically how the prompt token counts are computed. Another question I had: which model parameters can we change to control the length of the generated output? Thank you!
-
Try using
Thanks for the answer @dspasyuk. It pointed me in the right direction!
I found that the function `update_slots` handles this. These lines start processing the prompts from the slots within the server, and those tokens are counted as the initial prompt tokens:
https://github.com/ggerganov/llama.cpp/blob/0fff7fd79818980763a601660f25b01a0cf4b87a/examples/server/server.cpp#L1874-L1880
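To make sure I'm reading it right, here is a toy sketch of the idea (my own simplified illustration, not the code at that commit; `server_slot_sketch`, `tokenize_sketch`, and `start_prompt_processing` are names I made up, and the real server calls `llama_tokenize` instead of the placeholder tokenizer):

```cpp
#include <string>
#include <vector>

// Toy illustration: when a slot starts working on a request, the prompt is
// tokenized and its length is recorded as the slot's initial prompt-token
// count, which is what the prompt-token metrics are built from.
struct server_slot_sketch {
    std::vector<int> prompt_tokens;   // token ids of the request's prompt
    int n_prompt_tokens = 0;          // "initial prompt tokens" for this request
};

// stand-in for the real tokenizer (llama.cpp would call llama_tokenize here)
std::vector<int> tokenize_sketch(const std::string & prompt) {
    return std::vector<int>(prompt.size(), 0); // placeholder: one token per char
}

void start_prompt_processing(server_slot_sketch & slot, const std::string & prompt) {
    slot.prompt_tokens   = tokenize_sketch(prompt);
    slot.n_prompt_tokens = (int) slot.prompt_tokens.size();
}
```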
Then these lines check whether the prompt exceeds the context size (`slot.n_ctx`). If so, the input is truncated to fit within `n_ctx`:
https://github.com/ggerganov/llama.cpp/blob/0fff7fd79818980763a601660f25b01a0cf4b87a/examples/server/server.cpp#L1936-L1958
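As I understand it, the effect of that step is roughly the following (a simplified sketch of the behaviour, not the actual implementation; the real code works in half-context blocks and also honours `n_keep < 0`, which I've left out, and `truncate_prompt_sketch` is a hypothetical name):

```cpp
#include <algorithm>
#include <vector>

// Simplified sketch: if the prompt is longer than the slot's context size
// n_ctx, keep the first n_keep tokens plus the most recent tokens so the
// result fits in the context window; the middle of the prompt is dropped.
std::vector<int> truncate_prompt_sketch(const std::vector<int> & prompt_tokens,
                                        int n_ctx, int n_keep) {
    const int n_prompt = (int) prompt_tokens.size();
    if (n_prompt < n_ctx) {
        return prompt_tokens; // fits, nothing to do
    }

    n_keep = std::min(n_keep, n_ctx - 4);      // leave a little headroom
    const int n_tail = (n_ctx - n_keep) / 2;   // how much of the recent prompt to keep

    std::vector<int> truncated(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    truncated.insert(truncated.end(), prompt_tokens.end() - n_tail, prompt_tokens.end());
    return truncated;                           // guaranteed to be shorter than n_ctx
}
```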
Here, the function manages the cache, which it reuses for efficiency, like…
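My rough mental model of that reuse (again a sketch of the idea, not the server's code; `common_prefix_length_sketch` is a name I invented) is that the new prompt is compared against the tokens already sitting in the slot's KV cache, and only the part that differs is evaluated, with the shared prefix treated as already processed:

```cpp
#include <vector>

// Sketch: count how many leading tokens the cached sequence and the new
// prompt share; those tokens can be kept in the KV cache instead of being
// evaluated again.
int common_prefix_length_sketch(const std::vector<int> & cache_tokens,
                                const std::vector<int> & prompt_tokens) {
    size_t n = 0;
    while (n < cache_tokens.size() && n < prompt_tokens.size() &&
           cache_tokens[n] == prompt_tokens[n]) {
        n++;
    }
    return (int) n; // number of tokens reusable from the cache
}
```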