Confusing memory allocation system (gpu-memory-utilization)
#8634
ExtReMLapin announced in Q&A
Replies: 1 comment
-
The PagedAttention paper will answer your question better than I can here.
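In short, the reserved budget is not wasted: after loading the weights, vLLM carves the remainder into a pool of fixed-size KV-cache blocks that PagedAttention hands out to requests on demand. Here is a back-of-envelope sketch of that sizing (plain Python, not vLLM internals; the model numbers and 16-token block size are assumptions):

```python
def kv_cache_blocks(total_vram_gib: float,
                    gpu_memory_utilization: float,
                    weights_gib: float,
                    num_layers: int,
                    num_kv_heads: int,
                    head_dim: int,
                    block_size: int = 16,      # tokens per KV block (assumed default)
                    bytes_per_elem: int = 2):  # fp16/bf16 KV cache
    """Rough estimate of how many KV-cache blocks fit in the reserved budget."""
    budget_bytes = total_vram_gib * gpu_memory_utilization * 1024**3
    kv_bytes = budget_bytes - weights_gib * 1024**3  # ignores activations/overhead
    # per block: K and V (x2) for every layer, KV head and head dim, block_size tokens
    bytes_per_block = 2 * num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem
    return int(kv_bytes // bytes_per_block)

# Assumed Llama-3-8B-like numbers on a 24 GiB GPU at 90% utilization:
blocks = kv_cache_blocks(24, 0.90, weights_gib=16,
                         num_layers=32, num_kv_heads=8, head_dim=128)
print(blocks, "blocks ->", blocks * 16, "tokens of KV cache shared by all requests")
```

So a higher gpu-memory-utilization simply means more blocks, i.e. more concurrent requests and/or longer contexts, rather than memory that grows unpredictably at runtime.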
-
Hello,
I'm working on a piece of software that currently uses only llama.cpp as a backend.
Because of tensor parallelism and faster inference, I'm thinking about also supporting vLLM.
With llama.cpp, you pick the model, set the KV-cache quantization and the context size, and the model takes the VRAM it needs to take.
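For context, the kind of configuration I mean looks roughly like this (flag spellings from a recent llama-server build; adjust to your version):

```bash
# llama.cpp: context size and KV-cache quantization are explicit, and VRAM use
# is fixed once the server is up (weights + the KV cache you asked for).
./llama-server -m model.gguf \
    -c 16384 \
    --parallel 2 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0
# (quantizing the V cache may additionally require flash attention to be enabled)
```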
What's confusing with vLLM is that you "let it" take a percentage of the GPU. The percentage itself isn't really an issue, since it can be converted into an absolute value, but is it possible not to specify any VRAM usage at all and instead get a "take what you need, that's my problem, not yours" behavior, exactly like llama.cpp?
You can specify the context length using max-model-len, but why does vLLM take more VRAM if you let it? What is it even doing with it? With a local llama.cpp server, you allocate the context size yourself (ctx * n_users) and you know your VRAM usage right after booting up: you know it can take two users at a time and that the VRAM usage will not increase.
Sorry, but this "take x %" approach is super confusing.
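For what it's worth, the percentage can at least be derived from an absolute VRAM budget. A minimal sketch, assuming PyTorch is available to query the device (the helper name is mine, not part of vLLM):

```python
import torch

def utilization_for_budget(budget_gib: float, device: int = 0) -> float:
    """Fraction of the GPU's total memory that corresponds to an absolute budget."""
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    return min(budget_gib * 1024**3 / total_bytes, 1.0)

# e.g. cap vLLM at 20 GiB on GPU 0 (about 0.83 on a 24 GiB card), then pass it along:
#   vllm serve <model> --max-model-len 8192 --gpu-memory-utilization 0.83
# or via the Python API: LLM(model=..., gpu_memory_utilization=..., max_model_len=8192)
print(round(utilization_for_budget(20.0), 3))
```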