Confusing memory allocation system (gpu-memory-utilization)
#8634
ExtReMLapin announced in Q&A
Replies: 1 comment
-
The PagedAttention paper will answer your question better than I can here.
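In short, the reserved budget is not wasted: after loading the weights, vLLM carves the remainder into a pool of fixed-size KV-cache blocks that PagedAttention hands out to requests on demand. Here is a back-of-envelope sketch of that sizing (plain Python, not vLLM internals; the model numbers and 16-token block size are assumptions):

```python
def kv_cache_blocks(total_vram_gib: float,
                    gpu_memory_utilization: float,
                    weights_gib: float,
                    num_layers: int,
                    num_kv_heads: int,
                    head_dim: int,
                    block_size: int = 16,      # tokens per KV block (assumed default)
                    bytes_per_elem: int = 2):  # fp16/bf16 KV cache
    """Rough estimate of how many KV-cache blocks fit in the reserved budget."""
    budget_bytes = total_vram_gib * gpu_memory_utilization * 1024**3
    kv_bytes = budget_bytes - weights_gib * 1024**3  # ignores activations/overhead
    # per block: K and V (x2) for every layer, KV head and head dim, block_size tokens
    bytes_per_block = 2 * num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem
    return int(kv_bytes // bytes_per_block)

# Assumed Llama-3-8B-like numbers on a 24 GiB GPU at 90% utilization:
blocks = kv_cache_blocks(24, 0.90, weights_gib=16,
                         num_layers=32, num_kv_heads=8, head_dim=128)
print(blocks, "blocks ->", blocks * 16, "tokens of KV cache shared by all requests")
```

So a higher gpu-memory-utilization simply means more blocks, i.e. more concurrent requests and/or longer contexts, rather than memory that grows unpredictably at runtime.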
-
Hello,
I'm working on a piece of software that currently uses only llama.cpp as a backend.
Because of tensor parallelism and faster inference, I'm thinking about also supporting vLLM.
With llama.cpp, you pick the model, set the KV-cache quantization and the context size, and the model takes the VRAM it needs to take.
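For context, the kind of configuration I mean looks roughly like this (flag spellings from a recent llama-server build; adjust to your version):

```bash
# llama.cpp: context size and KV-cache quantization are explicit, and VRAM use
# is fixed once the server is up (weights + the KV cache you asked for).
./llama-server -m model.gguf \
    -c 16384 \
    --parallel 2 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0
# (quantizing the V cache may additionally require flash attention to be enabled)
```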
What's confusing with vLLM is that you "let it" take a percentage of the GPU. The percentage itself isn't really an issue, since it can be converted into an absolute value, but is it possible not to specify any VRAM usage at all and instead get a "take what you need, that's my problem, not yours" behavior, exactly like llama.cpp?
You can specify the context length using max-model-len, but why does vLLM take more VRAM if you let it? What is it even doing with it? With a local llama.cpp server, you allocate the context size yourself (ctx * n_users) and you know your VRAM usage right after booting up: you know it can take two users at a time and that the VRAM usage will not increase.
Sorry, but this "take x %" approach is super confusing.
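For what it's worth, the percentage can at least be derived from an absolute VRAM budget. A minimal sketch, assuming PyTorch is available to query the device (the helper name is mine, not part of vLLM):

```python
import torch

def utilization_for_budget(budget_gib: float, device: int = 0) -> float:
    """Fraction of the GPU's total memory that corresponds to an absolute budget."""
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    return min(budget_gib * 1024**3 / total_bytes, 1.0)

# e.g. cap vLLM at 20 GiB on GPU 0 (about 0.83 on a 24 GiB card), then pass it along:
#   vllm serve <model> --max-model-len 8192 --gpu-memory-utilization 0.83
# or via the Python API: LLM(model=..., gpu_memory_utilization=..., max_model_len=8192)
print(round(utilization_for_budget(20.0), 3))
```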