Replies: 2 comments
-
good looking call
-
It's a hard check; you can specify `max_model_len` to cap the context length (see the sketch below).
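A minimal sketch of that suggestion, assuming the same offline-inference script from the docs: passing `max_model_len` to the `LLM` constructor caps the context length the startup check has to budget for, and `gpu_memory_utilization` sets how much of the card vLLM may claim. The specific values (8192 tokens, 0.90) are illustrative assumptions, not requirements.

```python
from vllm import LLM, SamplingParams

# Cap the context length so the KV cache for one full-length sequence
# fits on a 24 GB card; 8192 is an arbitrary example value.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_model_len=8192,
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may reserve
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```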
-
I thought vLLM's dynamic memory management would let the GPU hold as much KV cache as it could, i.e. growing the KV cache one page at a time as requests run, and continually freeing finished requests and reallocating that memory.
However, when I put Llama-3.1-8B on a 24 GB 4090, it errors out saying there is not enough space for a FULL 128k context length.
I am effectively running the vLLM offline inference example, just with Llama-3.1-8B (https://docs.vllm.ai/en/v0.5.5/getting_started/examples/offline_inference.html).
What am I misunderstanding here?
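For context on why the check fires: at startup vLLM profiles GPU memory and requires that, after the weights are loaded, the KV cache can hold at least one sequence of `max_model_len` tokens. A back-of-the-envelope sketch (assuming Llama-3.1-8B's published config: 32 layers, 8 KV heads, head dim 128, 2-byte KV entries) shows why a full 128k context cannot fit on 24 GB; the numbers below are illustrative, not read from vLLM.

```python
# Rough KV-cache sizing for Llama-3.1-8B at a 128k context (assumed config values).
num_layers = 32
num_kv_heads = 8        # grouped-query attention: 8 KV heads
head_dim = 128
dtype_bytes = 2         # fp16 / bf16
context_len = 128 * 1024

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
kv_cache_gib = bytes_per_token * context_len / 2**30
weights_gib = 8e9 * dtype_bytes / 2**30

print(f"KV cache for one 128k sequence: ~{kv_cache_gib:.0f} GiB")  # ~16 GiB
print(f"Model weights in bf16:          ~{weights_gib:.0f} GiB")   # ~15 GiB
# ~31 GiB total exceeds a 24 GB 4090, so the startup check fails
# unless max_model_len is lowered.
```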