vLLM CPU Phi 3 mini 128K instruct - OOM issues #5059
Replies: 5 comments 2 replies
-
Oh, in case it helps, I am running vLLM from commit 2ba80bed2732edf42b1014ea4e34757849fc93d0.
-
Wow, okay, so an interesting follow-up. I bumped this down to microsoft/Phi-3-mini-4k-instruct, and it still OOMs with 32GB of RAM available to it. 😂 😭
-
Okay, same issue with commit 8e192ff967b44b186ea02d30e49fddf656fdfe50. Backing off to v0.4.2 and trying again.
-
Okay, same issue with vLLM v0.4.2. Any ideas of what to try next?
-
I gave the container twice the amount of memory that is given to the KV cache with the
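For reference, a minimal sketch of that kind of setup, assuming vLLM's CPU backend honors the `VLLM_CPU_KVCACHE_SPACE` environment variable (in GiB) to size the KV cache; the image name and model here are placeholders, not the exact command from this thread:

```shell
# Hypothetical launch: cap the CPU KV cache at 8 GiB via
# VLLM_CPU_KVCACHE_SPACE, then give the container roughly
# twice that (plus room for the model weights).
docker run --rm \
  --memory=32g \
  --cpus=12 \
  -e VLLM_CPU_KVCACHE_SPACE=8 \
  my-vllm-cpu-image \
  python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-mini-4k-instruct \
    --max-model-len 4096
```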
-
Hi y'all, I'm trying out vLLM on Phi 3 with no GPU, and I seem to be hitting some OOM issues with the model.
These are the configurations that I am running with:
I'm running in Docker with 32GB of memory available and 12 CPU cores. I've looked at the memory requirements for the model, and I can't quite fathom why it still OOMs on me. If I do not set `--max-model-len`, then I am not able to get anywhere at all, and I receive errors similar to this:

It never seems to have enough memory. 🤔
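A quick back-of-envelope check of the KV-cache footprint at the full 128K context may explain the OOM. The model dimensions below are my assumptions based on the published Phi-3-mini config (32 layers, 32 KV heads, head dim 96, fp16), not something stated in this thread:

```python
# Back-of-envelope KV-cache sizing for Phi-3-mini at a 128K context window.
# Dimensions are assumed from the published Phi-3-mini config; adjust if yours differ.
num_layers = 32    # num_hidden_layers
num_kv_heads = 32  # full multi-head attention, no grouped-query sharing
head_dim = 96      # hidden_size 3072 / 32 attention heads
dtype_bytes = 2    # fp16 / bf16

# Each token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(bytes_per_token)  # 393216 bytes, i.e. 384 KiB per token

max_model_len = 128 * 1024  # the 128K context window
total_gib = bytes_per_token * max_model_len / 2**30
print(total_gib)  # 48.0 GiB -- more than the 32 GiB the container has
```

If those dimensions hold, a full 128K KV cache alone needs about 48 GiB before counting the model weights, which would explain why `--max-model-len` has to be lowered to fit in 32 GiB.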