Using VLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance) #5165
Replies: 11 comments
-
You should use `half` or `float16` dtype, since the T4 doesn't support `bfloat16`.
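For example, a minimal sketch of what that looks like with the offline API (the model name here is an assumption, since the original post only says "Mistral models"):

```python
from vllm import LLM

# The T4 (compute capability 7.5) has no bfloat16 support, so request fp16 explicitly.
# Model name is an assumption for illustration.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="half",  # "float16" works the same way
)
```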
-
I get an OOM error using either `half` or `float16`.
-
@paulovasconcellos-hotmart You could lower the `max_model_len` parameter until the error message is gone.
-
Also, I noticed you set `gpu_memory_utilization=0.5`, which is too small (it leaves almost 8 GiB unused). You can also try increasing that.
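To illustrate both suggestions together, a hedged sketch of how those arguments are passed; the concrete values and the model name are assumptions, not known-good T4 settings:

```python
from vllm import LLM

# Sketch only: the values below are illustrative assumptions, not tested on a T4.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="half",
    max_model_len=4096,          # lower this until the KV cache fits
    gpu_memory_utilization=0.9,  # raise from 0.5 so vLLM can use more of the 16 GiB
)
```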
-
Hey @esmeetu, I tried to run the following code:
And I received the following error:
-
@paulovasconcellos-hotmart Add the `quantization` parameter in your code.
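A sketch of what adding it could look like, assuming an AWQ-quantized checkpoint (the model name is hypothetical, and the `quantization` value must match how the weights were actually quantized):

```python
from vllm import LLM

# Assumption: an AWQ checkpoint; quantization must match the checkpoint's format.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # hypothetical AWQ checkpoint
    dtype="half",
    quantization="awq",
)
```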
-
I ran with the `quantization` parameter and it worked. Do you think I can increase `max_model_len`?
-
@paulovasconcellos-hotmart Of course. You can increase that parameter gradually until you get an OOM error; then you'll know how much model length your T4 can support.
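One way to follow that advice is a simple manual search over candidate lengths. A hedged sketch, assuming the same hypothetical AWQ checkpoint as above; note that a failed initialization may not release GPU memory cleanly, so in practice it is safer to restart the process between attempts:

```python
from vllm import LLM

best = None
for candidate_len in (2048, 4096, 8192, 16384):
    try:
        # Hypothetical checkpoint and values, for illustration only.
        llm = LLM(
            model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
            dtype="half",
            quantization="awq",
            max_model_len=candidate_len,
        )
        best = candidate_len
        del llm  # may not fully free GPU memory; restarting the process is more reliable
    except (RuntimeError, ValueError) as exc:  # OOM typically surfaces as one of these
        print(f"max_model_len={candidate_len} does not fit: {exc}")
        break

print(f"Largest context length that fit: {best}")
```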
-
I'm trying to do nearly the same thing and always get CUDA memory errors. I run `python -m vllm.entrypoints.openai.api_server --model TheBloke/dolphin-2.1-mistral-7B-AWQ --tensor-parallel-size 1 --dtype half --gpu-memory-utilization .95`; running from a git installation or as Docker gives the same result.
The machine has 4 x 3070 8GB, Ubuntu 20.04. I also tried with and without `--tensor-parallel-size` 1/2, to no avail.
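For reference, the Python-API equivalent of that launch, with the `max_model_len` suggestion from earlier in the thread applied, would look roughly like the sketch below; the `max_model_len` value and `tensor_parallel_size=2` are assumptions, not settings reported to work on the 8 GB cards:

```python
from vllm import LLM

# Hedged sketch mirroring the api_server flags above, plus an explicit max_model_len.
# The numeric values are assumptions, not tested settings for 8 GB GPUs.
llm = LLM(
    model="TheBloke/dolphin-2.1-mistral-7B-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
)
```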
-
I'm having an issue running the OpenAI-compatible API server on Colab (ngrok-tunneled to a public URL).
-
(screenshot not shown) Use this; this works for me.
-
Hi everyone. I'm trying to use vLLM with a T4, but I'm facing some problems. I'm trying to run Mistral models using vllm 0.2.1. With the following code, I receive a `ValueError`:
`ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5.`
If I use another dtype or remove the `quantization` parameter, I get an OOM error.
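Putting the thread's suggestions together (fp16 dtype, AWQ quantization, a reduced `max_model_len`, and a higher `gpu_memory_utilization`), a hedged sketch of a T4-oriented configuration; the model name and numeric values are assumptions rather than settings confirmed in the thread:

```python
from vllm import LLM, SamplingParams

# Hedged sketch combining the thread's suggestions for a 16 GiB T4.
# Model name and values are assumptions, not confirmed settings.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # hypothetical AWQ checkpoint
    quantization="awq",
    dtype="half",                # T4 has no bfloat16 support
    max_model_len=4096,          # lower if OOM persists, raise gradually if it fits
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```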