Using VLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance) #5165
Replies: 11 comments
-
You should use `half` or `float16` dtype, since the T4 doesn't support `bfloat16`.
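For example, a minimal sketch of what that looks like with the offline API (the model name here is an assumption, since the original post only says "Mistral models"):

```python
from vllm import LLM

# The T4 (compute capability 7.5) has no bfloat16 support, so request fp16 explicitly.
# Model name is an assumption for illustration.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="half",  # "float16" works the same way
)
```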
-
I get an OOM error using either `half` or `float16`.
-
@paulovasconcellos-hotmart You could lower the `max_model_len` parameter until the error message is gone.
-
Also, I noticed you set `gpu_memory_utilization=0.5`, which is too small (it leaves almost 8 GiB unused). You can also try increasing that.
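To illustrate both suggestions together, a hedged sketch of how those arguments are passed; the concrete values and the model name are assumptions, not known-good T4 settings:

```python
from vllm import LLM

# Sketch only: the values below are illustrative assumptions, not tested on a T4.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="half",
    max_model_len=4096,          # lower this until the KV cache fits
    gpu_memory_utilization=0.9,  # raise from 0.5 so vLLM can use more of the 16 GiB
)
```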
-
Hey @esmeetu, I tried to run the following code:
And I received the following error:
-
@paulovasconcellos-hotmart Add the `quantization` parameter in your code.
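A sketch of what adding it could look like, assuming an AWQ-quantized checkpoint (the model name is hypothetical, and the `quantization` value must match how the weights were actually quantized):

```python
from vllm import LLM

# Assumption: an AWQ checkpoint; quantization must match the checkpoint's format.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # hypothetical AWQ checkpoint
    dtype="half",
    quantization="awq",
)
```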
-
I ran with the `quantization` parameter and it worked. Do you think I can increase `max_model_len`?
-
@paulovasconcellos-hotmart Of course. You can increase that parameter gradually until you get an OOM error; then you'll know how much model length your T4 can support.
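One way to follow that advice is a simple manual search over candidate lengths. A hedged sketch, assuming the same hypothetical AWQ checkpoint as above; note that a failed initialization may not release GPU memory cleanly, so in practice it is safer to restart the process between attempts:

```python
from vllm import LLM

best = None
for candidate_len in (2048, 4096, 8192, 16384):
    try:
        # Hypothetical checkpoint and values, for illustration only.
        llm = LLM(
            model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
            dtype="half",
            quantization="awq",
            max_model_len=candidate_len,
        )
        best = candidate_len
        del llm  # may not fully free GPU memory; restarting the process is more reliable
    except (RuntimeError, ValueError) as exc:  # OOM typically surfaces as one of these
        print(f"max_model_len={candidate_len} does not fit: {exc}")
        break

print(f"Largest context length that fit: {best}")
```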
-
I'm trying to do nearly the same thing and always get CUDA memory errors. I run `python -m vllm.entrypoints.openai.api_server --model TheBloke/dolphin-2.1-mistral-7B-AWQ --tensor-parallel-size 1 --dtype half --gpu-memory-utilization .95`; running from a git installation or as Docker gives the same result.
The machine has 4 x 3070 8GB, Ubuntu 20.04. I also tried with and without `--tensor-parallel-size` 1/2, to no avail.
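For reference, the Python-API equivalent of that launch, with the `max_model_len` suggestion from earlier in the thread applied, would look roughly like the sketch below; the `max_model_len` value and `tensor_parallel_size=2` are assumptions, not settings reported to work on the 8 GB cards:

```python
from vllm import LLM

# Hedged sketch mirroring the api_server flags above, plus an explicit max_model_len.
# The numeric values are assumptions, not tested settings for 8 GB GPUs.
llm = LLM(
    model="TheBloke/dolphin-2.1-mistral-7B-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
)
```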
-
I'm having an issue running the OpenAI-compatible API server on Colab (ngrok-tunneled to a public URL).
-
(screenshot not shown) Use this; this works for me.
-
Hi everyone. I'm trying to use vLLM with a T4, but I'm facing some problems. I'm trying to run Mistral models using vllm 0.2.1. With the following code, I receive a `ValueError`:
`ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5.`
If I use another dtype or remove the `quantization` parameter, I get an OOM error.
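Putting the thread's suggestions together (fp16 dtype, AWQ quantization, a reduced `max_model_len`, and a higher `gpu_memory_utilization`), a hedged sketch of a T4-oriented configuration; the model name and numeric values are assumptions rather than settings confirmed in the thread:

```python
from vllm import LLM, SamplingParams

# Hedged sketch combining the thread's suggestions for a 16 GiB T4.
# Model name and values are assumptions, not confirmed settings.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # hypothetical AWQ checkpoint
    quantization="awq",
    dtype="half",                # T4 has no bfloat16 support
    max_model_len=4096,          # lower if OOM persists, raise gradually if it fits
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```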