I am curious how to split a large language model (LLM) into smaller pieces and dispatch it across multiple GPUs using the vLLM library.
For example, in the transformers library, passing `device_map="auto"` to `AutoModelForCausalLM.from_pretrained` allows the LLM to be split and loaded across multiple GPUs, like this:
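Roughly the following sketch (the model name here is only an example, not my exact setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-70B-Instruct"  # example checkpoint

# device_map="auto" lets accelerate shard the model's layers across all
# visible GPUs (spilling to CPU/disk if they don't fit).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```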
Does vLLM have a similar feature? What parameters should I add to the following code to dispatch the LLM across GPUs?

When I use `model = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16", tensor_parallel_size=8)`, i.e. with `tensor_parallel_size=8` added, I can see logs like `(VllmWorkerProcess pid=364103) INFO 07-12 22:51:00 model_runner.py:255] Loading model weights took 4.9631 GB`. However, after a few seconds each GPU ends up using almost all of its memory (46974 / 49140 MB).
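For completeness, here is roughly the full script I am running (the prompt and sampling settings below are placeholders, not my real workload):

```python
from vllm import LLM, SamplingParams

# Load the quantized 70B model with 8-way tensor parallelism, so that
# each GPU should only need to hold roughly 1/8 of the weights.
model = LLM(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16",
    tensor_parallel_size=8,
)

# Placeholder prompt and sampling settings, just to exercise the model.
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = model.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```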
Thank you for your help!