Distributed configuration for Llama 3.1 70B FP8 on a server with 4 H100 GPUs #7980
wakusoftware announced in General
Replies: 1 comment
- I am facing a very similar problem with an 8-GPU node and smaller 8B models. I second the OP: what is the best solution for optimizing inference throughput in a single-node, multi-GPU scenario, provided that the model fits on a single GPU?
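For what it's worth, when the model already fits on one GPU, a common way to scale throughput on a single multi-GPU node is to run one independent vLLM replica per GPU and spread requests across them, rather than sharding the model. A minimal sketch, assuming an 8-GPU node; the model ID, ports, and memory setting are placeholders, not a recommendation from this thread:

```bash
# One vLLM replica per GPU, each pinned to its own device and port.
# meta-llama/Llama-3.1-8B-Instruct and ports 8000-8007 are placeholders.
MODEL=meta-llama/Llama-3.1-8B-Instruct

for GPU in 0 1 2 3 4 5 6 7; do
  CUDA_VISIBLE_DEVICES=$GPU vllm serve "$MODEL" \
    --port $((8000 + GPU)) \
    --gpu-memory-utilization 0.90 &
done
wait
```

Requests can then be balanced across the eight ports with any HTTP load balancer (nginx, HAProxy) or simple client-side round-robin.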
- Hi. We are getting a server with 4 H100s and want to serve Llama 3.1 70B FP8, which means the model (theoretically) fits on a single GPU. Of course we want to optimize our resources, so what would be the best configuration for vllm serve? Would --tensor-parallel-size 4 load the model 4 times?
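For context, --tensor-parallel-size 4 does not load four copies of the model: tensor parallelism shards a single copy of the weights across the four GPUs, and each request is computed cooperatively by all of them. A hedged sketch of such an invocation; the FP8 checkpoint name and tuning flags are illustrative placeholders rather than a configuration confirmed in this thread:

```bash
# Tensor parallelism: one copy of the weights, sharded across 4 GPUs.
# The model ID and flag values below are illustrative placeholders.
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

The alternative of four single-GPU replicas may be tight here: FP8 70B weights alone are on the order of 70 GB, leaving comparatively little of an 80 GB H100 for KV cache, whereas sharding across 2 or 4 GPUs frees more memory per replica for context.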