Distributed configuration for Llama 3.1 70B FP8 on a server with 4 H100 GPUs #7980
wakusoftware announced in General
Replies: 1 comment
- I am facing a very similar problem with an 8-GPU node and smaller 8B models. I second the OP: what is the best solution for optimizing inference throughput in a single-node, multi-GPU scenario, provided that the model fits on a single GPU?
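For what it's worth, when the model already fits on one GPU, a common way to scale throughput on a single multi-GPU node is to run one independent vLLM replica per GPU and spread requests across them, rather than sharding the model. A minimal sketch, assuming an 8-GPU node; the model ID, ports, and memory setting are placeholders, not a recommendation from this thread:

```bash
# One vLLM replica per GPU, each pinned to its own device and port.
# meta-llama/Llama-3.1-8B-Instruct and ports 8000-8007 are placeholders.
MODEL=meta-llama/Llama-3.1-8B-Instruct

for GPU in 0 1 2 3 4 5 6 7; do
  CUDA_VISIBLE_DEVICES=$GPU vllm serve "$MODEL" \
    --port $((8000 + GPU)) \
    --gpu-memory-utilization 0.90 &
done
wait
```

Requests can then be balanced across the eight ports with any HTTP load balancer (nginx, HAProxy) or simple client-side round-robin.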
- Hi. We are getting a server with 4 H100s and want to serve Llama 3.1 70B FP8, which means the model (theoretically) fits on a single GPU. Of course we want to optimize our resources, so what would be the best configuration for vllm serve? Would --tensor-parallel-size 4 load the model 4 times?
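For context, --tensor-parallel-size 4 does not load four copies of the model: tensor parallelism shards a single copy of the weights across the four GPUs, and each request is computed cooperatively by all of them. A hedged sketch of such an invocation; the FP8 checkpoint name and tuning flags are illustrative placeholders rather than a configuration confirmed in this thread:

```bash
# Tensor parallelism: one copy of the weights, sharded across 4 GPUs.
# The model ID and flag values below are illustrative placeholders.
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

The alternative of four single-GPU replicas may be tight here: FP8 70B weights alone are on the order of 70 GB, leaving comparatively little of an 80 GB H100 for KV cache, whereas sharding across 2 or 4 GPUs frees more memory per replica for context.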