-
See the perf results in #2192. In certain cases, the custom implementation drastically outperforms NCCL's. vLLM still uses NCCL in the majority of cases.
-
Consider a setup where 4 GPUs are connected by PCIe, but each pair of GPUs is connected by NVLink (112 GB/s bidirectional). Is there a way to specify a reduction within each NVLink-connected pair of GPUs first, before reducing across the slower PCIe link?
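To make the idea concrete, here is a rough sketch of the two-level reduction described above, simulated in plain Python on lists. It is not vLLM or NCCL code; the function name `hierarchical_all_reduce` and the `pairs` argument are hypothetical and exist only for illustration.

```python
from typing import List, Tuple


def hierarchical_all_reduce(
    bufs: List[List[float]], pairs: List[Tuple[int, int]]
) -> List[List[float]]:
    """Simulate a two-level sum all-reduce; bufs[i] is rank i's vector.

    `pairs` lists the ranks assumed to share a fast link (e.g. NVLink).
    """
    # Step 1: reduce inside each pair onto the pair's first rank (the leader).
    partial = {}
    for a, b in pairs:
        partial[a] = [x + y for x, y in zip(bufs[a], bufs[b])]

    # Step 2: combine the leaders' partial sums (the slower cross-pair hop).
    total = [sum(vals) for vals in zip(*partial.values())]

    # Step 3: every rank ends up with its own copy of the full sum.
    return [list(total) for _ in bufs]


bufs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = hierarchical_all_reduce(bufs, pairs=[(0, 1), (2, 3)])
print(result[0])  # [16.0, 20.0]
```

With this scheme only one partial sum per pair crosses the slow link, rather than one buffer per GPU.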
-
Hello,
I was hoping someone could help shed some light on this: why does vLLM choose to use a custom all-reduce method? Is there a benefit to doing this over just using the NCCL APIs?