[PERF] Symmetric memory allreduce #20759
Conversation
Signed-off-by: ilmarkov <imarkov@redhat.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @ilmarkov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates a new, performance-optimized allreduce implementation into vLLM, utilizing PyTorch's symmetric memory features for CUDA. The primary goal is to enhance the efficiency of tensor parallel allreduce operations, particularly for mid-sized inputs, by providing a faster alternative that can be conditionally enabled via an environment variable. The changes also involve adapting existing allreduce strategies to work synergistically with this new method.
Highlights
- New Allreduce Implementation: Introduced a `SymmMemCommunicator` class that leverages PyTorch's symmetric memory capabilities for efficient allreduce operations on CUDA devices, providing an alternative to the existing `CustomAllreduce` and NCCL methods (a sketch of the underlying buffer setup follows this list).
- Conditional Activation and Strategy Selection: The new symmetric memory allreduce can be enabled via the `VLLM_ALLREDUCE_USE_SYMM_MEM` environment variable. The implementation selects between 'two-shot' (for world sizes 2 and 4) and 'multimem' (for world sizes 6 and 8) allreduce strategies based on the world size, and is prioritized for suitable input tensor sizes.
- Performance Optimization and Benchmarking: This change aims to improve the performance of tensor parallel allreduce for middle-sized inputs, with benchmarks showing a 7-10% improvement in Mean TTFT and up to a 5-7% improvement in Mean TPOT for Llama-3.1-70B-Instruct with TP=4 and TP=8.
- Adaptive Custom Allreduce Bounds: The maximum supported input sizes for the existing `CustomAllreduce` implementation are adjusted when symmetric memory is enabled, ensuring that the most performant allreduce method is used for each input size range.
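To make the buffer setup concrete, here is a minimal, hypothetical sketch using PyTorch's experimental `torch.distributed._symmetric_memory` module. The function name, sizing, and group handling are illustrative assumptions rather than the PR's verbatim code, and the API is experimental and may change between PyTorch releases.

```python
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

def init_symm_mem_buffer(max_size_bytes: int, device: torch.device,
                         group: dist.ProcessGroup) -> torch.Tensor:
    """Allocate a symmetric-memory buffer and rendezvous with peers.

    Every rank in the group must allocate an identically sized buffer;
    the rendezvous exchanges handles so each rank can address its
    peers' buffers directly.
    """
    dtype = torch.bfloat16  # the PR hardcodes bfloat16 for the buffer
    buffer = symm_mem.empty(max_size_bytes // dtype.itemsize,
                            dtype=dtype, device=device)
    # Collective call: must be executed by every rank in the group.
    symm_mem.rendezvous(buffer, group.group_name)
    return buffer
```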
Signed-off-by: ilmarkov <imarkov@redhat.com>
Code Review
This pull request introduces a new all-reduce implementation using PyTorch's symmetric memory, improving performance for medium-sized tensors. The code is well-structured, but I suggest increasing the flexibility of `SymmMemCommunicator` by allowing it to work with different dtypes and improving the accuracy of comments in `custom_all_reduce.py`.
```python
elif isinstance(device, str):
    device = torch.device(device)
torch.cuda.set_device(device)
self.dtype = torch.bfloat16
```
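As a rough illustration of the dtype-flexibility suggestion from the review summary above, a parameterized constructor might look like the following sketch; the accepted dtype set and class name are hypothetical, not part of the PR.

```python
import torch

# Hypothetical set of dtypes the symmetric-memory kernels might accept.
_SYMM_MEM_DTYPES = (torch.bfloat16, torch.float16, torch.float32)

class SymmMemCommunicatorSketch:
    """Illustrative only: take dtype as a parameter instead of
    hardcoding bfloat16."""

    def __init__(self, device: torch.device,
                 dtype: torch.dtype = torch.bfloat16) -> None:
        if dtype not in _SYMM_MEM_DTYPES:
            raise ValueError(
                f"Unsupported dtype for symmetric memory: {dtype}")
        torch.cuda.set_device(device)
        self.device = device
        self.dtype = dtype
```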
```python
if symm_mem_comm is not None and not symm_mem_comm.disabled and \
        symm_mem_comm.should_use_symm_mem(input_):
```
The check `not symm_mem_comm.disabled` is redundant because `should_use_symm_mem` already performs this check. Removing the redundant check makes the code more concise.
Suggested change:

```python
if symm_mem_comm is not None and \
        symm_mem_comm.should_use_symm_mem(input_):
```
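For context, a plausible shape of `should_use_symm_mem`, consistent with the note above that it already consults the disabled flag, might look like this sketch; the eligibility conditions beyond the `disabled` check are assumptions, not the PR's verbatim code.

```python
import torch

def should_use_symm_mem(self, inp: torch.Tensor) -> bool:
    # The disabled flag is checked here, which is why repeating
    # `not symm_mem_comm.disabled` at the call site is redundant.
    if self.disabled:
        return False
    # Assumed conditions: matching dtype and a size below the
    # empirically chosen per-world-size cap.
    if inp.dtype != self.dtype:
        return False
    inp_size = inp.numel() * inp.element_size()
    return inp_size < self.max_size
```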
```python
# Max sizes for each world size in case symmetric memory is available
_MAX_SIZES = {
    2: 2 * MiB,  # 1 MB
    4: 2 * MiB,  # 1 MB
    6: MiB,  # 512 KB
    8: MiB // 2,  # 512 KB
}
```
The comments for the sizes in the `_MAX_SIZES` dictionary are inaccurate. Update the comments to reflect the correct sizes.
Suggested change:

```python
# Max sizes for each world size in case symmetric memory is available
_MAX_SIZES = {
    2: 2 * MiB,  # 2 MiB
    4: 2 * MiB,  # 2 MiB
    6: MiB,  # 1 MiB
    8: MiB // 2,  # 512 KiB
}
```
Signed-off-by: ilmarkov <markovilya197@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Add an alternative to `custom_allreduce` and NCCL on CUDA: PyTorch symmetric memory.
Enabled by the environment variable `VLLM_ALLREDUCE_USE_SYMM_MEM=1`. Improves performance of TP allreduce for middle-sized inputs.
Bounds the input sizes for custom allreduce, since the performance of the two-shot custom allreduce appears to be worse than NCCL or the PyTorch symmetric-memory-based allreduce beyond those sizes. The max sizes for the various world sizes, for both custom allreduce and symmetric memory, were chosen based on empirical results.
For world sizes 2 and 4 the PyTorch two-shot allreduce is used; for world sizes 6 and 8, PyTorch `multimem_all_reduce_` is used, as sketched below.
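A minimal sketch of that world-size dispatch, assuming the experimental `torch.ops.symm_mem` allreduce ops and a communicator holding a pre-rendezvoused `self.buffer`; the attribute names and copy-in/copy-out details are illustrative.

```python
import torch

def all_reduce(self, inp: torch.Tensor) -> torch.Tensor:
    numel = inp.numel()
    # Stage the input in the symmetric-memory buffer visible to all peers.
    self.buffer[:numel].copy_(inp.view(-1))
    if self.world_size in (2, 4):
        # Two-shot allreduce: reduce-scatter across peer buffers,
        # then all-gather the partial results.
        torch.ops.symm_mem.two_shot_all_reduce_(
            self.buffer[:numel], "sum", self.group.group_name)
    else:
        # World sizes 6 and 8: multimem allreduce, which uses the GPU's
        # hardware multicast (multimem) instructions.
        torch.ops.symm_mem.multimem_all_reduce_(
            self.buffer[:numel], "sum", self.group.group_name)
    # Copy the reduced result back out of the shared buffer.
    return self.buffer[:numel].view_as(inp).clone()
```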
Benchmark results:
Server:

```bash
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --no-enable-prefix-caching -tp $tp
```

Client:
On Blackwell, B200
TP=4:
Baseline:
PR:
TP=8:
Baseline:
PR:
Up to 8% TTFT speedup for TP=4.
From 7% to 10% TTFT improvement, and up to 5-7% TPOT improvement, for TP=8.
Validation:

```bash
VLLM_ALLREDUCE_USE_SYMM_MEM=1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-70B-Instruct,tensor_parallel_size=4 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 100
```