[WIP][Hardware][CPU] testing branch for mlperf #20473
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
fix downgrade fix iomp fix numba
fix compile
fix sgl-kernel bias type sgl moe sgl int8 mm
This reverts commit 058eca2.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run full CI, PR reviewers can add the ready label to the PR. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Summary of Changes
Hello @bigPYJ1151, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the vLLM CPU backend by integrating highly optimized C++ kernels for core computational tasks like GEMM and MoE, leveraging advanced CPU instruction sets. It also refines inter-process communication through shared memory and introduces comprehensive support for `torch.compile` to boost performance. These changes aim to improve efficiency, expand quantization capabilities, and ensure compatibility with the latest vLLM architecture for CPU deployments.
Highlights
- CPU Backend Optimization (SGL Kernels): Introduced a new set of highly optimized C++ kernels, referred to as 'SGL kernels', for CPU operations. These kernels leverage Intel AVX512-BF16, AVX512-VNNI, and AMX instructions to accelerate General Matrix Multiplication (GEMM) and Mixture-of-Experts (MoE) computations, supporting BFloat16, FP8, and INT8 data types. This includes specialized implementations for weight packing, scaled matrix multiplication, and fused MoE operations.
- Shared Memory (SHM) Communication Refinement: Refactored the shared memory communication mechanism for inter-process communication on CPU. This includes reducing buffer sizes, implementing double buffering, and introducing a new stamp-based synchronization method for improved performance and robustness in distributed CPU environments (a generic sketch of this pattern appears after this list).
- torch.compile Integration and Optimization Levels: Added explicit support for `torch.compile` on the CPU backend, allowing users to enable different optimization levels (`-O0` to `-O3`) for kernel generation. This integration aims to reduce Python runtime overhead and generate more efficient kernels, with configurable options for the Inductor backend.
- Quantization Support Expansion: Enhanced FP8 and INT8 quantization capabilities across linear layers and MoE layers on CPU. This includes new kernel implementations for INT8 scaled matrix multiplication with fused quantization, and improved handling of FP8 weight packing and scaled operations.
- Tensor Parallelism Compatibility: Modified the model configuration to allow attention head and intermediate sizes to be rounded up for divisibility by the tensor parallel size. This change improves compatibility for models whose dimensions are not naturally divisible by the number of tensor parallel workers.
- V1 Architecture Adoption for CPU: Migrated the CPU attention backend to the V1 architecture, introducing new V1-specific attention metadata builders and CPU worker implementations. This aligns the CPU backend with the latest vLLM architecture, while disabling cascade attention for CPU.
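As a rough, generic sketch of the double-buffering-plus-stamp pattern mentioned in the SHM item above: none of this is the PR's actual shared-memory code. The struct name, buffer count, sizes, and simplified synchronization are invented for illustration, and it omits protection against a fast writer lapping a slow reader.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstring>

// Generic single-writer double-buffer channel with a stamp-based handshake
// (illustrative only; not taken from the PR).
struct StampedDoubleBuffer {
  static constexpr std::size_t kBufSize = 4096;
  std::array<std::array<char, kBufSize>, 2> bufs{};
  std::atomic<std::uint64_t> stamp{0};  // number of published writes

  // Writer: fill the buffer readers are not currently pointed at, then publish
  // it by advancing the stamp (release pairs with the reader's acquire).
  void publish(const char* data, std::size_t len) {
    const std::uint64_t s = stamp.load(std::memory_order_relaxed);
    std::memcpy(bufs[s % 2].data(), data, len);
    stamp.store(s + 1, std::memory_order_release);
  }

  // Reader: spin until a stamp newer than `last_seen` appears, then read the
  // buffer that stamp refers to. Real code would bound the spin and guard
  // against the writer reusing a buffer that is still being read.
  const char* acquire(std::uint64_t& last_seen) const {
    std::uint64_t s;
    while ((s = stamp.load(std::memory_order_acquire)) == last_seen) { /* spin */ }
    last_seen = s;
    return bufs[(s - 1) % 2].data();
  }
};
```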
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces significant performance optimizations for the CPU backend, likely for MLPerf testing. It adds a substantial number of new C++ kernels for AVX512, including GEMM and MoE implementations for various data types like FP8 and INT8. Additionally, it refactors the shared memory communication logic for improved performance.
Overall, the changes are extensive and aim for high performance. I've identified a critical bug in the weight packing logic for int8, which could lead to incorrect results or crashes. I also found some inconsistencies in preprocessor macro definitions across different files that should be unified to ensure correctness and maintainability. There's also a minor issue with an undefined macro that could lead to unexpected behavior. Addressing these points will improve the quality and robustness of these new features.
```cpp
template <>
inline void pack_vnni<int8_t>(int8_t* __restrict__ packed, const int8_t* __restrict__ weight, int N, int K) {
  constexpr int BLOCK_N = block_size_n();
  TORCH_CHECK(N == BLOCK_N);
```
The assertion `TORCH_CHECK(N == BLOCK_N)` is too restrictive and will cause runtime failures. The caller, `convert_weight_packed`, processes weights in chunks, and the last chunk for a given dimension may be smaller than `BLOCK_N` if the dimension is not perfectly divisible by `BLOCK_N`.

Furthermore, there are other issues in `pack_vnni<int8_t>`:

- The indexing `packed[k * N * VNNI_BLK + n * VNNI_BLK + d]` uses `N` as the stride for the packed dimension. This is incorrect, as subsequent operations like `s8s8_compensation<BLOCK_N>` and the GEMM kernels expect a fixed stride of `BLOCK_N`.
- When `N < BLOCK_N`, the packed buffer is only partially filled and the remaining elements are uninitialized. `s8s8_compensation` will read this uninitialized memory, leading to incorrect compensation values. The unused part of the block should be zero-padded.

This check should be removed, and the logic should be corrected to handle partial blocks by using the correct stride and padding.
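A minimal sketch of what that fix could look like, assuming the source weight block is row-major `[N, K]` and that `block_size_n()`, `TORCH_CHECK`, and the `pack_vnni` primary template come from the surrounding kernel code. The weight-side indexing and the `VNNI_BLK` value are assumptions for illustration; this shows the fixed-stride-plus-zero-padding idea, not the PR's implementation.

```cpp
#include <cstring>  // std::memset

// Illustrative only: pack a possibly-partial [N, K] int8 block into VNNI layout,
// always using BLOCK_N as the packed stride and zero-padding unused columns.
template <>
inline void pack_vnni<int8_t>(int8_t* __restrict__ packed, const int8_t* __restrict__ weight, int N, int K) {
  constexpr int BLOCK_N = block_size_n();
  constexpr int VNNI_BLK = 4;                      // int8 VNNI groups 4 K-elements (assumed)
  TORCH_CHECK(N <= BLOCK_N && K % VNNI_BLK == 0);  // allow a partial tail block in N

  // Zero the whole data region so compensation never reads uninitialized bytes.
  std::memset(packed, 0, static_cast<std::size_t>(K) * BLOCK_N);

  for (int k = 0; k < K / VNNI_BLK; ++k) {
    for (int n = 0; n < N; ++n) {
      for (int d = 0; d < VNNI_BLK; ++d) {
        // Fixed BLOCK_N stride (not N), matching what the GEMM kernels expect.
        packed[k * BLOCK_N * VNNI_BLK + n * VNNI_BLK + d] = weight[n * K + k * VNNI_BLK + d];
      }
    }
  }
  // s8s8_compensation<BLOCK_N> (wherever it is invoked) now sees only
  // initialized, zero-padded data for the tail block.
}
```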
"csrc/cpu/sgl-kernels/moe_int8.cpp" | ||
"csrc/cpu/sgl-kernels/moe_fp8.cpp" | ||
${VLLM_EXT_SRC}) | ||
add_compile_definitions(-DCPU_CAPABILITY_AVX512) |
There are inconsistent preprocessor definitions for `CPU_CAPABILITY_AVX512` across the codebase, which could lead to incorrect builds or runtime behavior.

- `cmake/cpu_extension.cmake:245`: Defines `CPU_CAPABILITY_AVX512` if `ENABLE_AVX512BF16 AND ENABLE_AVX512VNNI`.
- `csrc/cpu/sgl-kernels/vec.h:3-5`: Defines `CPU_CAPABILITY_AVX512` if `__AVX512F__ && __AVX512BF16__ && __AMX_BF16__`.
- `csrc/cpu/torch_bindings.cpp:242`: Guards new op registrations with `#if defined (__AVX512BF16__) && defined (__AVX512F__) && defined (__AVX512VNNI__)`.

These conditions are different. For example, the `vec.h` definition depends on `__AMX_BF16__`, which is not checked in CMake, while the CMake logic depends on AVX512VNNI, which is not in the `vec.h` condition.

To ensure correctness and maintainability, these definitions should be unified. I recommend using a single source of truth for this capability check, probably in this CMake file, and then using that definition throughout the C++ code. The definition in `vec.h` should probably be guarded with `#ifndef CPU_CAPABILITY_AVX512` to avoid redefinition warnings and use the CMake-provided definition.
```cpp
inline __m512bh CVT_FP8_TO_BF16(__m256i a) {
#ifdef SGLANG_CPU_FP8_CVT_FTZ
  return cvt_e4m3_bf16_intrinsic_no_nan(a);
#else
  return cvt_e4m3_bf16_intrinsic_with_denorm(a);
#endif
}
```
The macro `SGLANG_CPU_FP8_CVT_FTZ` is used here to select an FP8-to-BF16 conversion intrinsic, but it is not defined anywhere in this pull request. This will cause the `else` branch with `cvt_e4m3_bf16_intrinsic_with_denorm` to always be taken.

If this is the intended default behavior, it would be clearer to state it explicitly with a comment. If different conversion behaviors are desired based on build configuration, please ensure `SGLANG_CPU_FP8_CVT_FTZ` is defined appropriately during the build process. Adding a comment explaining the purpose of this macro and how to set it would also be beneficial for future maintenance.
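For instance, a documenting comment along these lines would make the default explicit. The wording is illustrative and not from the PR; only the macro, function, and intrinsic names are taken from the quoted code, and the flush-to-zero reading of "FTZ" is an assumption based on the macro name.

```cpp
// SGLANG_CPU_FP8_CVT_FTZ (FTZ presumably = flush-to-zero) selects the FP8 (e4m3)
// -> BF16 conversion path:
//   defined      -> cvt_e4m3_bf16_intrinsic_no_nan
//   not defined  -> cvt_e4m3_bf16_intrinsic_with_denorm (the effective default,
//                   since the macro is never defined in this PR)
// To opt into the other path, supply the macro as a build-time compile definition;
// the exact mechanism depends on the project's build configuration.
inline __m512bh CVT_FP8_TO_BF16(__m256i a) {
#ifdef SGLANG_CPU_FP8_CVT_FTZ
  return cvt_e4m3_bf16_intrinsic_no_nan(a);
#else
  return cvt_e4m3_bf16_intrinsic_with_denorm(a);
#endif
}
```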
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update