[WIP][Hardware][CPU] testing branch for mlperf #20473
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
fix downgrade fix iomp fix numba
fix compile
fix sgl-kernel bias type sgl moe sgl int8 mm
This reverts commit 058eca2.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run full CI, PR reviewers can add the ready label to the PR. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Summary of Changes
Hello @bigPYJ1151, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the vLLM CPU backend by integrating highly optimized C++ kernels for core computational tasks like GEMM and MoE, leveraging advanced CPU instruction sets. It also refines inter-process communication through shared memory and introduces comprehensive support for `torch.compile` to boost performance. These changes aim to improve efficiency, expand quantization capabilities, and ensure compatibility with the latest vLLM architecture for CPU deployments.
Highlights
- CPU Backend Optimization (SGL Kernels): Introduced a new set of highly optimized C++ kernels, referred to as 'SGL kernels', for CPU operations. These kernels leverage Intel AVX512-BF16, AVX512-VNNI, and AMX instructions to accelerate General Matrix Multiplication (GEMM) and Mixture-of-Experts (MoE) computations, supporting BFloat16, FP8, and INT8 data types. This includes specialized implementations for weight packing, scaled matrix multiplication, and fused MoE operations.
- Shared Memory (SHM) Communication Refinement: Refactored the shared memory communication mechanism for inter-process communication on CPU. This includes reducing buffer sizes, implementing double buffering, and introducing a new stamp-based synchronization method for improved performance and robustness in distributed CPU environments (a generic sketch of this pattern appears after this list).
- torch.compile Integration and Optimization Levels: Added explicit support for `torch.compile` on the CPU backend, allowing users to enable different optimization levels (`-O0` to `-O3`) for kernel generation. This integration aims to reduce Python runtime overhead and generate more efficient kernels, with configurable options for the Inductor backend.
- Quantization Support Expansion: Enhanced FP8 and INT8 quantization capabilities across linear layers and MoE layers on CPU. This includes new kernel implementations for INT8 scaled matrix multiplication with fused quantization, and improved handling of FP8 weight packing and scaled operations.
- Tensor Parallelism Compatibility: Modified the model configuration to allow attention head and intermediate sizes to be rounded up for divisibility by the tensor parallel size. This change improves compatibility for models whose dimensions are not naturally divisible by the number of tensor parallel workers.
- V1 Architecture Adoption for CPU: Migrated the CPU attention backend to the V1 architecture, introducing new V1-specific attention metadata builders and CPU worker implementations. This aligns the CPU backend with the latest vLLM architecture, while disabling cascade attention for CPU.
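As a rough, generic sketch of the double-buffering-plus-stamp pattern mentioned in the SHM item above: none of this is the PR's actual shared-memory code. The struct name, buffer count, sizes, and simplified synchronization are invented for illustration, and it omits protection against a fast writer lapping a slow reader.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstring>

// Generic single-writer double-buffer channel with a stamp-based handshake
// (illustrative only; not taken from the PR).
struct StampedDoubleBuffer {
  static constexpr std::size_t kBufSize = 4096;
  std::array<std::array<char, kBufSize>, 2> bufs{};
  std::atomic<std::uint64_t> stamp{0};  // number of published writes

  // Writer: fill the buffer readers are not currently pointed at, then publish
  // it by advancing the stamp (release pairs with the reader's acquire).
  void publish(const char* data, std::size_t len) {
    const std::uint64_t s = stamp.load(std::memory_order_relaxed);
    std::memcpy(bufs[s % 2].data(), data, len);
    stamp.store(s + 1, std::memory_order_release);
  }

  // Reader: spin until a stamp newer than `last_seen` appears, then read the
  // buffer that stamp refers to. Real code would bound the spin and guard
  // against the writer reusing a buffer that is still being read.
  const char* acquire(std::uint64_t& last_seen) const {
    std::uint64_t s;
    while ((s = stamp.load(std::memory_order_acquire)) == last_seen) { /* spin */ }
    last_seen = s;
    return bufs[(s - 1) % 2].data();
  }
};
```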
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces significant performance optimizations for the CPU backend, likely for MLPerf testing. It adds a substantial number of new C++ kernels for AVX512, including GEMM and MoE implementations for various data types like FP8 and INT8. Additionally, it refactors the shared memory communication logic for improved performance.
Overall, the changes are extensive and aim for high performance. I've identified a critical bug in the weight packing logic for int8, which could lead to incorrect results or crashes. I also found some inconsistencies in preprocessor macro definitions across different files that should be unified to ensure correctness and maintainability. There's also a minor issue with an undefined macro that could lead to unexpected behavior. Addressing these points will improve the quality and robustness of these new features.
```cpp
template <>
inline void pack_vnni<int8_t>(int8_t* __restrict__ packed, const int8_t* __restrict__ weight, int N, int K) {
  constexpr int BLOCK_N = block_size_n();
  TORCH_CHECK(N == BLOCK_N);
```
The assertion `TORCH_CHECK(N == BLOCK_N)` is too restrictive and will cause runtime failures. The caller, `convert_weight_packed`, processes weights in chunks, and the last chunk for a given dimension may be smaller than `BLOCK_N` if the dimension is not perfectly divisible by `BLOCK_N`.

Furthermore, there are other issues in `pack_vnni<int8_t>`:

- The indexing `packed[k * N * VNNI_BLK + n * VNNI_BLK + d]` uses `N` as the stride for the packed dimension. This is incorrect, as subsequent operations like `s8s8_compensation<BLOCK_N>` and the GEMM kernels expect a fixed stride of `BLOCK_N`.
- When `N < BLOCK_N`, the packed buffer is only partially filled and the remaining elements are uninitialized. `s8s8_compensation` will read this uninitialized memory, leading to incorrect compensation values. The unused part of the block should be zero-padded.

This check should be removed, and the logic should be corrected to handle partial blocks by using the correct stride and padding.
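A minimal sketch of what that fix could look like, assuming the source weight block is row-major `[N, K]` and that `block_size_n()`, `TORCH_CHECK`, and the `pack_vnni` primary template come from the surrounding kernel code. The weight-side indexing and the `VNNI_BLK` value are assumptions for illustration; this shows the fixed-stride-plus-zero-padding idea, not the PR's implementation.

```cpp
#include <cstring>  // std::memset

// Illustrative only: pack a possibly-partial [N, K] int8 block into VNNI layout,
// always using BLOCK_N as the packed stride and zero-padding unused columns.
template <>
inline void pack_vnni<int8_t>(int8_t* __restrict__ packed, const int8_t* __restrict__ weight, int N, int K) {
  constexpr int BLOCK_N = block_size_n();
  constexpr int VNNI_BLK = 4;                      // int8 VNNI groups 4 K-elements (assumed)
  TORCH_CHECK(N <= BLOCK_N && K % VNNI_BLK == 0);  // allow a partial tail block in N

  // Zero the whole data region so compensation never reads uninitialized bytes.
  std::memset(packed, 0, static_cast<std::size_t>(K) * BLOCK_N);

  for (int k = 0; k < K / VNNI_BLK; ++k) {
    for (int n = 0; n < N; ++n) {
      for (int d = 0; d < VNNI_BLK; ++d) {
        // Fixed BLOCK_N stride (not N), matching what the GEMM kernels expect.
        packed[k * BLOCK_N * VNNI_BLK + n * VNNI_BLK + d] = weight[n * K + k * VNNI_BLK + d];
      }
    }
  }
  // s8s8_compensation<BLOCK_N> (wherever it is invoked) now sees only
  // initialized, zero-padded data for the tail block.
}
```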
"csrc/cpu/sgl-kernels/moe_int8.cpp" | ||
"csrc/cpu/sgl-kernels/moe_fp8.cpp" | ||
${VLLM_EXT_SRC}) | ||
add_compile_definitions(-DCPU_CAPABILITY_AVX512) |
There are inconsistent preprocessor definitions for `CPU_CAPABILITY_AVX512` across the codebase, which could lead to incorrect builds or runtime behavior.

- `cmake/cpu_extension.cmake:245`: Defines `CPU_CAPABILITY_AVX512` if `ENABLE_AVX512BF16 AND ENABLE_AVX512VNNI`.
- `csrc/cpu/sgl-kernels/vec.h:3-5`: Defines `CPU_CAPABILITY_AVX512` if `__AVX512F__ && __AVX512BF16__ && __AMX_BF16__`.
- `csrc/cpu/torch_bindings.cpp:242`: Guards new op registrations with `#if defined (__AVX512BF16__) && defined (__AVX512F__) && defined (__AVX512VNNI__)`.

These conditions are different. For example, the `vec.h` definition depends on `__AMX_BF16__`, which is not checked in CMake, while the CMake logic depends on AVX512VNNI, which is not in the `vec.h` condition.

To ensure correctness and maintainability, these definitions should be unified. I recommend using a single source of truth for this capability check, probably in this CMake file, and then using that definition throughout the C++ code. The definition in `vec.h` should probably be guarded with `#ifndef CPU_CAPABILITY_AVX512` to avoid redefinition warnings and use the CMake-provided definition.
```cpp
inline __m512bh CVT_FP8_TO_BF16(__m256i a) {
#ifdef SGLANG_CPU_FP8_CVT_FTZ
  return cvt_e4m3_bf16_intrinsic_no_nan(a);
#else
  return cvt_e4m3_bf16_intrinsic_with_denorm(a);
#endif
}
```
The macro `SGLANG_CPU_FP8_CVT_FTZ` is used here to select an FP8-to-BF16 conversion intrinsic, but it is not defined anywhere in this pull request. This will cause the `else` branch with `cvt_e4m3_bf16_intrinsic_with_denorm` to always be taken.

If this is the intended default behavior, it would be clearer to state it explicitly with a comment. If different conversion behaviors are desired based on build configuration, please ensure `SGLANG_CPU_FP8_CVT_FTZ` is defined appropriately during the build process. Adding a comment explaining the purpose of this macro and how to set it would also be beneficial for future maintenance.
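For instance, a documenting comment along these lines would make the default explicit. The wording is illustrative and not from the PR; only the macro, function, and intrinsic names are taken from the quoted code, and the flush-to-zero reading of "FTZ" is an assumption based on the macro name.

```cpp
// SGLANG_CPU_FP8_CVT_FTZ (FTZ presumably = flush-to-zero) selects the FP8 (e4m3)
// -> BF16 conversion path:
//   defined      -> cvt_e4m3_bf16_intrinsic_no_nan
//   not defined  -> cvt_e4m3_bf16_intrinsic_with_denorm (the effective default,
//                   since the macro is never defined in this PR)
// To opt into the other path, supply the macro as a build-time compile definition;
// the exact mechanism depends on the project's build configuration.
inline __m512bh CVT_FP8_TO_BF16(__m256i a) {
#ifdef SGLANG_CPU_FP8_CVT_FTZ
  return cvt_e4m3_bf16_intrinsic_no_nan(a);
#else
  return cvt_e4m3_bf16_intrinsic_with_denorm(a);
#endif
}
```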
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update