
Fused moe tuning ep #20863


Draft: wants to merge 5 commits into main
20 changes: 20 additions & 0 deletions Dockerfile
@@ -0,0 +1,20 @@
ARG CUDA_VERSION=12.8.1
Contributor review comment (critical):

The specified CUDA version 12.8.1 does not appear to be a valid tag for the nvidia/cuda Docker image. This will likely cause the Docker build to fail. Please use a valid CUDA version from Docker Hub. For example, 12.5.1 is a recent, valid version.

ARG CUDA_VERSION=12.5.1

FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04

RUN apt update && apt install git -y && apt install curl -y
Contributor review comment (medium):

To optimize the Docker image size and improve build caching, it's best practice to combine apt-get update and apt-get install into a single RUN layer, install both packages in one apt-get install command, and clean up the apt lists in the same layer so the cache does not bloat the final image.

RUN apt-get update && apt-get install -y --no-install-recommends git curl && rm -rf /var/lib/apt/lists/*


WORKDIR /workspace
RUN git clone https://github.com/vllm-project/vllm.git

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Install vllm.
WORKDIR /workspace/vllm
RUN uv venv .vllm --python 3.12
RUN . .vllm/bin/activate && VLLM_USE_PRECOMPILED=1 uv pip install -e .

# Checkout a specific commit.
ENV VLLM_SHA=550f8a052cae03c7e14a46767f689ab09c1cc28d
RUN git fetch && git checkout ${VLLM_SHA}
Contributor review comment on lines +11 to +18 (critical):

The order of operations for installing vLLM is incorrect. The current Dockerfile installs vLLM from the default branch and then checks out the specific commit defined by VLLM_SHA. This means the installed version of vLLM is not the one specified by the SHA. The git checkout command must be executed before installing the package to ensure the correct version is built and installed.

# Checkout a specific commit.
WORKDIR /workspace/vllm
ENV VLLM_SHA=550f8a052cae03c7e14a46767f689ab09c1cc28d
RUN git fetch && git checkout ${VLLM_SHA}

# Install vllm.
RUN uv venv .vllm --python 3.12
RUN . .vllm/bin/activate && VLLM_USE_PRECOMPILED=1 uv pip install -e .


ENTRYPOINT ["/bin/bash"]
63 changes: 63 additions & 0 deletions benchmarks/kernels/Justfile
@@ -0,0 +1,63 @@
all:
just llama-scout-bf16 && \
just llama-scout-fp8 && \
just llama-maverick && \
just qwen-30b && \
just qwen-30b-fp8 && \
just qwen-235b && \
just deepseek-r1


llama-scout-bf16:
python3 benchmark_moe.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp-size 1 \
--ep-size 8 \
--tune

llama-scout-fp8:
python3 benchmark_moe.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp-size 1 \
--ep-size 8 \
--dtype fp8_w8a8 \
--tune

llama-maverick:
python3 benchmark_moe.py \
--model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--tp-size 1 \
--ep-size 8 \
--dtype fp8_w8a8 \
--tune

qwen-30b:
python3 benchmark_moe.py \
--model Qwen/Qwen3-30B-A3B \
--tp-size 1 \
--ep-size 8 \
--tune

qwen-30b-fp8:
python3 benchmark_moe.py \
--model Qwen/Qwen3-30B-A3B-FP8 \
--tp-size 1 \
--ep-size 8 \
--dtype fp8_w8a8 \
--tune

qwen-235b:
python3 benchmark_moe.py \
--model Qwen/Qwen3-235B-A22B \
--tp-size 1 \
--ep-size 8 \
--dtype fp8_w8a8 \
--tune

deepseek-r1:
python3 benchmark_moe.py \
--model deepseek-ai/DeepSeek-R1-0528 \
--tp-size 1 \
--ep-size 8 \
--dtype fp8_w8a8 \
--tune
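For orientation (not part of the PR): with --ep-size 8, benchmark_moe.py divides the model's routed expert count by the EP size and tunes the fused-MoE kernel for the per-rank expert count. A hedged sketch of the resulting local counts, assuming the models' published routed-expert counts:

# Illustrative only: per-rank expert counts under --ep-size 8, assuming the
# published routed-expert counts of the models tuned above.
ep_size = 8
routed_experts = {
    "meta-llama/Llama-4-Scout-17B-16E-Instruct": 16,
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct": 128,
    "Qwen/Qwen3-30B-A3B": 128,
    "Qwen/Qwen3-235B-A22B": 128,
    "deepseek-ai/DeepSeek-R1-0528": 256,
}
for model, num_experts in routed_experts.items():
    # Mirrors the divisibility check added to benchmark_moe.py below.
    assert num_experts % ep_size == 0
    print(f"{model}: {num_experts} experts -> {num_experts // ep_size} per EP rank")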
12 changes: 11 additions & 1 deletion benchmarks/kernels/benchmark_moe.py
@@ -595,6 +595,13 @@
intermediate_size = config.intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size

# Expert parallelism
if E % args.ep_size != 0:
raise ValueError(
f"Number of experts {E} must be divisible by expert parallel size {args.ep_size}"

Check failure on line 601 in benchmarks/kernels/benchmark_moe.py (GitHub Actions / pre-commit):
Ruff E501: benchmarks/kernels/benchmark_moe.py:601:89: Line too long (93 > 88)
)
E = E // args.ep_size

hidden_size = config.hidden_size
dtype = torch.float16 if current_platform.is_rocm() else config.torch_dtype
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
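Not part of the diff: the pre-commit failure above concerns only line length. A minimal sketch of how the new check could be wrapped to satisfy Ruff's 88-character limit while keeping the same behavior:

# Sketch only: same divisibility check, message split across two f-string
# fragments so the line stays under Ruff's 88-character limit (E501).
if E % args.ep_size != 0:
    raise ValueError(
        f"Number of experts {E} must be divisible by "
        f"expert parallel size {args.ep_size}"
    )
E = E // args.ep_size  # each EP rank tunes for its local share of the experts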
@@ -724,7 +731,10 @@
"--model", type=str, default="mistralai/Mixtral-8x7B-Instruct-v0.1"
)
parser.add_argument(
"--tp-size", "-tp", "--tensor-parallel-size", type=int, default=2
"--tp-size", "-tp", "--tensor-parallel-size", type=int, default=1
)
parser.add_argument(
"--ep-size", "-ep", "--expert-parallel-size", type=int, default=1
)
parser.add_argument(
"--dtype", type=str, choices=["auto", "fp8_w8a8", "int8_w8a16"], default="auto"
6 changes: 3 additions & 3 deletions tools/ep_kernels/install_python_libraries.sh
@@ -11,7 +11,7 @@ if [ ! -d "$WORKSPACE" ]; then
fi

# install dependencies if not installed
pip3 install cmake torch ninja
uv pip install cmake torch ninja

# build nvshmem
pushd $WORKSPACE
@@ -59,13 +59,13 @@ git clone https://github.com/ppl-ai/pplx-kernels
cd pplx-kernels
# see https://github.com/pypa/pip/issues/9955#issuecomment-838065925
# PIP_NO_BUILD_ISOLATION=0 disables build isolation
PIP_NO_BUILD_ISOLATION=0 TORCH_CUDA_ARCH_LIST=9.0a+PTX pip install -vvv -e .
PIP_NO_BUILD_ISOLATION=0 TORCH_CUDA_ARCH_LIST=9.0a+PTX uv pip install -vvv -e .
popd

# build and install deepep, require pytorch installed
pushd $WORKSPACE
git clone https://github.com/deepseek-ai/DeepEP
cd DeepEP
export NVSHMEM_DIR=$WORKSPACE/nvshmem_install
PIP_NO_BUILD_ISOLATION=0 pip install -vvv -e .
PIP_NO_BUILD_ISOLATION=0 uv pip install -vvv -e .
popd
5 changes: 2 additions & 3 deletions vllm/model_executor/layers/fused_moe/pplx_prepare_finalize.py
@@ -197,14 +197,13 @@ def prepare(
# This argument is optional, defaults to indices.size(0)
# There's not much point setting this unless it is != indices.size(0)
bound_m: Optional[torch.Tensor] = None

self.a2a.dispatch(
out_expert_num_tokens=expert_num_tokens,
out_expert_x=expert_x,
out_expert_x_scale=expert_x_scale,
dp_x=a1q,
dp_x_scale=a1q_scale,
indices=topk_ids,
indices=topk_ids.view(dtype=torch.uint32),
bound_m=bound_m,
)

@@ -249,7 +248,7 @@ def finalize(
topk_weights = torch.ones_like(topk_weights)

self.a2a.combine(out_tokens=output,
indices=topk_ids,
indices=topk_ids.view(dtype=torch.uint32),
weights=topk_weights,
expert_y=fused_expert_output,
bound_m=bound_m)
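For context (not part of the diff): torch.Tensor.view(dtype=...) reinterprets the same storage under a dtype of equal element size, so an int32 topk_ids buffer can be handed to the pplx all-to-all kernels as uint32 without a copy. A minimal illustration, assuming a PyTorch build (2.3 or newer) that exposes torch.uint32:

import torch

# Hypothetical router output: int32 expert indices for two tokens, top-2 routing.
topk_ids = torch.tensor([[0, 3], [7, 2]], dtype=torch.int32)

# Zero-copy reinterpretation: same memory and shape, different dtype.
as_u32 = topk_ids.view(dtype=torch.uint32)
assert as_u32.dtype == torch.uint32
assert as_u32.data_ptr() == topk_ids.data_ptr()  # no copy was made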