Fused moe tuning ep #20863
base: main
New Dockerfile (@@ -0,0 +1,20 @@):
ARG CUDA_VERSION=12.8.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04

RUN apt update && apt install git -y && apt install curl -y
Review comment: To optimize the Docker image size and improve build caching, it's a best practice to combine these `apt` commands into a single `RUN` instruction.
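A minimal sketch of what that consolidation could look like; switching to `apt-get` and cleaning up the package lists afterwards are common additions, not something the PR or the reviewer specifies:

```dockerfile
# Single layer for the apt work; remove the package index so it does not bloat the image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git curl \
    && rm -rf /var/lib/apt/lists/*
```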
WORKDIR /workspace
RUN git clone https://github.com/vllm-project/vllm.git

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Install vllm.
WORKDIR /workspace/vllm
RUN uv venv .vllm --python 3.12
RUN . .vllm/bin/activate && VLLM_USE_PRECOMPILED=1 uv pip install -e .

# Checkout a specific commit.
ENV VLLM_SHA=550f8a052cae03c7e14a46767f689ab09c1cc28d
RUN git fetch && git checkout ${VLLM_SHA}
Review comment on lines +11 to +18: The order of operations for installing vLLM is incorrect. The current Dockerfile installs vLLM from the default branch and then checks out the specific commit defined by `VLLM_SHA`, so the installed package may not match the pinned commit. The checkout should happen before the editable install.
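A sketch of the suggested reordering, using the same commands from the diff and only moving the checkout ahead of the install:

```dockerfile
WORKDIR /workspace/vllm
# Pin the tree to the desired commit first...
ENV VLLM_SHA=550f8a052cae03c7e14a46767f689ab09c1cc28d
RUN git fetch && git checkout ${VLLM_SHA}
# ...then create the venv and install from that checked-out tree.
RUN uv venv .vllm --python 3.12
RUN . .vllm/bin/activate && VLLM_USE_PRECOMPILED=1 uv pip install -e .
```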
ENTRYPOINT ["/bin/bash"]
New justfile of tuning recipes (@@ -0,0 +1,63 @@):
all:
    just llama-scout-bf16 && \
    just llama-scout-fp8 && \
    just llama-maverick && \
    just qwen-30b && \
    just qwen-30b-fp8 && \
    just qwen-235b && \
    just deepseek-r1

llama-scout-bf16:
    python3 benchmark_moe.py \
        --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
        --tp-size 1 \
        --ep-size 8 \
        --tune

llama-scout-fp8:
    python3 benchmark_moe.py \
        --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
        --tp-size 1 \
        --ep-size 8 \
        --dtype fp8_w8a8 \
        --tune

llama-maverick:
    python3 benchmark_moe.py \
        --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
        --tp-size 1 \
        --ep-size 8 \
        --dtype fp8_w8a8 \
        --tune

qwen-30b:
    python3 benchmark_moe.py \
        --model Qwen/Qwen3-30B-A3B \
        --tp-size 1 \
        --ep-size 8 \
        --tune

qwen-30b-fp8:
    python3 benchmark_moe.py \
        --model Qwen/Qwen3-30B-A3B-FP8 \
        --tp-size 1 \
        --ep-size 8 \
        --dtype fp8_w8a8 \
        --tune

qwen-235b:
    python3 benchmark_moe.py \
        --model Qwen/Qwen3-235B-A22B \
        --tp-size 1 \
        --ep-size 8 \
        --dtype fp8_w8a8 \
        --tune

deepseek-r1:
    python3 benchmark_moe.py \
        --model deepseek-ai/DeepSeek-R1-0528 \
        --tp-size 1 \
        --ep-size 8 \
        --dtype fp8_w8a8 \
        --tune
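For context, a hedged usage sketch of the intended workflow; the image tag is illustrative, and running `just` inside the container assumes it is installed there (the Dockerfile above does not install it):

```sh
# Build the tuning image from the PR's Dockerfile (tag name is illustrative).
docker build -t moe-tune .
# Start an interactive container with GPU access; the ENTRYPOINT drops into bash.
docker run --gpus all --rm -it moe-tune
# Inside the container, either run a recipe, e.g. `just llama-scout-bf16`,
# or invoke the benchmark directly:
#   python3 benchmark_moe.py --model Qwen/Qwen3-30B-A3B --tp-size 1 --ep-size 8 --tune
```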
Review comment: The specified CUDA version `12.8.1` does not appear to be a valid tag for the `nvidia/cuda` Docker image. This will likely cause the Docker build to fail. Please use a valid CUDA version from Docker Hub; for example, `12.5.1` is a recent, valid version.
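Since the version is already exposed as an `ARG`, one way to address this without editing the file is to override it at build time; the exact tag to pass is whatever valid `nvidia/cuda` tag is chosen:

```sh
# Override the CUDA_VERSION build argument instead of editing the Dockerfile.
docker build --build-arg CUDA_VERSION=12.5.1 -t moe-tune .
```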