
[Bugfix] Fix topk_ids indices_type for cutlass w8a8 fp8 moe #1

Closed
wants to merge 2 commits

Conversation

minosfuture
Owner

Purpose

This PR fixes the following error when starting expert parallelism (EP) on Maverick:

(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]     run_cutlass_moe_fp8(output, hidden_states, w1, w2, topk_ids,
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]   File "/home/yeq/gitrepos/vllm/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 89, in run_cutlass_moe_fp8
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]     local_topk_ids = torch.where(expert_map[topk_ids] != -1,
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]                                  ~~~~~~~~~~^^^^^^^^^^
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527] IndexError: tensors used as indices must be long, int, byte or bool tensors

In the PPLX implementation (vllm-project#18762), the topk_ids dtype was flipped to uint32, here.
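For context, here is a minimal sketch of the underlying restriction (illustrative shapes, not the PR's code); it assumes a PyTorch build that exposes torch.uint32:

# Repro sketch: PyTorch advanced indexing only accepts long, int, byte, or bool
# index tensors, so a uint32 topk_ids tensor trips the IndexError shown above.
import torch

expert_map = torch.full((128,), -1, dtype=torch.int32)   # global -> local expert id
expert_map[:16] = torch.arange(16, dtype=torch.int32)    # this rank owns experts 0..15

topk_ids = torch.tensor([[3, 40], [7, 120]], dtype=torch.uint32)
try:
    expert_map[topk_ids]          # IndexError: uint32 is not a valid index dtype
except IndexError as e:
    print(e)

# Casting to a supported integer dtype makes the same lookup work.
topk_ids_i64 = topk_ids.to(torch.int64)
local_topk_ids = torch.where(expert_map[topk_ids_i64] != -1,
                             expert_map[topk_ids_i64], -1)
print(local_topk_ids)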

Besides this PR, workspace_shapes also needs the fix from vllm-project#19168, here; otherwise, torch.zeros is slow because the workspace it has to zero-fill here is much larger than necessary.
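As a rough illustration of that second point (assumed element counts, not vLLM's actual workspace shapes), zero-filling an oversized buffer on every call is measurably slower than allocating a right-sized one:

# Timing sketch: compare torch.zeros cost for a small vs. an oversized workspace.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def time_zeros(numel: int) -> float:
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    torch.zeros(numel, dtype=torch.float16, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

for numel in (1 << 20, 1 << 28):   # ~1M vs ~268M fp16 elements
    print(f"{numel:>12} elements zeroed in {time_zeros(numel) * 1e3:.2f} ms")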

Test Plan

# serve
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
        --max_model_len 8192 \
        --kv_cache_dtype fp8 \
        --enable-expert-parallel \
        --tensor-parallel-size 8 \
        --trust-remote-code \
        --enforce_eager \
        --gpu-memory-utilization 0.8 \
        --disable-log-requests 2>&1 | tee .env/ep_`date +%Y%m%d_%H%M%S`.log
# benchmark serve
python benchmarks/benchmark_serving.py  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
        --port 8000  --dataset-name random  --ignore-eos  --num-prompts 500   --max-concurrency 128 \
        --random-input-len 2000 --random-output-len 150

Test Result

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  42.34
Total input tokens:                      998815
Total generated tokens:                  75000
Request throughput (req/s):              11.81
Output token throughput (tok/s):         1771.43
Total Token throughput (tok/s):          25362.46
---------------Time to First Token----------------
Mean TTFT (ms):                          1119.22
Median TTFT (ms):                        384.95
P99 TTFT (ms):                           5939.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.31
Median TPOT (ms):                        66.14
P99 TPOT (ms):                           67.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           63.31
Median ITL (ms):                         33.70
P99 ITL (ms):                            198.63
==================================================

(Optional) Documentation Update


Signed-off-by: Ming Yang <yming@meta.com>
@minosfuture
Owner Author

moved to vllm-project#20166. closing.

minosfuture closed this Jul 1, 2025