
[Bugfix] Fix topk_ids indices_type for cutlass w8a8 fp8 moe #1

Closed
wants to merge 2 commits

Conversation

minosfuture
Owner

Purpose

This PR fixes the following error when starting expert parallelism (EP) on Maverick:

(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]     run_cutlass_moe_fp8(output, hidden_states, w1, w2, topk_ids,
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]   File "/home/yeq/gitrepos/vllm/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 89, in run_cutlass_moe_fp8
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]     local_topk_ids = torch.where(expert_map[topk_ids] != -1,
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527]                                  ~~~~~~~~~~^^^^^^^^^^
(VllmWorker rank=3 pid=1737537) ERROR 06-15 22:58:28 [multiproc_executor.py:527] IndexError: tensors used as indices must be long, int, byte or bool tensors

In the PPLX implementation (vllm-project#18762), the topk_ids dtype was flipped to uint32, here.
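For context, here is a minimal sketch of the underlying restriction (illustrative shapes, not the PR's code); it assumes a PyTorch build that exposes torch.uint32:

# Repro sketch: PyTorch advanced indexing only accepts long, int, byte, or bool
# index tensors, so a uint32 topk_ids tensor trips the IndexError shown above.
import torch

expert_map = torch.full((128,), -1, dtype=torch.int32)   # global -> local expert id
expert_map[:16] = torch.arange(16, dtype=torch.int32)    # this rank owns experts 0..15

topk_ids = torch.tensor([[3, 40], [7, 120]], dtype=torch.uint32)
try:
    expert_map[topk_ids]          # IndexError: uint32 is not a valid index dtype
except IndexError as e:
    print(e)

# Casting to a supported integer dtype makes the same lookup work.
topk_ids_i64 = topk_ids.to(torch.int64)
local_topk_ids = torch.where(expert_map[topk_ids_i64] != -1,
                             expert_map[topk_ids_i64], -1)
print(local_topk_ids)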

Besides this PR, workspace_shapes also needs the fix from vllm-project#19168, here; otherwise, torch.zeros is slow because the workspace it has to zero-fill here is much larger than necessary.
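As a rough illustration of that second point (assumed element counts, not vLLM's actual workspace shapes), zero-filling an oversized buffer on every call is measurably slower than allocating a right-sized one:

# Timing sketch: compare torch.zeros cost for a small vs. an oversized workspace.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def time_zeros(numel: int) -> float:
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    torch.zeros(numel, dtype=torch.float16, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

for numel in (1 << 20, 1 << 28):   # ~1M vs ~268M fp16 elements
    print(f"{numel:>12} elements zeroed in {time_zeros(numel) * 1e3:.2f} ms")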

Test Plan

# serve
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
        --max_model_len 8192 \
        --kv_cache_dtype fp8 \
        --enable-expert-parallel \
        --tensor-parallel-size 8 \
        --trust-remote-code \
        --enforce_eager \
        --gpu-memory-utilization 0.8 \
        --disable-log-requests 2>&1 | tee .env/ep_`date +%Y%m%d_%H%M%S`.log
# benchmark serve
python benchmarks/benchmark_serving.py  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
        --port 8000  --dataset-name random  --ignore-eos  --num-prompts 500   --max-concurrency 128 \
        --random-input-len 2000 --random-output-len 150

Test Result

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  42.34
Total input tokens:                      998815
Total generated tokens:                  75000
Request throughput (req/s):              11.81
Output token throughput (tok/s):         1771.43
Total Token throughput (tok/s):          25362.46
---------------Time to First Token----------------
Mean TTFT (ms):                          1119.22
Median TTFT (ms):                        384.95
P99 TTFT (ms):                           5939.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.31
Median TPOT (ms):                        66.14
P99 TPOT (ms):                           67.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           63.31
Median ITL (ms):                         33.70
P99 ITL (ms):                            198.63
==================================================

(Optional) Documentation Update


Signed-off-by: Ming Yang <yming@meta.com>
@minosfuture
Owner Author

moved to vllm-project#20166. closing.

minosfuture closed this Jul 1, 2025