
removing quant and kv-cache fp8 from deepseek run instructions #509


Open
wants to merge 2 commits into base: main

Conversation

arakowsk-amd

No description provided.

@shajrawi (Collaborator) left a comment


Please add a description of why you are proposing this change.

@@ -377,7 +377,7 @@ python3 /app/vllm/benchmarks/benchmark_serving.py \
 # Offline throughput
 python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
 --input-len <> --output-len <> --tensor-parallel-size 8 \
---quantization fp8 --kv-cache-dtype fp8 --dtype float16 \
+--dtype float16 \
Collaborator

Can you specify why?

Author

Raises an error:

export VLLM_MLA_DISABLE=0
export VLLM_USE_AITER=1
export VLLM_USE_TRITON_FLASH_ATTN=1
python3 /app/vllm/benchmarks/benchmark_throughput.py --model /data/DeepSeek-R1/ \
    --input-len 128 --output-len 128 --tensor-parallel-size 8 \
    --quantization fp8 --kv-cache-dtype fp8 --dtype bfloat16 \
    --max-model-len 32768 --block-size=1 --trust-remote-code
 
 
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/triton_mla.py", line 63, in __init__
[rank0]:     raise NotImplementedError(
[rank0]: NotImplementedError: TritonMLA with FP8 KV cache not yet supported
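
For reference, the same invocation with the two FP8 flags removed, which is what this PR does to the documented run instructions, should sidestep the TritonMLA FP8 KV-cache path. This is only a sketch derived from the command above and has not been re-run here:

export VLLM_MLA_DISABLE=0
export VLLM_USE_AITER=1
export VLLM_USE_TRITON_FLASH_ATTN=1
# identical to the failing command, minus --quantization fp8 and --kv-cache-dtype fp8
python3 /app/vllm/benchmarks/benchmark_throughput.py --model /data/DeepSeek-R1/ \
    --input-len 128 --output-len 128 --tensor-parallel-size 8 \
    --dtype bfloat16 --max-model-len 32768 --block-size=1 --trust-remote-code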

Collaborator

Why is Triton MLA being used with AITER? cc @qli88


@arakowsk-amd are you using the latest version? If you'd like we can discuss through Teams.


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Jul 11, 2025