removing quant and kv-cache fp8 from deepseek run instructions #509
base: main
Conversation
Please add a description of why you are proposing this.
docs/dev-docker/README.md (outdated)
@@ -377,7 +377,7 @@ python3 /app/vllm/benchmarks/benchmark_serving.py \
 # Offline throughput
 python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
     --input-len <> --output-len <> --tensor-parallel-size 8 \
-    --quantization fp8 --kv-cache-dtype fp8 --dtype float16 \
+    --dtype float16 \
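For reference, a sketch of how the offline-throughput example reads with this change applied; the trailing backslash in the hunk means further flags follow in the README (unchanged by this PR), and the `<>` placeholders are filled in by the user:

```bash
# Offline throughput benchmark for DeepSeek-V3, with the FP8 quantization
# and FP8 KV-cache flags removed (any flags after --dtype in the README
# are unchanged and omitted here)
python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
    --input-len <> --output-len <> --tensor-parallel-size 8 \
    --dtype float16
```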
Can you specify why?
It raises an error:
export VLLM_MLA_DISABLE=0
export VLLM_USE_AITER=1
export VLLM_USE_TRITON_FLASH_ATTN=1
python3 /app/vllm/benchmarks/benchmark_throughput.py --model /data/DeepSeek-R1/ \
    --input-len 128 --output-len 128 --tensor-parallel-size 8 \
    --quantization fp8 --kv-cache-dtype fp8 --dtype bfloat16 \
    --max-model-len 32768 --block-size=1 --trust-remote-code
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/triton_mla.py", line 63, in __init__
[rank0]: raise NotImplementedError(
[rank0]: NotImplementedError: TritonMLA with FP8 KV cache not yet supported
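Dropping the two FP8 flags, as this PR proposes for the README, should avoid the TritonMLA FP8 KV-cache check above. A sketch of the adjusted reproduction, using the same environment and paths as the failing command:

```bash
# Same reproduction as above, minus --quantization fp8 and --kv-cache-dtype fp8,
# mirroring the README change proposed in this PR
export VLLM_MLA_DISABLE=0
export VLLM_USE_AITER=1
export VLLM_USE_TRITON_FLASH_ATTN=1
python3 /app/vllm/benchmarks/benchmark_throughput.py --model /data/DeepSeek-R1/ \
    --input-len 128 --output-len 128 --tensor-parallel-size 8 \
    --dtype bfloat16 --max-model-len 32768 --block-size=1 --trust-remote-code
```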
Why is Triton MLA being used with AITER? cc @qli88
@arakowsk-amd are you using the latest version? If you'd like, we can discuss through Teams.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
No description provided.