feat: enable FP8 quantized models loading #316

Open
rafvasq wants to merge 7 commits into main

Conversation

rafvasq (Collaborator) commented on Jul 16, 2025

Description

  • "Enable" FP8 quantized models to load via fms_mo
  • Updates example to include quantization flag
  • Adds quantization flag (unused for now) in generate_spyre_vllm_output
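
As a usage sketch (not code from this PR), loading an fms_mo FP8 checkpoint through vLLM's offline Python API would presumably look like the following; the model path is a placeholder and the exact flag handling on Spyre may differ:

from vllm import LLM, SamplingParams

# Hypothetical local path to an FP8 checkpoint produced with fms_mo (placeholder).
llm = LLM(
    model="/models/granite-8b-fp8",
    quantization="fp8",  # must be listed in SpyrePlatform.supported_quantization
    max_model_len=2048,
)

outputs = llm.generate(["Hello from Spyre"], SamplingParams(max_tokens=20))
print(outputs[0].outputs[0].text)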

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all of the linting checks, otherwise your PR cannot be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

rafvasq changed the title from "[DRAFT] Quantized model testing" to "[WIP] Quantized model testing" on Jul 16, 2025
rafvasq added 3 commits on July 22, 2025
rafvasq changed the title from "[WIP] Quantized model testing" to "feat: enable FP8 quantized models loading" on Jul 22, 2025
rafvasq marked this pull request as ready for review on July 22, 2025
@@ -40,7 +40,7 @@ class SpyrePlatform(Platform):
     # "spyre" device_name no longer worked due to https://github.com/vllm-project/vllm/pull/16464
     device_name: str = "cpu"
     device_type: str = "cpu"
-    supported_quantization: list[str] = ["gptq"]
+    supported_quantization: list[str] = ["gptq", "fp8", "compressed-tensors"]
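
For context, supported_quantization is the per-platform allow-list that vLLM validates a requested quantization method against. A minimal illustrative check of that kind (not code from this PR, and not vLLM's actual implementation):

from vllm.platforms import current_platform

# current_platform resolves to SpyrePlatform when the Spyre plugin is active.
requested = "fp8"
if requested not in current_platform.supported_quantization:
    raise ValueError(
        f"{requested} quantization is not supported on {current_platform.device_name}")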

Collaborator:

I don't see a mention of compressed-tensors anywhere else in this PR.

Collaborator:

Maybe worth a comment linking to https://github.com/foundation-model-stack/fms-model-optimizer/pull/154/files#diff-1fb88c10872b0f03f1d0f6a00cd20328cdbb5e0e8bc53aa623be49b9ee3efe57R4-R9

It sounds like fp8 support in fms-mo is focused on compressed-tensors.
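
If fms-mo FP8 checkpoints do ship a compressed-tensors style quantization_config, one quick way to confirm what a given checkpoint declares is to inspect its config.json; a small sketch (checkpoint path is a placeholder):

import json

# Placeholder path to an fms-mo produced FP8 checkpoint.
with open("/models/granite-8b-fp8/config.json") as f:
    cfg = json.load(f)

# HF-style checkpoints record the method under quantization_config["quant_method"],
# e.g. "compressed-tensors" or "fp8", which vLLM can use to pick the loading path.
print(cfg.get("quantization_config", {}).get("quant_method"))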

rafvasq requested a review from maxdebayser on July 23, 2025
3 participants