NVFP4 Emulation #59

dsikka · 2025-05-01T03:08:49Z

Summary:

Add CompressedTensors NVFP4 Emulation Scheme
Move emulation functionality into shared utilities

ModelOpt Emulation Changes:

Disable activation quantization to start. Outputs still appeared to be gibberish with it turned off.
Update how the global scale is applied. I think the checkpoint stores the inverse of the global scale (unconfirmed). With this change, the Nvidia 70b checkpoint produces coherent outputs.
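
For reference, a minimal sketch of the weight-only dequantization arithmetic discussed above, written in plain Python for readability. This is not the code in this PR. It assumes the standard NVFP4 layout (4-bit E2M1 values, one local scale per block of 16 elements, one global scale per tensor). The function names and the global_scale_is_inverse flag are hypothetical and only illustrate the two ways the global scale could be applied.

```python
# E2M1 (FP4) magnitudes, indexed by the low 3 bits of each 4-bit code.
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one local scale across 16 consecutive elements


def decode_fp4(code):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * FP4_E2M1_MAGNITUDES[code & 0b0111]


def dequantize_row(codes, local_scales, global_scale, global_scale_is_inverse=True):
    """Emulated weight-only dequantization of one row of FP4 codes.

    global_scale_is_inverse is a hypothetical flag for this sketch:
    True divides the local scale by the global scale (the "stored inverse"
    reading from the description), False multiplies by it.
    """
    if len(codes) != len(local_scales) * BLOCK_SIZE:
        raise ValueError("expected one local scale per block of 16 codes")
    out = []
    for i, code in enumerate(codes):
        scale = local_scales[i // BLOCK_SIZE]
        if global_scale_is_inverse:
            scale = scale / global_scale
        else:
            scale = scale * global_scale
        out.append(decode_fp4(code) * scale)
    return out


# Two blocks: the first holds +6.0 codes, the second holds -0.5 codes.
codes = [0b0111] * BLOCK_SIZE + [0b1001] * BLOCK_SIZE
row = dequantize_row(codes, local_scales=[2.0, 4.0], global_scale=4.0)
print(row[0], row[BLOCK_SIZE])
```

Flipping the flag changes every dequantized weight by a factor of global_scale squared, which may be why applying the scale the wrong way round leads to incoherent generations.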

Should now support compressed-tensors (ct) models produced and compressed with the following branches:

FP4 Weights: [NVFP4][WIP] Add NVFp4 Support compressed-tensors#287
Compression: [WIP][NVFP4] Add compression/decompression code compressed-tensors#291
LLM Compressor: [NVFP4] Enable FP4 Weight-Only Quantization vllm-project/llm-compressor#1309

lm-eval runs and generations should work with weight-only dequantization:

lm_eval --model vllm \
    --model_args pretrained=nm-testing/llama2.c-stories110M-FP4,enforce_eager=True \
    --tasks gsm8k \
    --device cuda:0 \
    --batch_size 8

from vllm import LLM, SamplingParams

prompts = ["The Swiss Alps are", "The president of the USA is", "The Boston Bruins are"]

# Create a sampling params object (temperature 0.8 with top-p 0.95, so not greedy)
sampling_params = SamplingParams(temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10)
llm = LLM('nm-testing/llama2.c-stories110M-FP4', enforce_eager=True)

# Generate completions and print the outputs.
output = llm.generate(prompts, sampling_params)
for o in output:
    print(o.outputs[0].text)
    print("\n")

ToDo:

General activation quantization support. Still need to understand how the input scales should be applied.
Improve compression speed in compressed-tensors. It is currently a bottleneck.

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

github-actions · 2025-05-01T03:08:59Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

…e the same shape as the local scale

update

8072051

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

dsikka requested review from mgoin, robertgshaw2-redhat and tlrmchlsmth as code owners May 1, 2025 03:08

dsikka changed the title ~~update~~ NVFP4 Emulation May 1, 2025

dsikka mentioned this pull request May 1, 2025

[WIP] NVFP4 Emulation #58

Open

dsikka added 5 commits May 1, 2025 13:56

move emulation utils into a shared files; add ct nvfp4 scheme

3fc300e

update

c99589a

fix condition, clean-up code

4104a49

update

6e45f99

swizzle

1940c94

This was referenced May 5, 2025

[NVFP4][WIP] Add NVFp4 Support neuralmagic/compressed-tensors#287

Closed

[Compressor][NVFP4] Support FP4 Compression neuralmagic/compressed-tensors#311

Merged

dsikka added 6 commits May 8, 2025 19:02

add code to requantize with the max or expand the global scale to hav…

6412e5b

…e the same shape as the local scale

update

4f11acb

update

802b6af

update emulation

8a38a88

update emulation

f087703

add script

f679c61

dsikka closed this May 16, 2025
