
nvfp4 w4a4 doesn't work on Qwen/Qwen3-235B-A22B #1624

@ehartford

Description


Describe the bug
NVFP4 (W4A4) quantization fails on Qwen/Qwen3-235B-A22B (MoE) because llm-compressor's sequential FX tracing feeds transformers HFProxy objects into the torch.vmap calls inside transformers' masking_utils, and vmap rejects non-Tensor arguments.
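
For context, recent transformers versions build attention masks by vmap-ing a scalar mask predicate over batch, head, query and key indices; the outermost vmap uses in_dims=(0, None, None, None), which matches the signature in the traceback below. The sketch that follows imitates that pattern with illustrative names (it is not the library's exact code): it works with real index tensors, but during FX tracing an HFProxy lands in the batch slot and vmap refuses it.

import torch

def causal(batch_idx, head_idx, q_idx, kv_idx):
    # scalar predicate: may query position q_idx attend to key position kv_idx?
    return kv_idx <= q_idx

# Nest one vmap per index dimension, innermost (kv) first and batch last,
# roughly mirroring how masking_utils expands the predicate into a 4D mask.
fn = causal
fn = torch.vmap(fn, in_dims=(None, None, None, 0))  # kv index
fn = torch.vmap(fn, in_dims=(None, None, 0, None))  # query index
fn = torch.vmap(fn, in_dims=(None, 0, None, None))  # head index
fn = torch.vmap(fn, in_dims=(0, None, None, None))  # batch index (the one in the error)

mask = fn(torch.arange(2), torch.arange(1), torch.arange(4), torch.arange(4))
print(mask.shape)  # torch.Size([2, 1, 4, 4])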

Expected behavior
The oneshot run should complete and produce an NVFP4 (W4A4) compressed checkpoint of Qwen/Qwen3-235B-A22B, with only lm_head left unquantized.

Environment

package          version
llm-compressor   0.6.0 (latest PyPI)
transformers     4.43.0
torch            2.3.1 + CUDA 12.1
python           3.11
GPU              A100 / H100 (same on both)

To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import torch, datasets

model_id  = "Qwen/Qwen3-235B-A22B"
model     = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tok       = AutoTokenizer.from_pretrained(model_id)
ds        = datasets.load_dataset("HuggingFaceH4/ultrachat_200k",
                                  split="train_sft[:8]")  # tiny calib

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048,
        num_calibration_samples=8)          # --> TraceError / vmap-proxy error

Errors

ValueError: vmap(wrapped, in_dims=(0, None, None, None), ...)(<inputs>):
Got in_dim=0 for an input but the input is of type
<class 'transformers.utils.fx.HFProxy'>. We cannot vmap over non-Tensor arguments
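
The error itself can be reproduced outside llm-compressor: torch.vmap validates its inputs before calling the wrapped function, and any argument mapped with in_dims=0 that is not a torch.Tensor (here, transformers' HFProxy standing in for a tensor during tracing) raises this exact ValueError. Minimal sketch, with a placeholder class in place of HFProxy:

import torch

def f(x):
    return x + 1

class NotATensor:  # stand-in for transformers.utils.fx.HFProxy
    pass

try:
    torch.vmap(f, in_dims=(0,))(NotATensor())
except ValueError as e:
    print(e)
    # vmap(f, in_dims=(0,), ...)(<inputs>): Got in_dim=0 for an input but the
    # input is of type <class '__main__.NotATensor'>. We cannot vmap over
    # non-Tensor arguments ...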

Additional context
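
A possible workaround sketch, untested and with assumptions flagged: recent llm-compressor releases expose a pipeline argument on oneshot(); if the installed 0.6.0 build accepts it, forcing the non-tracing "basic" pipeline should skip the sequential FX trace of masking_utils entirely, at the cost of calibrating the whole model in memory at once.

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=8,
    pipeline="basic",  # assumption: kwarg supported by this release; avoids FX tracing
)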
