4-bit model used more RAM than bf16 in HF transformers #1780

Description

@weathon

System Info

Google Colab
A100 80 GB GPU, Linux
transformers 4.57.0.dev0
bitsandbytes 0.48.1

Reproduction

While fine-tuning Qwen3-VL-30B-A3B on Google Colab, I hit something very strange: loading the model in 4-bit actually raises a VRAM OOM error, while the bf16 version trains fine.

If I pass quantization_config=bnb_config to from_pretrained, I get the OOM; with that argument commented out (plain bf16, as in the script below), everything runs.
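Concretely, the failing run differs from the working one only in this single argument:

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,  # with this line: OOM; without it: runs fine
)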

from transformers import BitsAndBytesConfig
import torch

from transformers import AutoModelForImageTextToText, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
)

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # quantization_config=bnb_config,  # uncommenting this triggers the OOM
)

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.0,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ]
)


from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="qwen_anti_aesthetics_3b",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    max_length=None,
    optim="adamw_torch_fused",
    learning_rate=2e-5,
    weight_decay=0.001,
    logging_steps=10,
    eval_steps=500,
    logging_strategy="steps",
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    bf16=True,
    warmup_ratio=0.02,
    push_to_hub=True,
    report_to="wandb",
    remove_unused_columns=False,
    dataloader_num_workers=12,
    dataloader_prefetch_factor=4,
    dataloader_pin_memory=True,
    completion_only_loss=True,
    lr_scheduler_type="cosine",
)

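# train_ds and test_ds are my dataset objects, prepared earlier (omitted here)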
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
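Not part of my original run, but a quick way to quantify the gap for anyone reproducing this: transformers models expose get_memory_footprint(), and torch tracks peak allocations. Run once per variant (4-bit vs. bf16) and compare:

import torch

print(f"weights: {model.get_memory_footprint() / 2**30:.1f} GiB")

torch.cuda.reset_peak_memory_stats()
trainer.train()
print(f"peak VRAM during training: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")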

Expected behavior

The 4-bit model should use less VRAM than the bf16 one, or at the very least not more.
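For reference, a rough weight-only estimate (my back-of-envelope; ignores activations, KV cache, and LoRA/optimizer state) of why NF4 should come out much smaller:

params = 30e9                       # ~30B parameters
bf16_gib = params * 2 / 2**30       # 2 bytes/param -> ~55.9 GiB
nf4_gib = params * 0.5 / 2**30      # 4 bits/param  -> ~14.0 GiB (plus small quant-state overhead)
print(f"bf16 ≈ {bf16_gib:.1f} GiB, NF4 ≈ {nf4_gib:.1f} GiB")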
