
v4.52.1: Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, D-FINE, CSM, BitNet, LlamaGuard, TimesFM, MLCD, Janus, InternVL

Released by @LysandreJik on 20 May 16:32

New models

Qwen2.5-Omni


The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team at Alibaba Group.

The abstract from the technical report is the following:

We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.

Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.

In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.

Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
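The snippet below is a minimal, hedged sketch of text-only chat with the Thinker through transformers. The Qwen2_5OmniForConditionalGeneration / Qwen2_5OmniProcessor class names and the return_audio flag are assumptions based on the integration described above; Qwen/Qwen2.5-Omni-7B is the released checkpoint.

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Briefly explain the Thinker-Talker architecture."}]},
]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# return_audio=False (assumed flag) keeps only the Thinker's text output and skips the Talker
text_ids = model.generate(**inputs, max_new_tokens=64, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True))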

SAM-HQ

SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.

The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.


SAM-HQ introduces several key improvements over the original SAM model:

  1. High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
  2. Global-local Feature Fusion: Combines features from different stages of the model for improved mask details
  3. Training Data: Uses a carefully curated dataset of 44K high-quality masks instead of SA-1B
  4. Efficiency: Adds only 0.5% additional parameters while significantly improving mask quality
  5. Zero-shot Capability: Maintains SAM's strong zero-shot performance while improving accuracy

The abstract from the paper is the following:

The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.

Tips:

  • SAM-HQ produces higher quality masks than the original SAM model, particularly for objects with intricate structures and fine details
  • The model predicts binary masks with more accurate boundaries and better handling of thin structures
  • Like SAM, the model performs better with input 2D points and/or input bounding boxes (see the example sketch after these tips)
  • You can prompt multiple points for the same image and predict a single high-quality mask
  • The model maintains SAM's zero-shot generalization capabilities
  • SAM-HQ only adds ~0.5% additional parameters compared to SAM
  • Fine-tuning the model is not supported yet
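As a concrete illustration of the point-prompt workflow above, here is a minimal sketch that mirrors the original SAM usage; the SamHQModel / SamHQProcessor class names and the checkpoint id are assumptions, and the image URL is only an example.

import torch
import requests
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

checkpoint = "syscv-community/sam-hq-vit-base"  # assumed checkpoint name
processor = SamHQProcessor.from_pretrained(checkpoint)
model = SamHQModel.from_pretrained(checkpoint)

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # one 2D point prompt (x, y) on the image

inputs = processor(raw_image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution predicted masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
print(outputs.iou_scores)  # predicted mask quality scores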

GraniteMoeHybrid


The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
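Because the model plugs into the standard causal LM API, generation should look like the usual AutoModelForCausalLM flow. A minimal sketch is below; the checkpoint id is an illustrative assumption.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-tiny-preview"  # assumed GraniteMoeHybrid checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Mamba-style state space layers are useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))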

D-FINE


The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu.

The abstract from the paper is the following:

We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
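For reference, detection with the transformers port should follow the familiar DETR-style workflow. The sketch below is hedged: the checkpoint id is an assumption, and the post-processing call mirrors other DETR-family image processors.

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

checkpoint = "ustc-community/dfine-medium-coco"  # assumed checkpoint name
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForObjectDetection.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into labeled detections above a confidence threshold
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())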

CSM

The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.

Model Architecture:
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.

The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
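Below is a minimal, hedged sketch of context-free speech generation with the transformers integration; the CsmForConditionalGeneration class, the "[0]" speaker-id text format, the output_audio flag and the processor.save_audio helper are assumptions drawn from the model description above.

from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# "[0]" prefixes the text with the speaker id (assumed convention)
text = "[0]The past is just a story we tell ourselves."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(model.device)

# output_audio=True (assumed flag) makes generate return decoded waveforms instead of codebook tokens
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example.wav")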

BitNet


Trained on a corpus of 4 trillion tokens, BitNet demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).

LlamaGuard


Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It’s a dense 12B model pruned from the Llama 4 Scout model, and it can run on a single GPU (24 GB of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
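A minimal sketch of prompt moderation is shown below, assuming the Llama4ForConditionalGeneration class and the gated meta-llama/Llama-Guard-4-12B checkpoint; the model replies with a safety verdict such as "safe", or "unsafe" followed by the violated category codes.

import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration  # class name assumed

model_id = "meta-llama/Llama-Guard-4-12B"  # gated checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Moderate a user prompt before it reaches the downstream model
messages = [{"role": "user", "content": [{"type": "text", "text": "How do I reset my router password?"}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
verdict = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(verdict)  # e.g. "safe"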

TimesFM


TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and autoregressively predicts output patches.

The abstract from the paper is the following:

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
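The sketch below shows zero-shot forecasting with the transformers integration. The TimesFmModelForPrediction class, the past_values/freq call signature, the mean_predictions output field, and the checkpoint id are assumptions based on the description above.

import torch
from transformers import TimesFmModelForPrediction  # class name assumed

model = TimesFmModelForPrediction.from_pretrained("google/timesfm-2.0-500m-pytorch")  # assumed checkpoint

# Three toy input series of different lengths; freq encodes the temporal granularity (0=high, 1=medium, 2=low)
forecast_input = [
    torch.sin(torch.linspace(0, 20, 100)),
    torch.sin(torch.linspace(0, 20, 200)),
    torch.sin(torch.linspace(0, 20, 400)),
]
frequency_input = torch.tensor([0, 1, 2], dtype=torch.long)

with torch.no_grad():
    outputs = model(past_values=forecast_input, freq=frequency_input, return_dict=True)

point_forecast = outputs.mean_predictions  # (batch, horizon) point forecasts
print(point_forecast.shape)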

MLCD


The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.

Janus


The Janus model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by the DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can take both images and text as input and can generate either text or image output.

Note

The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
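As a hedged illustration of that switch, the sketch below requests text output via a generation_mode argument; the JanusForConditionalGeneration / JanusProcessor class names, the generation_mode parameter name, and the checkpoint id are assumptions.

from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"  # assumed checkpoint name
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# generation_mode (assumed parameter) selects text vs. image generation
output = model.generate(**inputs, max_new_tokens=40, generation_mode="text")
print(processor.decode(output[0], skip_special_tokens=True))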

The abstract from the original paper is the following:

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

InternVL

The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

The abstract from the paper is the following:

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.


Overview of the InternVL3 model architecture, which is the same as InternVL2.5. Taken from the original checkpoint.


Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.
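For reference, inference with the transformers port should follow the standard image-text-to-text chat flow; in the sketch below the checkpoint id is an assumption.

from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "OpenGVLab/InternVL3-1B-hf"  # assumed transformers-format checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(output[0], skip_special_tokens=True))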

Kernel integration

We integrate optimized kernels into the transformers library via the kernels package: https://github.com/huggingface/kernels. We start with a few kernels in the Llama model and will iterate to identify the best performance optimizations.

TP support

In the previous release, we added tensor parallelism (TP) support to run distributed inference. However, it is not yet supported for all quantization methods; we are progressively extending coverage. Right now, only compressed-tensors, fp8 and fp8-fbgemm support it.
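As a reminder, a minimal sketch of TP inference is shown below, launched with e.g. `torchrun --nproc-per-node 4 tp_inference.py`. The tp_plan="auto" argument refers to the existing TP API, and the choice of checkpoint is an assumption.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any TP-supported model; this id is an example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, tp_plan="auto")

inputs = tokenizer("Tensor parallelism shards each layer across GPUs.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))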

Quantization

AutoRound

From the AutoRound contributors:

AutoRound is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision. It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps ... More details here: https://github.com/intel/auto-round
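A minimal sketch of loading an AutoRound-quantized checkpoint through transformers is shown below, assuming the auto-round package is installed (pip install auto-round); the repo id is a placeholder for any AutoRound export on the Hub.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model-autoround-int4"  # placeholder repo id
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("AutoRound keeps accuracy at low bit widths because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))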

Quantization Documentation

We have added two new sections to the documentation to help users better understand and get started with quantization.

GGUF

We've added GGUF support for the Gemma3 family of models.
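A minimal sketch of loading a GGUF file directly is shown below; the gguf_file argument is the existing GGUF entry point, while the repo id and filename are placeholders to swap for a real Gemma3 GGUF repository.

from transformers import AutoModelForCausalLM, AutoTokenizer

# gguf_file points from_pretrained at a specific GGUF file inside the repo
repo_id = "your-org/gemma-3-1b-it-GGUF"        # placeholder repo id
gguf_file = "gemma-3-1b-it-Q4_K_M.gguf"        # placeholder filename

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)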

Fast image processors

Most vision models and VLMs in Transformers can now benefit from fast image processors. By utilizing torch/torchvision functional transforms, these processors offer a substantial speedup when processing images compared to PIL/NumPy functions, and support processing on both CPU and CUDA.
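A minimal sketch of opting into a fast processor is shown below; the checkpoint id is only an example, and the device argument on the call is an assumption based on the CUDA support described above.

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor

# use_fast=True selects the torchvision-backed fast processor when one exists for the checkpoint
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocessing can also run on CUDA (device argument assumed per the description above)
inputs = processor(images=image, return_tensors="pt", device="cuda" if torch.cuda.is_available() else "cpu")
print(inputs["pixel_values"].shape)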

AutoDocstring

The new @auto_docstring decorator makes it easier to add proper documentation when contributing a model without bloating the modeling code.
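A hedged sketch of how the decorator is typically applied is shown below; the variant class is a placeholder, and the import path and decorator behavior are assumptions rather than verbatim from these notes.

from transformers import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaPreTrainedModel
from transformers.utils import auto_docstring  # import path assumed


@auto_docstring  # builds the standard class docstring from the config and shared templates
class MyLlamaVariantPreTrainedModel(LlamaPreTrainedModel):
    config_class = LlamaConfig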

Custom generate

We now support loading custom generate methods and using them through model.generate. Custom generate methods can be stored on the Hub, enabling quick distribution of experiments with new caches, decoding methods, heuristics, and more.

from transformers import AutoModelForCausalLM, AutoTokenizer

# `generate` with `custom_generate` -> `generate` uses custom code
# note: calling the custom method prints "✨ using a custom generation method ✨"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")

inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))

You can find the docs here, and all custom generation methods by searching for the custom_generate tag.

  • [generate] Run custom generation code from the Hub by @gante in #36405

Chat CLI

The transformers-cli command is updated to be simpler and cleaner, specifically for its chat variant.

The following is now possible and recommended:

transformers chat Qwen/Qwen2.5-3B-Instruct

Additionally, almost any generate flag, present and future, can now be passed as a positional argument, rather than being limited to a hardcoded set of flags. For example:

transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
  • Transformers cli clean command by @LysandreJik in #37657
  • [chat] clean code and add base help by @gante in #37892
  • [chat] generate parameterization powered by GenerationConfig and UX-related changes by @gante in #38047

Breaking changes

Deprecations

The agents folder is finally removed from transformers in favour of using smolagents.

We are moving away from torch 2.0, as it was released more than two years ago.

General bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @cyyever
    • Use Python 3.9 syntax in examples (#37279)
    • Use Python 3.9 syntax in tests (#37343)
    • Remove old code for PyTorch, Accelerator and tokenizers (#37234)
    • Add weights_only=True to torch.load (#37062)
    • Add XPU case to is_torch_bf16_gpu_available (#37132)
    • Remove fsspec dependency which isn't directly used by transformers (#37318)
    • Add Optional to remaining types (#37808)
    • Enable RUF013 to enforce optional typing (#37266)
  • @yao-matrix
    • enable 2 llama UT cases on xpu (#37126)
    • enhance require_deterministic_for_xpu (#37437)
    • make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM (#37480)
    • fix and enhance pipeline_webserver.md (#36992)
    • enable 5 cases on XPU (#37507)
    • enable several cases on XPU (#37516)
    • enable test_offloaded_cache_implementation on XPU (#37514)
    • enable 3 mpt test cases on XPU (#37546)
    • enable 6 rt_detr_v2 cases on xpu (#37548)
    • enable 6 gemma2 cases on XPU (#37564)
    • enable 6 modeling cases on XPU (#37571)
    • fix 2 encoder_decoder issues on XPU (#37572)
    • enable mllama cases on xpu (#37644)
    • enable 6 granite cases on xpu (#37569)
    • enable blip2 and emu3 cases on XPU (#37662)
    • enable cpu offloading for Bark on xpu (#37599)
    • enable 4 test_trainer cases on XPU (#37645)
    • enable internvl UTs on XPU (#37779)
    • enable xpu in test_trainer (#37774)
    • make aya vision 5 integration tests pass on xpu (#37990)
    • enable mamba2 integration cases on xpu (#38006)
    • make mistral3 pass on xpu (#37882)
    • enable utils test cases on XPU (#38005)
    • enable generation fsdp/utils cases on XPU (#38009)
    • enable finegrained_fp8 and granite_speech cases on XPU (#38036)
    • enable csm integration cases on xpu, all passed (#38140)
    • enable trainer test cases on xpu (#38138)
    • enable autoround cases on XPU (#38167)
    • clean autoawq cases on xpu (#38163)
  • @alex-jw-brooks
    • Expose blip2qformer (#37254)
    • Add Granite Speech Support (#36801)
    • Fix qwen2audio wanr -> warn (#37559)
    • Allow Exclusion of Input IDs from RepetitionPenaltyLogitsProcessor (#37625)
    • Enable granite speech 3.3 tests (#37560)
  • @BakerBunker
    • Add Qwen2.5-Omni (#36752)
    • Fix inference bugs in Qwen2.5 Omni (#37701)
    • Fix embeds_to_talker device in Qwen2.5-Omni (#37739)
    • Fix Qwen2.5 Omni SinusoidsPositionEmbedding precision (#38151)
  • @rootonchair
    • Add Fast Image Processor for Perceiver (#37176)
    • Add Fast Image Processor for Flava (#37135)
    • Add Fast Image Processor for LayoutLMv2 (#37203)
    • Add Fast Image Processor for LayoutLMv3 (#37201)
    • Add Fast Image Processor for Donut (#37081)
    • Bridgetower fast image processor (#37373)
    • Add Fast Image Processor for PoolFormer (#37182)
  • @flukeskywalker
    • Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 (#37381)
  • @keetrap
    • Add Fast LeViT Processor (#37154)
    • Add Fast Mobilenet-V2 Processor (#37113)
    • Add Fast owlvit Processor (#37164)
    • Add Fast Yolos Processor (#37292)
    • Add Fast Chinese-CLIP Processor (#37012)
    • Add Fast Conditional-DETR Processor (#37071)
    • Add Fast Grounding-Dino Processor (#37108)
    • Add Fast PVT Processor (#37204)
  • @tanhuajie
  • @jinan-zhou
    • Add TimesFM Time Series Forecasting Model (#34082)
  • @yaswanth19
  • @saswatmeher
    • chore: update model card for SigLIP (#37585)
    • chore: update SigLIP2 model card (#37624)
  • @cyr0930
    • [fix] make legacy bnb code work (#37331)
    • [llava] one pixel is missing from padding when length is odd (#37819)
    • [bug] fix llava processor to calculate unpadding size correctly (#37988)
  • @wenhuach21
    • Add AutoRound quantization support (#37393)
  • @devxaitist
    • 🌐 [i18n-KO] Translated siglip.md to Korean (#37145)
    • Add Fast Image Processor for vilt (#37304)
  • @co63oc
    • Fix typos in comments (#37694)
    • Fix typos in strings and comments (#37784)
    • Fix typos in strings and comments (#37799)
    • Fix typos in strings and comments (#37910)
  • @guangy10
    • Gemma3 is Torch Exportable (#37728)
    • Allow override inputs to export recipe (#37508)
    • Fix Qwen models export with torch 2.7 (#37985)
  • @sushmanthreddy
    • Samhq model addition (#35147)
  • @VladOS95-cyber
    • Add D-FINE Model into Transformers (#36261)
  • @Ssukriti
    • Add GraniteMoeHybrid support for 4.0 (#37658)