v4.52.1: Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, D-FINE, CSM, BitNet, LlamaGuard, TimesFM, MLCD, Janus, InternVL
New models
Qwen2.5-Omni

The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team at Alibaba Group.
The abstract from the technical report is the following:
We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.
Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.
In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
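A minimal text-only usage sketch; the class, processor, and generation arguments below, as well as the Qwen/Qwen2.5-Omni-7B checkpoint id, are assumptions, so check the model documentation for the exact API:

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

# assumed class and checkpoint names
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", device_map="auto")

conversation = [{"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# return_audio=False keeps only the Thinker's text output; otherwise the Talker also returns a waveform
text_ids = model.generate(**inputs, return_audio=False, max_new_tokens=32)
print(processor.batch_decode(text_ids, skip_special_tokens=True))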
SAM-HQ
SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.
The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.
SAM-HQ introduces several key improvements over the original SAM model:
- High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
- Global-local Feature Fusion: Combines features from different stages of the model for improved mask details
- Training Data: Uses a carefully curated dataset of 44K high-quality masks instead of SA-1B
- Efficiency: Adds only 0.5% additional parameters while significantly improving mask quality
- Zero-shot Capability: Maintains SAM's strong zero-shot performance while improving accuracy
The abstract from the paper is the following:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Tips:
- SAM-HQ produces higher quality masks than the original SAM model, particularly for objects with intricate structures and fine details
- The model predicts binary masks with more accurate boundaries and better handling of thin structures
- Like SAM, the model performs better with input 2D points and/or input bounding boxes
- You can prompt multiple points for the same image and predict a single high-quality mask
- The model maintains SAM's zero-shot generalization capabilities
- SAM-HQ only adds ~0.5% additional parameters compared to SAM
- Fine-tuning the model is not supported yet
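A hedged point-prompted segmentation sketch; the SamHQModel/SamHQProcessor class names and the checkpoint id are assumptions, and the post-processing mirrors the original SAM API:

import torch
import requests
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

# assumed checkpoint id
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-base")
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # one 2D point prompt for the image

inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# upscale the predicted low-resolution masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
)
scores = outputs.iou_scores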
GraniteMoeHybrid
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
D-FINE

The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu.
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
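A hedged detection sketch; the checkpoint id is illustrative, and D-FINE is assumed to be reachable through the AutoModelForObjectDetection mapping:

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# assumed checkpoint id
image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine-medium-coco")
model = AutoModelForObjectDetection.from_pretrained("ustc-community/dfine-medium-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())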
CSM
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture:
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
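A hedged context-free generation sketch; the CsmForConditionalGeneration class name and the generation/saving helpers are assumptions based on the model documentation:

from transformers import AutoProcessor, CsmForConditionalGeneration

processor = AutoProcessor.from_pretrained("sesame/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b", device_map="auto")

# the "[0]" prefix selects the speaker id; with context you would interleave prior text/audio turns
inputs = processor("[0]Hello from the Transformers release notes!", add_special_tokens=True, return_tensors="pt").to(model.device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "output.wav")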
BitNet

Trained on a corpus of 4 trillion tokens, BitNet demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
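A hedged loading sketch; the checkpoint id is an assumption, and no extra quantization config is needed since the weights are natively quantized:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Native 1-bit LLMs are", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))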
LlamaGuard
Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by a model. It is a dense 12B model pruned from the Llama 4 Scout model, and it can run on a single GPU (24 GB of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model and generated responses are reviewed afterwards for safety. It also understands multiple languages.
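A hedged text-only moderation sketch; the gated checkpoint id and the Llama4ForConditionalGeneration class are assumptions:

from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-Guard-4-12B"  # assumed (gated) checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": [{"type": "text", "text": "How do I bake a chocolate cake?"}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=10)
# the model replies with "safe" or "unsafe" plus the violated category codes
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))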
TimesFM

TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and autoregressively predicts output patches of a fixed length.
The abstract from the paper is the following:
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
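A hedged forecasting sketch; the class name, checkpoint id, and argument names (past_values, freq) are assumptions taken from the TimesFM port and may differ, so consult the model documentation:

import torch
from transformers import TimesFmModelForPrediction

model = TimesFmModelForPrediction.from_pretrained("google/timesfm-2.0-500m-pytorch")  # assumed checkpoint id

# three univariate context windows of different lengths; freq encodes the temporal granularity (0 = high frequency)
past_values = [torch.linspace(0, 20, 128), torch.sin(torch.linspace(0, 20, 256)), torch.rand(64)]
freq = torch.tensor([0, 1, 2], dtype=torch.long)

with torch.no_grad():
    outputs = model(past_values=past_values, freq=freq, return_dict=True)
print(outputs.mean_predictions.shape)  # (batch, prediction_length)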
MLCD

The MLCD models were released by the DeepGlint-AI team in unicom, a project that focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION-400M and COYO-700M, and that employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used as the vision tower for multimodal large language models, such as LLaVA.
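A hedged feature-extraction sketch; the MLCDVisionModel class, the processor choice, and the checkpoint id are assumptions:

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

model_id = "DeepGlint-AI/mlcd-vit-bigG-patch14-448"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = MLCDVisionModel.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # patch-level features for a downstream multimodal LLM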
Janus

The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by the DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can generate both image and text outputs, and it can also take both images and text as input.
Note
The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
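A hedged sketch of that switch; the class names, checkpoint id, and the generation_mode argument are assumptions based on the model documentation:

from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"  # assumed checkpoint id
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, generation_mode="text", tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# pass generation_mode="image" instead to produce image tokens that can be decoded into a picture
out = model.generate(**inputs, generation_mode="text", max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))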
The abstract from the original paper is the following:
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
InternVL
The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
The abstract from the paper is the following:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Overview of the InternVL3 model architecture, which is the same as InternVL2.5. Taken from the original checkpoint.
Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.
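A hedged image-text-to-text sketch; the checkpoint id is an assumption (any Transformers-format InternVL3 checkpoint should work):

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "OpenGVLab/InternVL3-1B-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))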
Kernel integration
We integrate some kernels in the transformers library via the kernels package: https://github.com/huggingface/kernels
We are starting with some kernels in the Llama model and iterating to identify the best performance optimizations.
- Llama Kernel integration by @MekkCyber in #37092
- [kernels] use original forward at compile time by @gante in #37604
TP support
In the previous release, we added TP support in order to run distributed inference. However, it is not yet supported for all quantization methods; we are progressively adding support. Right now, only compressed-tensors, fp8, and fp8-fbgemm support it.
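A minimal sketch of TP inference with a supported checkpoint (the model id is only an example); tp_plan="auto" shards supported layers across the processes launched with torchrun:

# launch with: torchrun --nproc-per-node 4 run_tp.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example id; fp8 / compressed-tensors checkpoints also work
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, tp_plan="auto")

inputs = tokenizer("Tensor parallelism shards the weights so that", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))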
- Attention Quantization with FBGemm & TP by @MekkCyber in #37384
- Restrict & Explain tp_plan for FBgemm by @MekkCyber in #37404
Quantization
AutoRound
From the AutoRound contributors:
AutoRound is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision. It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps ... More details here: https://github.com/intel/auto-round
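A minimal sketch of loading a checkpoint that was already quantized with AutoRound; the repo id is hypothetical and the auto-round package is assumed to be installed:

# pip install auto-round
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/Qwen2.5-1.5B-Instruct-int4-auto-round"  # hypothetical pre-quantized repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("AutoRound keeps accuracy at low bit-width by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))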
- Add AutoRound quantization support by @wenhuach21 in #37393
Quantization Documentation
We have added two new sections to help you better understand quantization and get started with it:
- Add "selecting a quantization method" doc by @DerekLiu35 in #37159
- Update quantization docs by @DerekLiu35 in #37439
GGUF
We've added GGUF support to the Gemma3 family of models.
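A minimal loading sketch; the repo and file names are placeholders, so substitute a real Gemma3 GGUF artifact:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "google/gemma-3-1b-it-qat-q4_0-gguf"  # placeholder repo id
gguf_file = "gemma-3-1b-it-q4_0.gguf"           # placeholder file name

# the GGUF tensors are dequantized and loaded into the regular Gemma3 architecture
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)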
- Add GGUF support to Gemma3 Text backbone by @Isotr0py in #37424
- Support loading Gemma3 QAT GGUF models by @Isotr0py in #37649
Fast image processors
Most vision models and VLMs in Transformers can now benefit from fast image processors. By utilizing torch/torchvision functional transforms, these processors offer a substantial speedup over PIL/NumPy-based processing and support running on both CPU and CUDA.
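A minimal sketch of opting in; use_fast=True selects the fast processor class when one exists, and the device argument on the call (assumed here) lets preprocessing run on GPU:

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = processor(images=image, return_tensors="pt", device=device)  # preprocessing runs on `device`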
- See the list of updated models: #36978
- Learn more about fast image processors: Fast Image Processors
- Add Fast Image Processor for Perceiver by @rootonchair in #37176
- Add Fast Image Processor for Flava by @rootonchair in #37135
- Add Fast Image Processor for LayoutLMv2 by @rootonchair in #37203
- Add Fast Image Processor for LayoutLMv3 by @rootonchair in #37201
- Add Fast Image Processor for Donut by @rootonchair in #37081
- Add Fast LeViT Processor by @keetrap in #37154
- Add Fast Mobilenet-V2 Processor by @keetrap in #37113
- Add Fast owlvit Processor by @keetrap in #37164
- Add ImageProcessorFast to BiT processor by @Yann-CV in #37180
- Add Fast Yolos Processor by @keetrap in #37292
- Add Fast Chinese-CLIP Processor by @keetrap in #37012
- Add Fast Conditional-DETR Processor by @keetrap in #37071
- Fix broken add-fast-image-processor CLI by @yonigozlan in #37499
- Bridgetower fast image processor by @rootonchair in #37373
- Add Fast Grounding-Dino Processor by @keetrap in #37108
- Add Fast PVT Processor by @keetrap in #37204
- Add Fast Image Processor for PoolFormer by @rootonchair in #37182
- Add Fast Image Processor for MobileNetV1 by @dmdaksh in #37111
- Fast image processor for VitMatte added and bug in slow version fixed by @henrikm11 in #37616
- [Fast Processor] BEiT by @ariG23498 in #37005
- Add Swin2SR ImageProcessorFast by @thisisiron in #37169
- Add Fast Image Processor for vilt by @devxaitist in #37304
AutoDocstring
The new @auto_docstring decorator makes it easier to add proper documentation when contributing a model without bloating the modeling code:
- [AutoDocstring] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in #33771
- More info on how to use @auto_docstring: AutoDocstring
Custom generate
We now support loading custom generate methods through model.generate. These custom generate methods can be stored on the Hub, enabling quick distribution of experiments regarding new caches, decoding methods, heuristics, ...
from transformers import AutoModelForCausalLM, AutoTokenizer
# `generate` with `custom_generate` -> `generate` uses custom code
# note: calling the custom method prints "✨ using a custom generation method ✨"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")
inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))
You can find the docs here, and all custom generation methods by searching for the custom_generate tag.
Chat CLI
The transformers-cli command is updated to be simpler and cleaner, specifically for its chat variant.
The following is now possible and recommended:
transformers chat Qwen/Qwen2.5-3B-Instruct
Additionally, almost any generate flag, present and future, can now be passed as a positional argument, rather than being limited to a set of hardcoded flags. For example:
transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
- Transformers cli clean command by @LysandreJik in #37657
- [chat] clean code and add base help by @gante in #37892
- [chat] generate parameterization powered by GenerationConfig and UX-related changes by @gante in #38047
Breaking changes
- 🚨 rm already deprecated pad_to_max_length arg by @itazap in #37617
- 🚨🚨🚨 Fix forward of Dinov2ForImageClassification for models with registers by @psandovalsegura in #37836
- 🔴 [VLM] Add base model without head by @zucchini-nlp in #37033
- 🔴 Video processors as a separate class by @zucchini-nlp in #35206
- 🚨🚨 Allow saving and loading multiple "raw" chat template files by @Rocketknight1 in #36588
- 🔴 Update CLIP vision attention to new attention interface by @molbap in #37498
- 🚨🚨 Setup -> setupclass conversion by @Rocketknight1 in #37282
Deprecations
The agents folder is finally removed from transformers in favour of using smolagents.
We are moving away from torch 2.0, as it was released more than two years ago.
General bugfixes and improvements
- fix flex attn when optional args aren't passed by @winglian in #37327
- fix llama4 training by @hiyouga in #37319
- Fix deepspeed with quantization by @Cyrilvallez in #37324
- Fix init empty weights without accelerate by @Cyrilvallez in #37337
- Use Python 3.9 syntax in examples by @cyyever in #37279
- Fix torchao usage by @jiqing-feng in #37034
- enable 2 llama UT cases on xpu by @yao-matrix in #37126
- Avoid build crashes when torch.version.xpu doesn't exist and fix Llama4 processor tests by @Rocketknight1 in #37346
- fix derived berts _init_weights by @Cyrilvallez in #37341
- Update translation template by @stevhliu in #37294
- Remove HQQ from caching allocator warmup by @Cyrilvallez in #37347
- updated model card for Mistral by @NahieliV in #37156
- Update model-card for DINOv2 by @shubham0204 in #37104
- Update falcon mamba card by @ricalanis in #37253
- Update Model card for GPT2 by @ash-01xor in #37101
- Improvements in Gemma2 model card by @devesh-2002 in #37076
- Update Model Card for Jamba by @ParagEkbote in #37152
- Add bnb to the list of supported quantization methods for LLama4 by @MekkCyber in #37348
- Updated Model-card for donut by @Logeswaran7 in #37290
- Remove unnecessary attr assignment by @tugsbayasgalan in #36837
- more fixes for post-training llama4 by @winglian in #37329
- Fixing flex attention for torch=2.6.0 by @SalmanMohammadi in #37285
- Multiple llama4 fixe by @ArthurZucker in #37353
- Expose blip2qformer by @alex-jw-brooks in #37254
- convert float for yarn related arguments in rope_scaling by @bzantium in #37139
- Use Python 3.9 syntax in tests by @cyyever in #37343
- A bit of cleaning 🧹🧹 by @Cyrilvallez in #37215
- fix deepspeed job by @ydshieh in #37284
- Set vision config to None for Gemma 1B conversion by @RyanMullins in #37366
- [llama 4] dynamic rope decorator by @gante in #37365
- Skip non-selected experts for mixtral and qwen2_moe by @Coco58323 in #32429
- [core] remove GenerationMixin inheritance by default in PreTrainedModel by @gante in #37173
- prune LM Head for USD by @jmamou in #36695
- fix(qwen): fix shape error when using tp by @KimmiShi in #36947
- Preserve requires_grad in pre quantized model by @jerryzh168 in #37354
- Update composition flag usage by @zucchini-nlp in #36263
- fix: llama4 conversion script no_rope_layers by @jmkuebler in #37359
- update deepspeed docker by @SunMarc in #37371
- Fix warning message for PEFT models in text-generation pipeline #36783 by @falconlee236 in #36887
- Apply torchfix to replace deprecated functions: _pytree._register_pytree_node and torch.cpu.amp.autocast by @bzhong-solink in #37372
- Fix some failing AWQ tests by @DerekLiu35 in #37383
- the fix that did not get in by @ArthurZucker in #37370
- handle torch version edge cases by @winglian in #37399
- Add warning when failed to acquire other user's lock at model download by @manueldeprada in #37395
- Handle torch ver in flexattn by @Kh4L in #37400
- Fix Llama4 offset by @Cyrilvallez in #37414
- Offloaded hybrid cache for Llama4 by @Cyrilvallez in #37401
- mark llama4 as not supported with fa2 by @winglian in #37416
- update kernels to 0.4.3 by @ArthurZucker in #37419
- Send trainer/fsdp/deepspeed CI job reports to a single channel by @ydshieh in #37411
- from_pretrained should handle xpu case by @sywangyi in #37382
- Allow rocm systems to run these tests by @ivarflakstad in #37278
- use rms_norm_eps for the L2Norm for Llama4 by @ArthurZucker in #37418
- [chat-template] Unify tests and clean up 🧼 by @zucchini-nlp in #37275
- Fix new failure reports not including anything other than tests/models/ by @ydshieh in #37415
- Quark Quantization gated repo by @MekkCyber in #37412
- Add image classifier donut & update loss calculation for all swins by @eljandoubi in #37224
- Correctly drop tokens in SwitchTransformer by @mario-aws in #37123
- Fix require_read_token by @MekkCyber in #37422
- fix: use mtime by default in Trainer._rotate_checkpoints with automatic fallback by @Jerry-Terrasse in #37260
- (Part 2) feat: allow for tp_size attr for tplizing the model by @kmehant in #37054
- Adding to self_comment_ci.yml by @MekkCyber in #37426
- [Feat] Support npu in modeling models by @duanjunwen in #37369
- Remove old code for PyTorch, Accelerator and tokenizers by @cyyever in #37234
- enhance require_deterministic_for_xpu by @yao-matrix in #37437
- Fixes: Corrects file path for CUDA kernels by @DonggeunYu in #37438
- Simplify soft dependencies and update the dummy-creation process by @LysandreJik in #36827
- Update-kernel-pin by @ArthurZucker in #37448
- Add moe kernels by @ArthurZucker in #37376
- Fix the test fetcher by @LysandreJik in #37452
- Remove triton mlp kernel, not compiling for some models by @MekkCyber in #37449
- [processor] clean up mulitmodal tests by @zucchini-nlp in #37362
- [Regression] Fix Quark quantized model loading after refactorization by @BowenBao in #37407
- prevent creating a view/leaf param for low rank optimizers w FSDP by @winglian in #37379
- Disable kernels for quantization by @MekkCyber in #37446
- Add weights_only=True to torch.load by @cyyever in #37062
- Add XPU case to is_torch_bf16_gpu_available by @cyyever in #37132
- nit: typing use Llama4TextConfig instead of Llama4Config by @kmehant in #37430
- Delete hubconf.py by @Rocketknight1 in #37455
- Fix typing issues with SigLip2 by @EricWiener in #37356
- fix: (llama4) fix no_split_modules to be picked up for fsdpv1 and v2 sharding by @kmehant in #37462
- make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM by @yao-matrix in #37480
- Remove fsspec dependency which isn't directly used by transformers by @cyyever in #37318
- Fix tests failed with gated repos. by @ydshieh in #37484
- [ci] fix doc builder by @zucchini-nlp in #37489
- Fixed broken links by @cypherpepe in #37466
- Detect and fix most _init_weights() issues - make it work for composite models by @Cyrilvallez in #37070
- [bug] deprecated deta load_cuda_kernel, MultiScaleDeformableAttention by @chagmgang in #37443
- Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 by @flukeskywalker in #37381
- Fix wrong argparse type in modular checker script by @seven-mile in #37472
- Fixing gated repo issues by @MekkCyber in #37463
- [qwen-omni] fix processor by @zucchini-nlp in #37493
- Remove deprecation warning for num_logits_to_keep by @Cyrilvallez in #37149
- Don't auto-assign reviewers when the author is in HF by @Rocketknight1 in #37500
- Detect and use device context manager or global device in from_pretrained by @Cyrilvallez in #37216
- Change default value of attn_temperature_tuning by @gmlwns2000 in #37501
- Llama4: remove redundant transpose of router_logits by @pbelevich in #37468
- fix: Restore explicit error surfacing for unexpected hub exceptions by @manueldeprada in #37525
- Fix missing return type for MLCD docs by @qubvel in #37527
- fix and enhance pipeline_webserver.md by @yao-matrix in #36992
- VDR task guide by @merveenoyan in #37485
- Update VITS model card by @princepride in #37335
- Refactor ColPali model documentation by @Soum-Soum in #37309
- enable 5 cases on XPU by @yao-matrix in #37507
- enable several cases on XPU by @yao-matrix in #37516
- enable test_offloaded_cache_implementation on XPU by @yao-matrix in #37514
- Fix BitsAndBytesConfig JSON serialization in TrainingArguments by @astefanutti in #37520
- enable 3 mpt test cases on XPU by @yao-matrix in #37546
- enable 6 rt_detr_v2 cases on xpu by @yao-matrix in #37548
- More appropriate cuda warmup in resource-constrained hardware by @Cyrilvallez in #37550
- Fixes hqq by following a new path for bias parameter in pre_quantized models by @MekkCyber in #37530
- convert scale and zero to cuda when using HQQ backend by @phymhan in #37425
- Keep Quark loading through meta device by @BowenBao in #37538
- Refactor torchao docs by @MekkCyber in #37490
- add FlashAttentionKwargs and seq_idx to flat collator by @garrett361 in #36456
- docs(typo): Update ISSUES.md, fix a small typo by @ in #37542
- Fix device issue for tapas (with as_tensor) by @ydshieh in #37551
- Make Ignored Columns ValueError More Informative by @wbuchanan in #33299
- Fix TimesFm doc issue by @Cyrilvallez in #37552
- Run test_can_load_with_global_device_set using a subprocess by @ydshieh in #37553
- Fix pixel attention mask padding in smolvlm by @ManuelFay in #37497
- [vlm] adjust max length for special tokens by @zucchini-nlp in #37342
- Add EfficientNet Image PreProcessor by @zshn25 in #37055
- Fix Mamba2 Grouped SSD Support in the torch_forward Path by @cyang49 in #37533
- All models can be initialized on meta device by @Cyrilvallez in #37563
- [chat template] fix security vulnerability by @zucchini-nlp in #37523
- [qwen-vl] Standardize config by @zucchini-nlp in #37268
- [TimesFM] use the main revison instead of revision for integration test by @kashif in #37558
- Fix qwen2audio wanr -> warn by @alex-jw-brooks in #37559
- Small fix on context manager detection by @Cyrilvallez in #37562
- [phi4] update conversion by @zucchini-nlp in #37579
- docs: fix typo by @tonyksong in #37567
- Ensure positive warm-up size by @Cyrilvallez in #37581
- Update Phi4 converter by @Cyrilvallez in #37594
- Fix Quark quantization config by @MekkCyber in #37578
- Gaudi: Add the bf16 support for hpu by @yuanwu2017 in #37568
- Fix some GPU OOM after #37553 by @ydshieh in #37591
- remove _run_third_party_device_tests by @jiqing-feng in #37445
- [Bugfix] Fix flash-attention func param mismatch and softmax_scale default value mistake on Ascend NPU by @FightingZhen in #37575
- Flag SpeechT5 flaky test by @molbap in #37587
- enable 6 gemma2 cases on XPU by @yao-matrix in #37564
- enable 6 modeling cases on XPU by @yao-matrix in #37571
- [Gemma3] compile ✨ by @gante in #37447
- Model debugger upgrades by @molbap in #37391
- [VLMs] use only xxx_token_id for multimodal tokens by @zucchini-nlp in #37573
- fix 2 encoder_decoder issues on XPU by @yao-matrix in #37572
- fix issue that some example with no trainer use accelerator.end_train… by @we1559 in #37435
- Deprecate modeling_utils.py classes by @qubvel in #37298
- Fixing the example in generation strategy doc by @jeasinema in #37598
- chore: update model card for SigLIP by @saswatmeher in #37585
- Fix InternVL attention when using qk_norm (38B and 78B) by @yonigozlan in #37620
- Remove torchvision requirement from AutoImageProcessor by @LysandreJik in #37457
- Allow Exclusion of Input IDs from RepetitionPenaltyLogitsProcessor by @alex-jw-brooks in #37625
- fix link in kv_cache.md by @manueldeprada in #37652
- Update longformer.md by @JihadHammoud02 in #37622
- Refactor phi doc by @JihadHammoud02 in #37583
- Fix Qwen2.5-Omni get_chunked_index chunking functionality by @imkero in #37631
- [fix] make legacy bnb code work by @cyr0930 in #37331
- [fix gemma] Set default value for output_attentions parameter in Gemma2 and Gemma… by @chenin-wang in #37633
- Restructure torchao quantization examples by @jerryzh168 in #37592
- Add test to ensure unknown exceptions reraising in utils/hub.py::cached_files() by @manueldeprada in #37651
- [test] update test_past_key_values_format by @gante in #37614
- [tests] Stricter generate + compilation test -- no recompilations allowed by @gante in #37629
- Fix ValueError when eval_do_concat_batches=False with examples by @jeffhataws in #37621
- Fixes #37219 : RecurrentGemma crashes for inputs longer than sliding window length by @manueldeprada in #37613
- Introduce GradientCheckpointingLayer by @qubvel in #37223
- [qwen-omni] fix training by @zucchini-nlp in #37517
- Fix duplicated weights in fp8 quantization by @Cyrilvallez in #37667
- Correct warm-up with fp8 by @Cyrilvallez in #37670
- Fixing quantization tests by @MekkCyber in #37650
- Fix autoround docs by @SunMarc in #37675
- Fix no_split_modules for Llama4 pretrained models by @astefanutti in #37673
- Refactor bitsandbytes doc by @MekkCyber in #37668
- enable mllama cases on xpu by @yao-matrix in #37644
- enable 6 granite cases on xpu by @yao-matrix in #37569
- [cleanup] remove old scripts in /scripts 🧹 🧹 by @gante in #37676
- [docs] only build en docs in push CI by @gante in #37677
- typo update in the parameter name by @LunaticMaestro in #37655
- [Docs] Move models to appropriate section by @NielsRogge in #37338
- Add counters for dataset classes by @jiangyukunok in #37636
- enable blip2 and emu3 cases on XPU by @yao-matrix in #37662
- 🌐 [i18n-KO] Translated siglip.md to Korean by @devxaitist in #37145
- Updated model card for mbart and mbart50 by @Vishesh-Mistry in #37619
- fix: remove classmethod from Qwen2_5OmniConfig.get_text_config by @shahruk10 in #37690
- enable cpu offloading for Bark on xpu by @yao-matrix in #37599
- Pin torch == 2.6 on PR CI docker images for now by @ydshieh in #37695
- [cleanup] remove /model_cards 🧹 🧹 by @gante in #37685
- Add maintainers for ROCm/Intel XPU/Ascend NPU by @Rocketknight1 in #37678
- [CI] add back sacrebleu (and document why) by @gante in #37700
- TransfoXL is deprecated, don't keep it in tested examples! by @Rocketknight1 in #37707
- [internvl] fix chat template by @zucchini-nlp in #37656
- Qwen 2.5 Omni: apply video defaults by @pcuenca in #37660
- [tests, qwen2_5_omni] fix flaky tests by @gante in #37721
- Process inputs directly in apply_chat_template in image-text-to-text pipeline by @yonigozlan in #35616
- enable 4 test_trainer cases on XPU by @yao-matrix in #37645
- Fix Aria tests by @jiqing-feng in #37444
- Fix inference bugs in Qwen2.5 Omni by @BakerBunker in #37701
- Fix torchao doc examples by @MekkCyber in #37697
- [tests] fix test_nemotron_8b_generation_sdpa by @faaany in #37665
- Make sure torch_is_available before using torch.distributed by @MekkCyber in #37693
- [VLMs] fix flash-attention tests by @zucchini-nlp in #37603
- fix: learning_rate logged as tensor causing save issue with deepspeed by @NanoCode012 in #37704
- Fix embeds_to_talker device in Qwen2.5-Omni by @BakerBunker in #37739
- Correctly raise errors when downloading tokenizer files by @Cyrilvallez in #37740
- [performance_optim] define flash attention mask on NPU device directly by @FightingZhen in #37698
- Skip all AriaForConditionalGenerationIntegrationTest on T4 by @ydshieh in #37746
- Update MllamaForConditionalGenerationIntegrationTest by @ydshieh in #37750
- Expand quantized data type support for tensor parallelism by @amd-xiaoyu12 in #37719
- [cache] fix HybridCache init when device is passed by @gante in #37718
- GPT2Model StaticCache support by @poedator in #35761
- [generate] skip compilation on cpu offload by @gante in #37709
- updated hidden_features for FlaxDinov2SwiGLUFFN in Dinov2 by @premmurugan229 in #37747
- Fix qwen2_5 get_rope_index tensor device locations by @rphmeier in #37597
- [generate] fix default autocompile case on gpu by @gante in #37756
- Fix wrong input shapes in doc-string of models by @kkew3 in #37729
- Refine parameter type annotations by @flashJd in #37666
- Fix tied weight loading with TP and loading sub state_dicts by @Cyrilvallez in #37758
- Fix load of rng state for resuming training from checkpoint by @winglian in #37162
- Fix typos in comments by @co63oc in #37694
- [deps] pin max torch version by @gante in #37760
- Guard DeepSpeed imports by @lewtun in #37755
- Fix auto-round hfoption by @MekkCyber in #37759
- Update model card for Gemma by @afafelwafi in #37674
- 🌐 [i18n-KO] Translated roberta.md to Korean by @garongkim in #37069
- [causal mask] fix preparation with multi-gpu by @zucchini-nlp in #37612
- unpin pytest<8 by @ydshieh in #37768
- Align gpt2 mask preparation to #37612 by @Cyrilvallez in #37787
- Fix typos in strings and comments by @co63oc in #37784
- Fix tensor parallel with non-floating dtypes by @Cyrilvallez in #37790
- Force torch>=2.6 with torch.load to avoid vulnerability issue by @Cyrilvallez in #37785
- fix mpt test of different outputs from cuda by @jiqing-feng in #37691
- [i18n-KO] Translated keypoint_detection.md to Korean by @rlaalsrl0922 in #36649
- chore: update SigLIP2 model card by @saswatmeher in #37624
- fix performance issue in convert_ids_to_tokens by @martin-harmonic in #37773
- Fix error message in hub.py by @srai9 in #37796
- Gemma3 is Torch Exportable by @guangy10 in #37728
- Fix the fsdp config cannot work issue. by @yuanwu2017 in #37549
- Define warmup allocator for torchao quantization by @MekkCyber in #37764
- Fix typos in strings and comments by @co63oc in #37799
- [doc] fix the code examples in qwen doc by @jiangyukunok in #37803
- Fix: Correct tensor shape comment in Mamba modeling by @ShadyPi in #37801
- [RT-DETR] Improve docs by @NielsRogge in #37814
- FIX: Faulty PEFT tests by @BenjaminBossan in #37757
- Add Optional to remaining types by @cyyever in #37808
- Fix error of HPU TP by @yuanwu2017 in #37782
- change XLA deprecated api by @SunMarc in #37741
- [config] revert #37603 by @zucchini-nlp in #37821
- [modular] Fix the prefix-based renaming if the old and new model share a common name suffix by @Cyrilvallez in #37829
- [tests] fix flaky pattern in test_generate_continue_from_past_key_values by @gante in #37724
- [tests] reorganize cache tests and clean memory between tests by @gante in #37684
- Revert change that breaks on Torch 2.1 by @Rocketknight1 in #37531
- Fix check of unecessary packages (issue #37626) by @HichTala in #37825
- Fix cache get item return type hints by @ChengLyu in #37847
- Fix Bitnet tokenizer in pipeline by @MekkCyber in #37861
- docs: Details for ambigious channel dimension assignment by @yaner-here in #37600
- Processor chat template: pass custom kwargs by @pcuenca in #37852
- Add Intel Gaudi doc by @regisss in #37855
- 🌐 [i18n-KO] Translated electra.md to Korean by @Kim-Ju-won in #36763
- Update modeling_llama4.py by @monk1337 in #37841
- Skip is_flaky tests in the CI by @Rocketknight1 in #37723
- Allow override inputs to export recipe by @guangy10 in #37508
- enable internvl UTs on XPU by @yao-matrix in #37779
- Llama Guard updates by @pcuenca in #37872
- update Clean_up_tokenization_spaces typos. by @zhanluxianshen in #37865
- fix error for _register_pytree_node in torch2.1.0 and fix bf16 assertion in xpu and npu by @jiaqiw09 in #37839
- make sure lr is not a tensor by @winglian in #37881
- Fix qwen2-vl-docs. by @zhanluxianshen in #37879
- uniformize kwargs for VisionTextDualEncoder by @tibor-reiss in #34563
- Fix: reassign in qwen3 moe model by @linkedlist771 in #37848
- update comment in image_processing_base.py to reference image_process… by @arjunaskykok in #37864
- Support FlaxPreTrainedModel to load model checkpoint from local subfolder safetensors by @Melody-coder923 in #37732
- [tests] Test all cache implementations by @gante in #37873
- [tests] reset logs in torch.compile test by @gante in #37894
- Fix Qwen3 tp plan with FP8 by @MekkCyber in #37871
- Enhance documentation to explain chat-based few-shot prompting by @MostHumble in #37828
- Support AOPerModuleConfig and include_embedding by @jerryzh168 in #37802
- fixed gemma3 collection path pointing to llama 2 collection. by @dmgcsilva in #37899
- Fix typos in strings and comments by @co63oc in #37910
- Improve performance of load_state_dict by @woct0rdho in #37902
- 🌐 [i18n-KO] Translated gpu_selection.md to Korean by @nsbg in #36757
- Add usage example for DINOv2 by @baldassarreFe in #37398
- Aligning modling code for GPT2 to work with vLLM (fallback) by @ariG23498 in #36934
- Break weight tying when quantizing input embedding by @jerryzh168 in #37905
- [docs] logits docstring by @gante in #37929
- [D-FINE] Update names by @NielsRogge in #37957
- More fault tolerant notification service by @ivarflakstad in #37924
- [core] reuse unused reserved cuda memory when loading models by @gante in #37920
- Use T4 single GPU runner with more CPU RAM by @ydshieh in #37961
- [generate] Fix vocab_size access for multimodal models by @kurzdev in #37937
- Fix incorrect type annotation in get_auxiliary_logits by @Tanuj-rai in #37955
- [Ready to Merge][HFQuantizer] Squelch pydantic warnings by @kylesayrs in #37726
- Add GraniteMoeHybrid support for 4.0 by @Ssukriti in #37658
- add xpu memory check by @faaany in #37969
- [tests] Smaller model in slow cache tests by @gante in #37922
- [llava] one pixel is missing from padding when length is odd by @cyr0930 in #37819
- add job links to new model failure report by @ydshieh in #37973
- fix docs serving typos. by @zhanluxianshen in #37936
- Small typo lines 47 and 199 perf_infer_gpu_one.md by @nlhmnlhmnlhm in #37938
- Fix typos by @omahs in #37978
- [speech2text] fix init of sinusoidal embeddings by @gante in #37931
- Fix typo by @lkm2835 in #37964
- enable xpu in test_trainer by @yao-matrix in #37774
- fix FSDP + torch.compile bug when saving pretrained model by @Joaquinecc in #37725
- Enable granite speech 3.3 tests by @alex-jw-brooks in #37560
- Fix donut backtracking by @Rocketknight1 in #37788
- Fix Qwen models export with torch 2.7 by @guangy10 in #37985
- [offload] respect max_memory argument when factoring in unused reserved memory by @gante in #37982
- make aya vision 5 integration tests pass on xpu by @yao-matrix in #37990
- [chat template] separate jinja logic from tokenizers by @zucchini-nlp in #37602
- remove duplicate code by @kaixuanliu in #37991
- Add a check to import_utils.py to allow for use of faiss_gpu installation by @Fiona-Waters in #37997
- [CSM] tiny fix on generation by @eustlb in #38001
- Fix pad image transform for batched inputs by @sebasv in #37544
- Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model by @uminaty in #37960
- Enable RUF013 to enforce optional typing by @cyyever in #37266
- Fix Optional typing by @qubvel in #38018
- Print commit SHA on slack message for new model notification. by @ydshieh in #38019
- [CI] remove duplicated message on GH comment to run slow tests by @gante in #37970
- [caches] Raise exception on offloaded static caches + multi device by @gante in #37974
- Skip test_push_to_hub_with_saves_each_epoch for now by @ydshieh in #38022
- Fix incorrect installation instructions (for issue #37476) by @Zephyr271828 in #37640
- Fix wording in torchscript.md by @Madghostek in #38004
- [VLMs] support attention backends by @zucchini-nlp in #37576
- make test_speculative_decoding_non_distil device-agnostic by @faaany in #38010
- enable mamba2 integration cases on xpu by @yao-matrix in #38006
- update bnb tests by @jiqing-feng in #38011
- [AutoDocstring] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in #33771
- fix document masking for chunked attention by @winglian in #37429
- make mistral3 pass on xpu by @yao-matrix in #37882
- enable utils test cases on XPU by @yao-matrix in #38005
- [Temporary] Log some information in some pytest/pluggy internal places by @ydshieh in #37996
- Trigger CircleCI via GitHub Actions when "ready for review" by @ydshieh in #37885
- Disable "Trigger CircleCI via GitHub Actions when ready for review" by @ydshieh in #38038
- Do not erase a cache_position passed explicitly to generate(), if there is one by @FremyCompany in #37986
- Support for version spec in requires & arbitrary mismatching depths across folders by @LysandreJik in #37854
- Re-Enable "Trigger CircleCI via GitHub Actions when ready for review" by @ydshieh in #37885
- Fix reduce-labels in BEIT Fast Image Processor by @simonreise in #38042
- Fix cache update! by @Cyrilvallez in #38046
- Fix linalg.norm for CovnNextV2 by @qubvel in #38015
- enable generation fsdp/utils cases on XPU by @yao-matrix in #38009
- fix(conversion): Fix size mismatch error during TF->PT model loading by @arjunaskykok in #38014
- [VLM] fix loading issues by @zucchini-nlp in #38051
- Fix OneFormer integration test by @qubvel in #38016
- Add AMD expectation to test_gpt2_sample by @ivarflakstad in #38079
- docs: fix md style by @imba-tjd in #38057
- Fix mt5 test on AMD devices by @ivarflakstad in #38081
- chore(qwen2): display warning log only when sliding window attention … by @edwardzjl in #36316
- fix the inconsist docstring in apply_chat_template by @lenijwp in #38069
- Fix tot update in trainer by @efsotr in #37923
- update seed_worker to set seed based on worker_id and rank by @gathierry in #37980
- uninstall kernels from docker images by @ydshieh in #38083
- Refactor image processor phi4 by @yonigozlan in #36976
- update require_read_token by @ydshieh in #38093
- add timeout for downloading the librispeech_asr dataset by @faaany in #38073
- fix: Propagate lr_scheduler_kwargs options to create LR Scheduler when LayerWiseDummyOptimizer is used by @BlackNoodle in #34559
- Disable report callbacks for certain training tests by @ivarflakstad in #38088
- [smolvlm] skip the test by @zucchini-nlp in #38099
- Fix bug in prefill_chunk_size that ignores disable_compile flag by @xmarva in #38067
- Fix past_key_values type hint in model output types by @ChengLyu in #37953
- [bug] fix llava processor to calculate unpadding size correctly by @cyr0930 in #37988
- fix check_bad_commit.py gives wrong results by @ydshieh in #38107
- Fix InternVL interpolate_pos_encoding and add to video_processing_auto by @yonigozlan in #38092
- [CSM] update test for t4 runners by @eustlb in #38110
- Add style bot by @SunMarc in #38102
- Fix description and formatting errors in code docs by @bilibili12433014 in #38074
- enable finegrained_fp8 and granite_speech cases on XPU by @yao-matrix in #38036
- [video processor] fix tests by @zucchini-nlp in #38104
- Fix temporal padding in Qwen2VLImageProcessor when the number of frames is not divisible by temporal_patch_size by @ritwickchaudhry in #38076
- Fix auto batch size finder test by @ivarflakstad in #38125
- Add config validation and style tweaks by @Kirire in #37589
- Update trainer.md by @guspuffygit in #38113
- [docs] add uv installation instructions for source builds by @arjunaskykok in #37968
- Add manueldeprada to run_slow whitelist by @manueldeprada in #38126
- enable d_fine finetuning properly by @SangbumChoi in #37962
- Fix incorrect attention mask truncate in WhisperFlashAttention2 by @OliBomby in #36477
- [Qwen3] Qwen3 MoE add tp plan for expert mlps by @hgt312 in #38135
- enable csm integration cases on xpu, all passed by @yao-matrix in #38140
- Remove head mask in generative models by @zucchini-nlp in #35786
- Hotfix: Flash Attention 2 support in Pixtral by @uminaty in #38146
- enable trainer test cases on xpu by @yao-matrix in #38138
- disable deepspeed when setting up fake trainer by @winglian in #38101
- Omit creation of positional IDs within ESM if applicable by @simonlevine in #38089
- [FIX] Save speed metrics to logs by @pavelgein in #38136
- enable autoround cases on XPU by @yao-matrix in #38167
- Include output embedding as well with include_embedding flag by @jerryzh168 in #37935
- Fix Qwen2.5 Omni SinusoidsPositionEmbedding precision by @BakerBunker in #38151
- Add optional RMSNorm support to BitNet quantization (config + layers) by @Codys12 in #38087
- [VLMs] add helpers to get multimodal encodings by @zucchini-nlp in #37743
- Bart: new cache format by @zucchini-nlp in #35314
- clean autoawq cases on xpu by @yao-matrix in #38163
- Disable "Trigger CircleCI by ready for review" by @ydshieh in #38171
- Disable "convert to draft" workflow by @ydshieh in #38177
- remove some commands from fetch_tests CircleCI job by @ydshieh in #38176
- Feat: add warnings for unused keys and rules in tensor parallel by @S1ro1 in #37893
- [ESM] Add flash-attention-2 backend for ESM-2 by @pstjohn in #38023
- Add args support for fast image processors by @yonigozlan in #37018
- Fix import torchao.prototype.low_bit_optim since torchao v0.11 by @baptxste in #38174
- fix bug in distributed loss test by @techkang in #38166
- [tests] remove test_sdpa_equivalence (redundant) by @gante in #37911
- Add Granite Speech Support by @alex-jw-brooks in #36801
- Add glm4 by @ArthurZucker in #37388
- Add Qwen2.5-Omni by @BakerBunker in #36752
- Add MLCD model by @tanhuajie in #36182
- Add TimesFM Time Series Forecasting Model by @jinan-zhou in #34082
- Add Janus model by @yaswanth19 in #36053
- Add InternVL (2.5 MPO) by @yonigozlan in #35968
- Add Bitnet model by @MekkCyber in #37742
- Samhq model addition by @sushmanthreddy in #35147
- Add D-FINE Model into Transformers by @VladOS95-cyber in #36261
- Add CSM model by @eustlb in #36719
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @cyyever
- Use Python 3.9 syntax in examples (#37279)
- Use Python 3.9 syntax in tests (#37343)
- Remove old code for PyTorch, Accelerator and tokenizers (#37234)
- Add weights_only=True to torch.load (#37062)
- Add XPU case to is_torch_bf16_gpu_available (#37132)
- Remove fsspec dependency which isn't directly used by transformers (#37318)
- Add Optional to remaining types (#37808)
- Enable RUF013 to enforce optional typing (#37266)
- @yao-matrix
- enable 2 llama UT cases on xpu (#37126)
- enhance require_deterministic_for_xpu (#37437)
- make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM (#37480)
- fix and enhance pipeline_webserver.md (#36992)
- enable 5 cases on XPU (#37507)
- enable several cases on XPU (#37516)
- enable test_offloaded_cache_implementation on XPU (#37514)
- enable 3 mpt test cases on XPU (#37546)
- enable 6 rt_detr_v2 cases on xpu (#37548)
- enable 6 gemma2 cases on XPU (#37564)
- enable 6 modeling cases on XPU (#37571)
- fix 2 encoder_decoder issues on XPU (#37572)
- enable mllama cases on xpu (#37644)
- enable 6 granite cases on xpu (#37569)
- enable blip2 and emu3 cases on XPU (#37662)
- enable cpu offloading for Bark on xpu (#37599)
- enable 4 test_trainer cases on XPU (#37645)
- enable internvl UTs on XPU (#37779)
- enable xpu in test_trainer (#37774)
- make aya vision 5 integration tests pass on xpu (#37990)
- enable mamba2 integration cases on xpu (#38006)
- make mistral3 pass on xpu (#37882)
- enable utils test cases on XPU (#38005)
- enable generation fsdp/utils cases on XPU (#38009)
- enable finegrained_fp8 and granite_speech cases on XPU (#38036)
- enable csm integration cases on xpu, all passed (#38140)
- enable trainer test cases on xpu (#38138)
- enable autoround cases on XPU (#38167)
- clean autoawq cases on xpu (#38163)
- @alex-jw-brooks
- @BakerBunker
- @rootonchair
- Add Fast Image Processor for Perceiver (#37176)
- Add Fast Image Processor for Flava (#37135)
- Add Fast Image Processor for LayoutLMv2 (#37203)
- Add Fast Image Processor for LayoutLMv3 (#37201)
- Add Fast Image Processor for Donut (#37081)
- Bridgetower fast image processor (#37373)
- Add Fast Image Processor for PoolFormer (#37182)
- @flukeskywalker
- Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 (#37381)
- @keetrap
- Add Fast LeViT Processor (#37154)
- Add Fast Mobilenet-V2 Processor (#37113)
- Add Fast owlvit Processor (#37164)
- Add Fast Yolos Processor (#37292)
- Add Fast Chinese-CLIP Processor (#37012)
- Add Fast Conditional-DETR Processor (#37071)
- Add Fast Grounding-Dino Processor (#37108)
- Add Fast PVT Processor (#37204)
- @tanhuajie
- Add MLCD model (#36182)
- @jinan-zhou
- Add TimesFM Time Series Forecasting Model (#34082)
- @yaswanth19
- Add Janus model (#36053)
- @saswatmeher
- @cyr0930
- @wenhuach21
- Add AutoRound quantization support (#37393)
- @devxaitist
- @co63oc
- @guangy10
- @sushmanthreddy
- Samhq model addition (#35147)
- @VladOS95-cyber
- Add D-FINE Model into Transformers (#36261)
- @Ssukriti
- Add GraniteMoeHybrid support for 4.0 (#37658)