17 Feb 15:19

a22a437

v4.49.0: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth Pro, RT-DETRv2, GPTQModel

New models

Helium

Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.

Add-helium by @ArthurZucker in #35669

Qwen2.5-VL

The Qwen2.5-VL model is an update to Qwen2-VL from the Qwen team, Alibaba Group.

The abstract from this update is the following:

Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.

add qwen2.5vl by @ShuaiBai623 in #35569

SuperGlue

The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.

This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

Add SuperGlue model by @sbucaille in #29886

Granite Vision Support

The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.

Granite Vision Support by @alex-jw-brooks in #35579

Zamba2

Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.

Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
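
To make the layer layout concrete, here is a minimal sketch of a schedule that places a shared transformer block after every 6 Mamba blocks. The function name and the string labels are illustrative assumptions, not the library's configuration format.

```python
def hybrid_layer_schedule(num_mamba_blocks: int, shared_every: int = 6) -> list[str]:
    """Return an illustrative layer layout: a shared transformer block
    follows every `shared_every` Mamba blocks."""
    schedule = []
    for i in range(1, num_mamba_blocks + 1):
        schedule.append("mamba")
        if i % shared_every == 0:
            # The same (weight-shared) transformer block is reused each time.
            schedule.append("shared_transformer")
    return schedule


layout = hybrid_layer_schedule(12)
print(layout.count("mamba"), layout.count("shared_transformer"))
```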

Add Zamba2 by @pglorio in #34517

GOT-OCR 2.0

GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.

Add GOT-OCR 2.0 to Transformers by @yonigozlan in #34721

DAB-DETR

DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.

Add DAB-DETR for object detection by @conditionedstimulus in #30803

Depth Pro

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

Add Apple's Depth-Pro for depth estimation by @geetu040 in #34583

RT-DETRv2

An improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP metrics on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.

Adding RTDETRv2 by @jadechoghari in #34773

Transformers-CLI

Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.

This feature exists in TRL and has been migrated to transformers for easier usage.

[Chat] Add Chat from TRL 🐈 by @gante in #35714

Processor Standardization

Work is ongoing to standardize the image processors so that their APIs are equivalent. Additionally, the processors are given a fast variant so that they are never blockers in the image processing pipelines.

In this release, several processors have been standardized and have gained a fast version.

OwlViT/Owlv2 post processing standardization by @qubvel in #34929
OmDet Turbo processor standardization by @qubvel in #34937
Grounding DINO Processor standardization by @qubvel in #34853
Refactoring of ImageProcessorFast by @yonigozlan in #35069
add Qwen2-VL image processor fast by @yonigozlan in #35733
Remove Multi-threaded image conversion for fast image processors by @yonigozlan in #36105

Breaking changes

DPT segmentation maps

DPT image processors did not support segmentation_maps, instead only requiring images. This has been fixed.
This adds an argument to the preprocess method, therefore users using arguments as positional arguments with that method may see changed behavior. We recommend using keyword arguments for such methods so as to not be bothered by the addition of new features.
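
The following sketch illustrates why positional calls are fragile when a parameter is inserted. Both functions are hypothetical stand-ins, not the real preprocess signature; only the existence of a new segmentation_maps argument comes from the note above.

```python
def preprocess_before(images, do_resize=None):
    """Hypothetical signature before the change."""
    return {"images": images, "segmentation_maps": None, "do_resize": do_resize}


def preprocess_after(images, segmentation_maps=None, do_resize=None):
    """Hypothetical signature after a new argument is inserted."""
    return {"images": images, "segmentation_maps": segmentation_maps, "do_resize": do_resize}


# Positional call: the second value silently lands in a different parameter.
old = preprocess_before("img", True)
new = preprocess_after("img", True)

# Keyword call: the meaning stays the same across both versions.
stable = preprocess_after("img", do_resize=True)
```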

🔴 🔴 🔴 Added segmentation maps support for DPT image processor by @simonreise in #34345

Image classification pipeline and single vs multi-label

The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.
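
As a reminder of what the two post-processing functions do, here is a small pure-Python sketch. The postprocess helper is hypothetical and is not the pipeline's code; it only shows the intended pairing of softmax with single-label problems and sigmoid with multi-label problems.

```python
import math


def softmax(logits):
    """Single-label: scores compete with each other and sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-label: each score is an independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def postprocess(logits, problem_type):
    # Hypothetical dispatch on the config's problem_type value.
    if problem_type == "multi_label_classification":
        return sigmoid(logits)
    return softmax(logits)


single = postprocess([2.0, 2.0, -1.0], "single_label_classification")
multi = postprocess([2.0, 2.0, -1.0], "multi_label_classification")
```

With softmax the two equally strong classes split the probability mass, whereas with sigmoid both can score high at the same time.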

🚨🚨🚨 image-classification pipeline single-label and multi-label prob type squashing fns (sigmoid vs softmax) are backwards by @rwightman in #35848

Fixing the LayerNorm beta/gamma renames

The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:

🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. by @rwightman in #35615
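
As a rough illustration of what scoping the rename to 'LayerNorm.' means, the sketch below renames only legacy LayerNorm keys and leaves other parameters named gamma or beta alone. The helper name and the example keys are hypothetical; this is not the library's loading code.

```python
def rename_legacy_layernorm_key(key: str) -> str:
    """Rename legacy `LayerNorm.gamma` / `LayerNorm.beta` keys only.

    Scoping the match to 'LayerNorm.' leaves unrelated parameters that
    happen to be called `gamma` or `beta` untouched.
    """
    if key.endswith("LayerNorm.gamma"):
        return key[: -len("gamma")] + "weight"
    if key.endswith("LayerNorm.beta"):
        return key[: -len("beta")] + "bias"
    return key


state_dict_keys = [
    "encoder.layer.0.output.LayerNorm.gamma",
    "encoder.layer.0.output.LayerNorm.beta",
    "blocks.0.layer_scale.gamma",  # not a LayerNorm parameter
]
renamed = [rename_legacy_layernorm_key(k) for k in state_dict_keys]
```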

VLM cleanup

The ignore_index property of the llava configuration has been removed as it was not serving a purpose.

🔴 VLM: compile compatibility by @zucchini-nlp in #35724

Quantization

Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.

Additionally, we're replacing the AutoGPTQ implementation with GPTQModel from ModelCloud (see repository here).

GPTQModel originated as a major refactor of AutoGPTQ but is now a full stand-in replacement with a cleaner API, up-to-date model support, faster inference, and higher quality quants.

Enable gptqmodel by @jiqing-feng in #35012
Split and clean up GGUF quantization tests by @Isotr0py in #35502
Display warning for unknown quants config instead of an error by @SunMarc in #35963
Adding FP8 Quantization to transformers by @MekkCyber in #36026
New HIGGS quantization interfaces, JIT kernel compilation support. by @BlackSamorez in #36148

Generate

[generate] revert change in Aria: the maximum cache length must match max_length by @gante in #36120
🧹 remove generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
[generate] ...

Contributors

dmlap, booxter, and 116 other contributors

07 Feb 10:10

ArthurZucker

v4.48.3

298b3f1

Patch release v4.48.3

This ends the python3.9 issues mostly!

Add future import for Py < 3.10 (#35666) by @Rocketknight1

For some very niche cases, the new rope embedding introduced device failures

Fix device in rope module when using dynamic updates (#35608) by @Cyrilvallez

Num items in batch

Fix model kwargs (#35875) by @muellerzr: this is long overdue, sorry that it took so long. Some models were not compatible with the num_items_in_batch
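
As background on why a global item count matters under gradient accumulation, the sketch below compares averaging per-micro-batch means with normalizing by the total number of items. The numbers are made up and this is not the Trainer's code.

```python
# Per-token losses for two micro-batches of different sizes (made-up numbers).
micro_batches = [
    [1.0, 1.0, 1.0, 1.0],  # 4 tokens
    [4.0],                 # 1 token
]

# Naive: average each micro-batch, then average the averages.
naive = sum(sum(mb) / len(mb) for mb in micro_batches) / len(micro_batches)

# Token-aware: sum every token loss and divide by the total item count.
num_items_in_batch = sum(len(mb) for mb in micro_batches)
token_aware = sum(sum(mb) for mb in micro_batches) / num_items_in_batch

print(naive, token_aware)
```

The naive average over-weights the single token in the short micro-batch, so the two results differ whenever micro-batches contain different numbers of items.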

Finally the fix to Gemma2 is propagated to paligemma2!

Paligemma: fix generation with Gemma2 (#36044) by @zucchini-nlp

Contributors

muellerzr, Rocketknight1, and 2 other contributors

30 Jan 19:52

ArthurZucker

v4.48.2

b673c16

Patch release v4.48.2

Sorry, the fixes for num_items_in_batch are not done yet 😓 To follow along see this PR; a new patch will be available soon!

Now, we mostly had BC issues with Python 3.9:

Restore is_torch_greater_or_equal_than for backward compatibility (#35734) by @tlrmchlsmth
Fix NoneType type as it requires py>=3.10 (#35843) by @SunMarc

Then we had a small regression for DBRX saving:

Fix: loading DBRX back from saved path (#35728) by @zucchini-nlp

Finally we have a fix for gemma and the hybrid attention architectures:

Fix mask slicing for models with HybridCache (#35681) by @Cyrilvallez

Miscellaneous:

Fix is_causal being a tensor (#35791) by @IlyasMoutawwakil

Contributors

tlrmchlsmth, SunMarc, and 3 other contributors

20 Jan 16:36

ArthurZucker

v4.48.1

2e752ea

Patch release v4.48.1

Yet another gradient accumulation fix! The attention refactoring also let a small typo slip in; we made sure Phi is no longer broken!

Moonshine had a small issue when wrapping generate so we removed that!

[Phi] bias should be True (#35650) @ArthurZucker
Fix condition when GA loss bug fix is not performed (#35651) @techkang
Patch moonshine (#35731) @eustlb

🤗

Contributors

techkang, ArthurZucker, and eustlb

10 Jan 12:14

LysandreJik

v4.48.0

6bc0fbc

v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine

New models

ModernBERT

The ModernBERT model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.

It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.

It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:

Rotary Positional Embeddings to support sequences of up to 8192 tokens.
Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
GeGLU layers replacing the original MLP layers, shown to improve performance.
Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
Flash Attention to speed up processing.
A model designed following the recent paper The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
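
To illustrate the alternating attention pattern from the list above, here is a minimal sketch that builds a boolean attention mask per layer. Which layer indices are global and how the 128-token window is positioned around each token are assumptions made for the example, not details taken from the model's implementation.

```python
def is_global_layer(layer_idx: int, global_every: int = 3) -> bool:
    """Assumption: layers 0, 3, 6, ... are the global ones."""
    return layer_idx % global_every == 0


def attention_mask(seq_len: int, layer_idx: int, window: int = 128):
    """Return a seq_len x seq_len boolean mask (True = may attend)."""
    if is_global_layer(layer_idx):
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2  # assumption: the window is centred on the query token
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


local_mask = attention_mask(seq_len=200, layer_idx=1)
global_mask = attention_mask(seq_len=200, layer_idx=3)
```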

Add ModernBERT to Transformers by @warner-benjamin in #35158

Aria

The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.

Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.

Add Aria by @aymeric-roucher in #34157

TimmWrapper

We add a TimmWrapper set of classes such that timm models can be loaded in as transformer models into the library.

Here's a general usage example:

import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoModelForImageClassification, AutoImageProcessor

# A timm checkpoint hosted on the Hub
checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

# Load the matching image processor and model through the Auto classes
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to percentages and keep the five most likely classes
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)

Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:

import torch
from urllib.request import urlopen
from PIL import Image

from transformers import pipeline

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))

Add TimmWrapper by @qubvel and @amyeroberts in #34564

Pixtral-Large

Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.

Update Pixtral conversion script to support large format! by @ArthurZucker in #34801

ColPali

The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.

In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
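
The late interaction scoring mentioned above can be sketched in a few lines of pure Python. This is the ColBERT-style MaxSim idea on toy 2-dimensional vectors; the function name and the example embeddings are illustrative assumptions, not the model's API.

```python
def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query vector, take its best
    dot-product match among the document vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]               # two query-token embeddings
page_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]  # three patch embeddings
page_b = [[0.2, 0.1], [0.1, 0.3]]              # two patch embeddings

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
```

Because every query token is matched against every document vector independently, a page ranks highly when each part of the query finds a strong match somewhere on it.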

Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736

Falcon3

Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:

One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.

Add Falcon3 documentation by @mokeddembillel in #35307

Bamba

Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.

Check out all Bamba-9B model checkpoints here.

Add the Bamba Model by @fabianlim in #34982

VitPose

ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".

The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.

Add VitPose by @SangbumChoi and @NielsRogge in #30530

DINOv2 with registers

The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.

Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.

The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:

no artifacts
interpretable attention maps
and improved performances.
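
The register mechanism can be pictured with a toy token sequence. In this sketch the number of registers and their position right after the [CLS] token are assumptions chosen for illustration; the tokens are plain strings rather than embeddings.

```python
def add_registers(cls_token, patch_tokens, num_registers=4):
    """Build the input sequence: [CLS] + register tokens + patch tokens."""
    registers = [f"reg{i}" for i in range(num_registers)]
    return [cls_token] + registers + patch_tokens


def drop_registers(output_tokens, num_registers=4):
    """Discard the register outputs, keeping [CLS] and the patch outputs."""
    return output_tokens[:1] + output_tokens[1 + num_registers:]


sequence = add_registers("cls", ["p0", "p1", "p2"])
kept = drop_registers(sequence)
```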

Add DINOv2 with registers by @NielsRogge in #35348

Emu3

The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.

Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on [VQ-VA...

Contributors

winglian, mfarre, and 81 other contributors

17 Dec 15:42

ArthurZucker

v4.47.1

241c04d

v4.47.1

Patch release v4.47.1

We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!

Fix GA loss bugs and add unit test (#35121)
Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115)
Contributed by @xspirus.
Fix FSDP no longer working (#35212)
Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212)
Contributed by @winglian.
Only import torch.distributed if it is available (#35133)
Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295)
Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!

Contributors

winglian, muellerzr, and 5 other contributors

05 Dec 17:45

LysandreJik

v4.47.0

5d7739f

v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel

New models

PaliGemma-2

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.

PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.

I-JEPA

The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

Add I-JEPA by @jmtzt in #33125

OLMo 2

The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.

The architectural changes from the original OLMo model to this model are:

RMSNorm is used instead of standard layer norm.
Norm is applied to attention queries and keys.
Norm is applied after attention/feedforward layers rather than before.
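
For readers unfamiliar with the first change in the list, here is a minimal pure-Python sketch of RMSNorm on a single vector. It shows the general technique only; the eps default is an assumption and the function is not the model's implementation.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale `x` by the reciprocal of its root mean square, then by `weight`.

    Unlike standard layer norm, no mean is subtracted and no bias is added.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
```

With unit weights and eps set to zero, the output vector has a mean square of exactly 1.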

Commits:

Add OLMo November 2024 by @2015aroras in #34551
Rename OLMo November to OLMo2 by @2015aroras in #34864

Layer-Skip Llama

We add support for Meta's Layer-Skip Llama 3.2 1B model.

The Llama3.2 1B model was continually pretrained with the LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.
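
The draft-then-verify loop behind self-speculative decoding can be sketched with toy next-token functions. Here draft_next stands in for the early layers and full_next for the full model; both are hypothetical stand-ins operating on integers, and a real implementation would verify all drafted tokens in a single forward pass rather than one call per token.

```python
def self_speculative_step(prefix, draft_next, full_next, num_draft=4):
    """One greedy speculative step; returns the tokens accepted in this step."""
    # 1) Draft several tokens autoregressively with the cheap function.
    drafted = []
    for _ in range(num_draft):
        drafted.append(draft_next(prefix + drafted))

    # 2) Verify: keep drafted tokens while they match the full model's choice.
    accepted = []
    for token in drafted:
        expected = full_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(token)
    return accepted


def full_next(seq):
    """Toy reference model: always counts upward."""
    return seq[-1] + 1


def draft_next(seq):
    """Toy draft model: counts upward but is wrong at every multiple of 5."""
    candidate = seq[-1] + 1
    return candidate if candidate % 5 else 0


tokens = self_speculative_step([1], draft_next, full_next)
```

The accepted tokens always equal what greedy decoding with the reference function alone would have produced; the draft only changes how many of them are obtained per step.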

Self-speculation (Layer-Skip Llama) by @ArthurZucker in #34240

Tensor Parallel implementation

This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).

The motivation is multi-fold:

to make modeling code as simple as the single-worker case:
all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
to make tensor parallelism easily accessible to users:
added a model.tensor_parallel(device_mesh) method that allows users to turn a single-process model into a parallel model.
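
As a conceptual aid, the sketch below shows the idea behind one common form of tensor parallelism, column-wise sharding of a linear layer, using plain Python lists in a single process. It does not use torch.distributed.tensor.parallel and is not how the PR is implemented; the helper names are made up for the example.

```python
def matmul(x, w):
    """x: [in], w: [in][out] -> [out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def column_shards(w, num_workers):
    """Split the output columns of `w` evenly across workers."""
    cols = len(w[0]) // num_workers
    return [[row[k * cols:(k + 1) * cols] for row in w] for k in range(num_workers)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "worker" multiplies by its own shard; concatenating restores the full output.
partials = [matmul(x, shard) for shard in column_shards(w, num_workers=2)]
parallel_out = [v for part in partials for v in part]
```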

This is the first PR of many to simplify and enable Tensor Parallel across models.

Simplify Tensor Parallel implementation with PyTorch TP by @kwen2501 in #34184

Farewell, Python 3.8

Python 3.8 has reached end of life, so we are dropping it from our CI.

Drop support for Python 3.8 by @ydshieh in #34314

GGUF improvements

Several improvements have been made to the GGUF support in transformers, notably by adding new architectures to the list of supported architectures.

Add T5 GGUF loading support by @junejae in #33389
Add GGUF for Mamba by @VladOS95-cyber in #34200
Add Nemotron GGUF Loading Support by @farrosalferro in #34725
Improve gguf tensor processing by @VladOS95-cyber in #34515
Fix use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450

Fast processors

We continue the work to improve the speed of fast processors as detailed in this roadmap.

We contribute a fast processor to RT-DETR.

Add Image Processor Fast RT-DETR by @yonigozlan in #34354

New pipelines

A new pipeline has been added to transformers: image-text-to-text!

The pipeline supports the following inputs:

unbatched images and text - images=image, text=text
batched images and text - images = [image, image], text= [text, text]
several images per prompt (only for models supporting the use of an image token) - images = [[image, image], [image]] or images=[image, image, image], text = ["... ......", "......"]
Chat templates (for models supporting them).

Add image text to text pipeline by @yonigozlan in #34170

Notable refactors

Separate chat templates into a single file

We have had several issues with chat templates because they're stored as single lines in the JSON config files:

Impossible to review diffs
Very hard to edit in the web UI (or in general)
Differences between processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead

The solution:

Just move chat templates to a single chat_template.jinja file in the repo
If multiple templates are required, then they should still be stored in the JSON file. This is not supported for Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.
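
The precedence rule above can be sketched with the standard library alone. The resolve_chat_template helper is hypothetical and only mirrors the described behaviour (a chat_template.jinja file wins over the JSON entry); it is not the tokenizer's loading code.

```python
import json
import tempfile
from pathlib import Path


def resolve_chat_template(repo_dir):
    """Hypothetical helper: prefer chat_template.jinja, else fall back to
    the `chat_template` entry in tokenizer_config.json."""
    repo = Path(repo_dir)
    jinja_file = repo / "chat_template.jinja"
    if jinja_file.exists():
        return jinja_file.read_text(encoding="utf-8")
    config_file = repo / "tokenizer_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        return config.get("chat_template")
    return None


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp, "tokenizer_config.json")
    config_path.write_text(json.dumps({"chat_template": "from-json"}), encoding="utf-8")
    json_only = resolve_chat_template(tmp)

    # Once the Jinja file exists, it overrides the JSON entry.
    Path(tmp, "chat_template.jinja").write_text("from-jinja", encoding="utf-8")
    with_jinja = resolve_chat_template(tmp)
```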

For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.

Separate chat templates into a single file by @Rocketknight1 in #33957

Large modular logic refactor

This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:

visit the entire modular file (record imports/functions/classes/assignments nodes)
- create function dependency mapping
for each import coming from another model:
- visit the corresponding file
- create function dependency mapping
- update mapping with function/assignment from the modular (updated/new functions)
- create the class dependency graph based on merged dependencies
update dependency graph of the modular with the functions and assignments imported from the other files
for each class recorded in the modular:
- if inheriting from class in another file:
  - replace call to super
  - find the dependencies after the node was replaced
  - follow (updated with modular defs) dependency mapping to add all nodes
- else:
  - only add needed imported functions (and their dependencies)
determine the needed imports and add them
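
To give a feel for the "create function dependency mapping" step, here is a small sketch that maps each top-level function in a source string to the other top-level functions it calls. It uses Python's standard ast module purely for illustration, which may differ from the tooling the converter relies on, and the example source is made up.

```python
import ast


def function_dependencies(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the other top-level functions it calls."""
    tree = ast.parse(source)
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    deps = {}
    for name, node in functions.items():
        called = {
            call.func.id
            for call in ast.walk(node)
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
        }
        # Keep only calls to functions defined in this file, excluding recursion.
        deps[name] = (called & set(functions)) - {name}
    return deps


example = '''
def rotate_half(x):
    return x

def apply_rope(q):
    return rotate_half(q)

def forward(q):
    return apply_rope(q)
'''
deps = function_dependencies(example)
```

Following this mapping transitively from a class's methods is what lets a converter copy only the helper functions that are actually needed.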

Large modular logic refactoring by @Cyrilvallez in #34487

Community bugfixes and improvements

Remove graph breaks for torch.compile() in flash_attention_forward when Llama Model is padding free tuned by @Abhishek-TAMU in #33932
Better defaults by @ArthurZucker in #34026
translated gguf.md into chinese by @blueingman in #34163
CI: fix failures by @zucchini-nlp in #34371
Zamba is an LM by @LysandreJik in #34342
add code generation to natural language processing section by @furtnerthomas in #34333
Fix pil_torch_interpolation_mapping import in image_processing_detr_fast by @yonigozlan in #34375
Add code sample docstrings and checkpoint reference for GLM models by @h3110Fr13nd in #34360
refactor: remove redundant if-condition and improve type correctness for convert_tokens_to_ids by @winstxnhdw in #34030
Ignore unsupported kwarg in ProcessorMixin call by @yonigozlan in #34285
[PEFT] Add warning for missing key in LoRA adapter by @BenjaminBossan in #34068
Fix torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
Correct the new defaults by @Cyrilvallez in #34377
[auto. ping] Avoid sending empty info + add more team members by @ydshieh in #34383
Fix glm by @Cyrilvallez in #34388
Use non nested images and batched text Idefics2/3 by @yonigozlan in #34222
Fix onnx non-exportable ...

Contributors

winglian, pcuenca, and 125 other contributors

18 Nov 22:13

ArthurZucker

v4.46.3

052e652

Patch release v4.46.3

One small fix for FSDP + gradient accumulation loss issue!

FSDP grad accum fix, #34645 by @winglian

Contributors

winglian

05 Nov 18:21

ArthurZucker

v4.46.2

ccbd57a

Patch release v4.46.2

Mostly had to finish the gradient accumulation fixes!
Thanks to @techkang and @Ryukijano 🤗

VLMs: fix number of image tokens (#34332) by @zucchini-nlp
fix pixtral processor (#34486) by @molbap
enable average tokens across devices (#34373) by @techkang and @muellerzr
Update trainer for easier handling of accumulate, compile fixes, and … by @muellerzr and @Ryukijano
MPS: isin_mps_friendly can support 0D tensors (#34538) by @gante

Contributors

muellerzr, gante, and 4 other contributors

29 Oct 15:50

ArthurZucker

v4.46.1

bc598c0

Patch release v4.46.1

This is mostly for fx and onnx issues!

Fix regression loading dtype #34409 by @SunMarc
LLaVa: latency issues #34460 by @zucchini-nlp
Fix pix2struct #34374 by @IlyasMoutawwakil
Fix onnx non-exportable inplace aten op #34376 by @IlyasMoutawwakil
Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun

Contributors

michaelbenayoun, SunMarc, and 2 other contributors

Releases: huggingface/transformers
