Releases: huggingface/transformers
v4.49.0: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth Pro, RT-DETRv2
New models
Helium
Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.

- Add-helium by @ArthurZucker in #35669
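A minimal text-generation sketch; the checkpoint name is our assumption based on the Helium-1 preview release and may differ:

```python
from transformers import pipeline

# Checkpoint name assumed from the Helium-1 preview release; adjust if it differs.
generator = pipeline("text-generation", model="kyutai/helium-1-preview-2b")
print(generator("Bonjour, je m'appelle", max_new_tokens=20)[0]["generated_text"])
```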
Qwen2.5-VL
The Qwen2.5-VL model is an update to Qwen2-VL from Qwen team, Alibaba Group.
The abstract from this update is the following:
Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.
- add qwen2.5vl by @ShuaiBai623 in #35569
SuperGlue
The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model matches two sets of interest points detected in images. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

- Add SuperGlue model by @sbucaille in #29886
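A minimal keypoint-matching sketch, assuming the community SuperGlue checkpoint and the keypoint-matching post-processing helper described in the SuperGlue model docs; the image paths are placeholders:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Checkpoint name and post-processing helper assumed from the SuperGlue documentation.
checkpoint = "magic-leap-community/superglue_outdoor"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

image1 = Image.open("view_a.jpg")  # two views of the same scene (placeholder files)
image2 = Image.open("view_b.jpg")

inputs = processor([image1, image2], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only confident matches between the two images.
image_sizes = [[(image.height, image.width) for image in [image1, image2]]]
matches = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
print(matches[0]["keypoints0"].shape, matches[0]["keypoints1"].shape)
```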
Granite Vision Support
The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.
- Granite Vision Support by @alex-jw-brooks in #35579
Zamba2
Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.
Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
GOT-OCR 2.0
GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.
- Add GOT-OCR 2.0 to Transformers by @yonigozlan in #34721
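A minimal plain-text OCR sketch; the checkpoint name and the image-text-to-text auto class are assumptions based on the Hugging Face conversion of GOT-OCR 2.0, and the image path is a placeholder:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Checkpoint name assumed from the GOT-OCR 2.0 Hugging Face conversion.
checkpoint = "stepfun-ai/GOT-OCR-2.0-hf"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

image = Image.open("document.png")  # placeholder document image
inputs = processor(image, return_tensors="pt")

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
# Strip the prompt tokens and keep only the generated OCR text.
text = processor.decode(generated_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(text)
```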
DAB-DETR
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.
- Add DAB-DETR for object detection by @conditionedstimulus in #30803
Depth Pro
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
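A minimal depth-estimation sketch via the pipeline; the checkpoint name is our assumption of the converted Depth Pro weights on the Hub, and the image path is a placeholder:

```python
from PIL import Image
from transformers import pipeline

# Checkpoint name assumed from the Depth Pro conversion on the Hub.
depth_estimator = pipeline("depth-estimation", model="apple/DepthPro-hf")

result = depth_estimator(Image.open("room.jpg"))   # placeholder image
print(result["depth"])                  # PIL image holding the predicted depth map
print(result["predicted_depth"].shape)  # raw model output tensor
```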
RT-DETRv2
RT-DETRv2 is an improved Real-Time DEtection TRansformer (RT-DETR). It refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 point increase in mAP on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.
- Adding RTDETRv2 by @jadechoghari in #34773
Transformers-CLI
Transformers' CLI welcomes a new command: `chat`. This command starts a conversation with the model of your choosing directly in your terminal.
This feature exists in TRL and has been migrated to transformers for easier usage.
Processor Standardization
Ongoing work aims to standardize the image processors so that their APIs are equivalent. Additionally, the processors are gaining fast variants so that they are never bottlenecks in image processing pipelines.
In this release, several processors have been standardized and have received fast variants.
- OwlViT/Owlv2 post processing standardization by @qubvel in #34929
- OmDet Turbo processor standardization by @qubvel in #34937
- Grounding DINO Processor standardization by @qubvel in #34853
- Refactoring of ImageProcessorFast by @yonigozlan in #35069
- add Qwen2-VL image processor fast by @yonigozlan in #35733
- Remove Multi-threaded image conversion for fast image processors by @yonigozlan in #36105
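As an illustration, a fast variant can be requested through `use_fast=True` when loading an image processor; a minimal sketch, with the checkpoint chosen only as an example of a model that just gained a fast processor:

```python
from transformers import AutoImageProcessor

# use_fast=True selects the fast (torchvision-backed) image processor when one is available.
image_processor = AutoImageProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", use_fast=True)
print(type(image_processor).__name__)  # expected to be the *ImageProcessorFast class
```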
Breaking changes
DPT segmentation maps
DPT image processors did not support `segmentation_maps`, instead only requiring `images`. This has been fixed.
This adds an argument to the `preprocess` method; therefore, users passing arguments positionally to that method may see changed behavior. We recommend using keyword arguments with such methods so as not to be affected by the addition of new features.
- 🔴 🔴 🔴 Added segmentation maps support for DPT image processor by @simonreise in #34345
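A minimal sketch of the new argument, using a DPT checkpoint with a segmentation head; the checkpoint name and dummy data are illustrative only:

```python
import numpy as np
from transformers import AutoImageProcessor

# Illustrative DPT checkpoint; any DPT image processor exposes the same preprocess API.
image_processor = AutoImageProcessor.from_pretrained("Intel/dpt-large-ade")

image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)          # dummy RGB image
segmentation_map = np.random.randint(0, 150, (480, 640), dtype=np.uint8)  # dummy label map

# Prefer keyword arguments: segmentation_maps is the newly added parameter.
inputs = image_processor(images=image, segmentation_maps=segmentation_map, return_tensors="pt")
print(inputs.keys())
```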
Image classification pipeline and single vs multi-label
The `problem_type` in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.
- 🚨🚨🚨 image-classification pipeline single-label and multi-label prob type squashing fns (sigmoid vs softmax) are backwards by @rwightman in #35848
Fixing the LayerNorm beta/gamma renames
The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:
- 🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. by @rwightman in #35615
VLM cleanup
The `ignore_index` property of the llava configuration has been removed as it was not serving a purpose.
- 🔴 VLM: compile compatibility by @zucchini-nlp in #35724
Quantization
Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.
- Split and clean up GGUF quantization tests by @Isotr0py in #35502
- Display warning for unknown quants config instead of an error by @SunMarc in #35963
- Adding FP8 Quantization to transformers by @MekkCyber in #36026
- New HIGGS quantization interfaces, JIT kernel compilation support. by @BlackSamorez in #36148
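As a hedged illustration of the new FP8 path: quantization is driven by a `quantization_config` passed to `from_pretrained`. The config class name below is our assumption based on the FP8 PR and may differ; the checkpoint is an arbitrary example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, FineGrainedFP8Config  # config class name assumed from #36026

model_id = "meta-llama/Llama-3.2-1B"          # arbitrary example checkpoint
quantization_config = FineGrainedFP8Config()  # assumed to work with default settings

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("FP8 quantization reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True))
```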
Generate
- [generate] revert change in Aria: the maximum cache length must match `max_length` by @gante in #36120
- 🧹 remove `generate`-related objects and methods scheduled for removal in v4.48 by @gante in #35677
- [generate] can instantiate `GenerationConfig(cache_implementation="static")` by @gante in #35679
- [generate] return Cache object even if passed in a legacy format by @gante in #35673
- [generate] update docstring of `SequenceBiasLogitsProcessor` by @gante in #35699
- Test: generate with `torch.compile(model.forward)` as a fast test by @gante in #34544
- [generate] move max time tests by @gante in #35962...
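One of the changes above makes `GenerationConfig(cache_implementation="static")` directly instantiable; a minimal sketch, with the checkpoint chosen only as an arbitrary example of a model with static-cache support:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Arbitrary example checkpoint; any model supporting the static cache works.
model_id = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# GenerationConfig can now be instantiated directly with a cache implementation.
generation_config = GenerationConfig(cache_implementation="static", max_new_tokens=20)

inputs = tokenizer("Static KV caches are useful because", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```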
Patch release v4.48.3
This mostly ends the Python 3.9 issues!
- Add future import for Py < 3.10 (#35666) by @Rocketknight1
For some very niche cases, the new RoPE embedding introduced device failures:
- Fix device in rope module when using dynamic updates (#35608) by @Cyrilvallez
Num items in batch
- Fix model kwargs (#35875) by @muellerzr: this is long overdue, sorry that it took so long. Some models were not compatible with the `num_items_in_batch` argument.
Finally, the fix to Gemma2 is propagated to PaliGemma2!
- Paligemma: fix generation with Gemma2 (#36044) by @zucchini-nlp
Patch release v4.48.2
Sorry, the fixes for `num_items_in_batch` are not done yet 😓 To follow along, see this PR; a new patch will be available soon!
Beyond that, we mostly had backward-compatibility issues with Python 3.9:
- Restore is_torch_greater_or_equal_than for backward compatibility (#35734) by @tlrmchlsmth
- Fix NoneType type as it requires py>=3.10 (#35843) by @SunMarc
Then we had a small regression for DBRX saving:
- Fix: loading DBRX back from saved path (#35728) by @zucchini-nlp
Finally we have a fix for gemma and the hybrid attention architectures:
- Fix mask slicing for models with HybridCache #35681 by @Cyrilvallez
Miscellaneous:
- Fix is_causal being a tensor (#35791) by @IlyasMoutawwakil
Patch release v4.48.1
Yet again we ship a gradient accumulation fix! There was also a refactoring of the attention that let a small typo slip in; we made sure Phi is no longer broken! Moonshine had a small issue when wrapping `generate`, so we removed that!
- [Phi] bias should be True (#35650) @ArthurZucker
- Fix condition when GA loss bug fix is not performed (#35651) @techkang
- Patch moonshine (#35731) @eustlb
🤗
v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine
New models
ModernBERT
The ModernBert model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
- Rotary Positional Embeddings to support sequences of up to 8192 tokens.
- Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
- GeGLU layers replacing the original MLP layers, shown to improve performance.
- Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
- Flash Attention to speed up processing.
- A design following the recent The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
- Modern training data scales (2 trillion tokens) and mixtures (including code and math data)
- Add ModernBERT to Transformers by @warner-benjamin in #35158
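A minimal fill-mask sketch; the checkpoint name is our assumption based on the Answer.AI release:

```python
from transformers import pipeline

# Checkpoint name assumed from the Answer.AI ModernBERT release.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])
```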
Aria
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
- Add Aria by @aymeric-roucher in #34157
TimmWrapper
We add a `TimmWrapper` set of classes such that timm models can be loaded as transformers models into the library.
Here's a general usage example:
```python
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor

checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)

with torch.no_grad():
    logits = model(**inputs).logits

top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
```
Thanks to this, timm models now have access to pipelines, as well as `Trainer`, accelerate device maps, quantization, etc.:
```python
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import pipeline

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
```
- Add TimmWrapper by @qubvel and @amyeroberts in #34564
Pixtral-Large
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
- Update Pixtral conversion script to support large format! by @ArthurZucker in #34801
ColPali
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
- Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736
Falcon3
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:
- One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
- Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
- Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
- Add Falcon3 documentation by @mokeddembillel in #35307
Bamba
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
- Add the Bamba Model by @fabianlim in #34982
VitPose
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
- Add VitPose by @SangbumChoi and @NielsRogge in #30530
DINOv2 with registers
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:
- no artifacts
- interpretable attention maps
- and improved performances.
- Add DINOv2 with registers by @NielsRogge in #35348
Emu3
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on [VQ-VA...
v4.47.1
Patch release v4.47.1
We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!
- Fix GA loss bugs and add unit test (#35121), contributed by @techkang and @ArthurZucker.
- Fix num_items_in_batch not being an integer (#35115), contributed by @xspirus.
- Fix FSDP no longer working (#35212), contributed by @muellerzr.
- Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212), contributed by @winglian.
- Only import torch.distributed if it is available (#35133), contributed by @GaetanLepage.
- [Whisper] Patch float type on MPS (#35295), contributed by @eustlb. 🔜 We should probably have MPS CIs to avoid repeating this!
v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel
New models
PaliGemma-2
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.

I-JEPA
The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

OLMo 2

The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.
The architectural changes from the original OLMo model to this model are:
- RMSNorm is used instead of standard layer norm.
- Norm is applied to attention queries and keys.
- Norm is applied after attention/feedforward layers rather than before.
Commits:
- Add OLMo November 2024 by @2015aroras in #34551
- Rename OLMo November to OLMo2 by @2015aroras in #34864
Layer-Skip Llama
We add support for Meta's Layer-Skip Llama 3.2 1B model.
The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, and is capable of performing self-speculative decoding: decode with earlier layers and verify with the remaining layers.

- Self-speculation (Layer-Skip Llama) by @ArthurZucker in #34240
Tensor Parallel implementation
This PR uses the `torch.distributed.tensor.parallel` subpackage to implement Tensor Parallel for Llama (as an example).
The motivation is multi-fold:
- to make modeling code as simple as the single-worker case: all manual TP implementations under `if self.config.pretraining_tp > 1` can be removed.
- to make tensor parallelism easily accessible to users: added a `model.tensor_parallel(device_mesh)` method that allows users to turn a single-process model into a parallel model.
This is the first PR of many to simplify and enable Tensor Parallel across models.
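A minimal multi-process sketch of the `model.tensor_parallel(device_mesh)` method described above; the checkpoint is an arbitrary Llama-family example and the launch command is an assumption about a typical torchrun setup:

```python
# Run with, e.g.: torchrun --nproc-per-node=4 tp_example.py
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # arbitrary example checkpoint

# One-dimensional device mesh spanning all local GPUs launched by torchrun.
world_size = int(os.environ["WORLD_SIZE"])
device_mesh = init_device_mesh("cuda", (world_size,))

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.tensor_parallel(device_mesh)  # shard the weights across the mesh

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Tensor parallelism splits each layer", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```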
Farewell, Python 3.8
Python 3.8 has reached end of life and, as such, we drop it from our CI.
GGUF improvements
Several improvements have been made to GGUF support in transformers, notably by adding new architectures to the list of supported architectures.
- Add T5 GGUF loading support by @junejae in #33389
- Add GGUF for Mamba by @VladOS95-cyber in #34200
- Add Nemotron GGUF Loading Support by @farrosalferro in #34725
- Improve gguf tensor processing by @VladOS95-cyber in #34515
- Fix `use_parallel_residual` and `qkv_bias` for StableLM GGUF config extraction by @Isotr0py in #34450
Fast processors
We continue the work to improve the speed of fast processors as detailed in this roadmap.
We contribute a fast image processor for RT-DETR.
- Add Image Processor Fast RT-DETR by @yonigozlan in #34354
New pipelines
A new pipeline has been added to transformers: image-text-to-text!
The pipeline supports the following inputs:
- unbatched images and text: `images=image, text=text`
- batched images and text: `images=[image, image], text=[text, text]`
- several images per prompt (only for models supporting the use of an image token): `images=[[image, image], [image]]` or `images=[image, image, image]`, `text=["... ... ...", "... ..."]`
- Chat templates (for models supporting them).
- Add image text to text pipeline by @yonigozlan in #34170
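A minimal sketch of the new pipeline using the unbatched images-and-text form listed above; the checkpoint is an arbitrary example of a supported VLM, and the image-token placement follows the usual LLaVA-style prompt convention:

```python
from transformers import pipeline

# Arbitrary example of a VLM supported by the image-text-to-text pipeline.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
out = pipe(images=url, text="<image> What do you see in this image?", max_new_tokens=20)
print(out[0]["generated_text"])
```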
Notable refactors
Separate chat templates into a single file
We have had several issues with chat templates because they're stored as single lines in the JSON config files:
- Impossible to review diffs
- Very hard to edit in the web UI (or in general)
- Differences between `processor` templates in `chat_template.json` and `tokenizer` templates in `tokenizer_config.json` causing confusion
- Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead
The solution:
- Just move chat templates to a single `chat_template.jinja` file in the repo
- If multiple templates are required, then they should still be stored in the JSON file. This is not supported for `Processor` classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
- If a `chat_template.jinja` file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any `chat_template` entry in `tokenizer_config.json`.
For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.
- Separate chat templates into a single file by @Rocketknight1 in #33957
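For illustration, setting and saving a chat template still goes through the usual tokenizer attributes; per the note above, the legacy JSON location remains the default serialization for now. The toy template and the gpt2 checkpoint below are just examples:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer; gpt2 is just an example

# A toy Jinja template, stored on the tokenizer as a plain string.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
)

print(tokenizer.apply_chat_template([{"role": "user", "content": "Hello!"}], tokenize=False))

# For now this still serializes into tokenizer_config.json by default, as described above;
# chat_template.jinja is the upcoming format.
tokenizer.save_pretrained("./my-tokenizer")
```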
Large modular logic refactor
This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:
- visit the modular file (record import/function/class/assignment nodes)
- create function dependency mapping
- for each import coming from another model:
  - visit the corresponding file
  - create function dependency mapping
  - update mapping with function/assignment from the modular (updated/new functions)
- create the class dependency graph based on merged dependencies
- update dependency graph of the modular with the functions and assignments imported from the other files
- for each class recorded in the modular:
  - if inheriting from a class in another file:
    - replace call to super
    - find the dependencies after the node was replaced
    - follow (updated with modular defs) dependency mapping to add all nodes
  - else:
    - only add needed imported functions (and their dependencies)
- determine the needed imports and add them
- Large modular logic refactoring by @Cyrilvallez in #34487
Community bugfixes and improvements
- Remove graph breaks for torch.compile() in flash_attention_forward when Llama Model is padding free tuned by @Abhishek-TAMU in #33932
- Better defaults by @ArthurZucker in #34026
- translated gguf.md into chinese by @blueingman in #34163
- CI: fix failures by @zucchini-nlp in #34371
- Zamba is an LM by @LysandreJik in #34342
- add code generation to natural language processing section by @furtnerthomas in #34333
- Fix pil_torch_interpolation_mapping import in image_processing_detr_fast by @yonigozlan in #34375
- Add code sample docstrings and checkpoint reference for GLM models by @h3110Fr13nd in #34360
- refactor: remove redundant if-condition and improve type correctness for `convert_tokens_to_ids` by @winstxnhdw in #34030
- Ignore unsupported kwarg in ProcessorMixin call by @yonigozlan in #34285
- [PEFT] Add warning for missing key in LoRA adapter by @BenjaminBossan in #34068
- Fix `torch.fx` issue related to the new `loss_kwargs` keyword argument by @michaelbenayoun in #34380
- Correct the new defaults by @Cyrilvallez in #34377
- [auto. ping] Avoid sending empty info + add more team members by @ydshieh in #34383
- Fix glm by @Cyrilvallez in #34388
- Use non nested images and batched text Idefics2/3 by @yonigozlan in #34222
- Fix onnx non-exposable ...
Patch release v4.46.3
Patch release v4.46.2
Mostly had to finish the gradient accumulation fixes!
Thanks to @techkang and @Ryukijano 🤗
- VLMs: fix number of image tokens (#34332) by @zucchini-nlp
- fix pixtral processor (#34486) by @molbap
- enable average tokens across devices (#34373) by @techkang and @muellerzr
- Update trainer for easier handling of accumulate, compile fixes, and … by @muellerzr and @Ryukijano
- MPS: isin_mps_friendly can support 0D tensors (#34538) by @gante
Patch release v4.46.1
This is mostly for `fx` and `onnx` issues!
- Fix regression loading dtype #34409 by @SunMarc
- LLaVa: latency issues #34460 by @zucchini-nlp
- Fix pix2struct #34374 by @IlyasMoutawwakil
- Fix onnx non-exposable inplace aten op #34376 by @IlyasMoutawwakil
- Fix torch.fx issue related to the new `loss_kwargs` keyword argument #34380 by @michaelbenayoun