
IDEFICS, GPTQ Quantization

Released by @LysandreJik · 22 Aug

IDEFICS

The IDEFICS model was proposed in OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh.

IDEFICS is the first open state-of-the-art visual language model at the 80B scale!

The model accepts arbitrary sequences of images and text and produces text, similarly to a multimodal ChatGPT.

Blogpost: hf.co/blog/idefics
Playground: HuggingFaceM4/idefics_playground

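Below is a minimal generation sketch. The checkpoint name, the blank placeholder image, and device_map="auto" (which assumes accelerate is installed) are illustrative; see the blogpost and model card for the full API.

import torch
from PIL import Image
from transformers import IdeficsForVisionText2Text, AutoProcessor

# a minimal sketch; the checkpoint is illustrative
checkpoint = "HuggingFaceM4/idefics-9b"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

# a prompt is an interleaved list of images (PIL images or URLs) and text
image = Image.new("RGB", (224, 224), color="white")
prompts = [[image, "In this picture, we can see"]]

inputs = processor(prompts, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])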

MPT

MPT has been added and is now officially supported within Transformers. The repositories from MosaicML have been updated to work best with the model integration within Transformers.
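For instance, the updated MosaicML checkpoints can now be loaded directly through the Auto classes. The checkpoint name below is illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

# illustrative checkpoint; the MosaicML repositories have been updated to work natively with Transformers
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b")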

GPTQ Integration

GPTQ quantization is now supported in Transformers, through the optimum library. The backend relies on the auto_gptq library, from which we use the GPTQ and QuantLinear classes.

See below for an example of the API, quantizing a model using the new GPTQConfig configuration utility.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, group_size=128, desc_act=False)
# also works with device_map (CPU offload works, but disk offload does not)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, quantization_config=config)

Most models under the TheBloke namespace with the GPTQ suffix should be supported. For example, to load the GPTQ-quantized model TheBloke/Llama-2-13B-chat-GPTQ, simply run the following (after installing the latest optimum and auto-gptq libraries):

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

For more information about this feature, we recommend taking a look at the following announcement blogpost: https://huggingface.co/blog/gptq-integration

Pipelines

A new pipeline, dedicated to text-to-audio and text-to-speech models, has been added to Transformers. It currently supports the 3 text-to-audio models integrated into transformers: SpeechT5ForTextToSpeech, MusicGen and Bark.

See below for an example:

from transformers import pipeline

pipe = pipeline(model="suno/bark")
output = pipe("Hey it's HuggingFace on the phone!")

audio = output["audio"]
sampling_rate = output["sampling_rate"]

Classifier-Free Guidance decoding

Classifier-Free Guidance decoding is a text generation technique developed by EleutherAI, announced in this paper. With this technique, you can increase prompt adherence during generation. You can also combine it with negative prompts to steer generation away from specific directions. See its docs for usage instructions.
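As a rough sketch, the technique is exposed through the guidance_scale argument of generate; the model below is used purely for illustration.

from transformers import AutoModelForCausalLM, AutoTokenizer

# a minimal sketch; gpt2 is illustrative
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Today, a dragon flew over Paris, France,", return_tensors="pt")

# guidance_scale > 1 increases adherence to the prompt
outputs = model.generate(**inputs, guidance_scale=1.5, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))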

Task guides

A new task guide covering Visual Question Answering has been added to Transformers.
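As an illustration of the task itself, VQA can be run through the existing pipeline API; the checkpoint and image path below are illustrative.

from transformers import pipeline

# a minimal sketch; checkpoint and image path are illustrative
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="path/to/image.jpg", question="What is the animal doing?")
print(result[0]["answer"], result[0]["score"])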

Model deprecation

We continue the deprecation of models that was introduced in #24787.

By deprecating, we indicate that we will stop maintaining such models, but there is no intention of actually removing them and breaking support (they might one day move into a separate repo or onto the Hub, but we would still add the necessary imports to ensure backward compatibility is maintained). The main point is that we stop testing those models. This choice is driven by how little the models are used and aims to ease the burden on our CI so that it can focus on more critical aspects of the library.

Translation Efforts

There are ongoing efforts to translate the transformers documentation into other languages. These efforts are driven by groups independent of Hugging Face, and their work is greatly appreciated as it further lowers the barrier of entry to ML and Transformers.

If you'd like to kickstart such an effort or help out on an existing one, please feel free to reach out by opening an issue.

Explicit input data format for image processing

An input_data_format argument has been added to the image transforms and ImageProcessor methods, allowing users to explicitly set the data format of the images being processed. This enables processing of images with a non-standard number of channels (e.g. 4) and removes errors that occurred when the data format was inferred but the channel dimension was ambiguous.

import numpy as np
from transformers import ViTImageProcessor

# shape (4, 6, 3) is ambiguous: the channel dimension could be first (4) or last (3)
img = np.random.randint(0, 256, (4, 6, 3))
image_processor = ViTImageProcessor()
# input_data_format makes the layout explicit: here, 4 channels first
inputs = image_processor(img, image_mean=0, image_std=1, input_data_format="channels_first")

Documentation clarification about efficient inference through torch.scaled_dot_product_attention & Flash Attention

Many users are not aware that it is possible to force torch's scaled_dot_product_attention to dispatch to Flash Attention kernels. Doing so yields a considerable speedup and memory savings, and it is also compatible with quantized models. We decided to make this explicit to users in the documentation.

  • [Docs / BetterTransformer ] Added more details about flash attention + SDPA : #25265

In a nutshell, one can just run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").to("cuda")

# convert the model to BetterTransformer
model.to_bettertransformer()

input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

to enable Flash Attention in their model. Note, however, that this feature does not support padding yet.

FSDP and DeepSpeed Changes

  • Users will no longer encounter CPU RAM OOM when using FSDP to train very large models in a multi-GPU or multi-node multi-GPU setting.
  • Users no longer have to pass fsdp_transformer_layer_cls_to_wrap, as the code now uses _no_split_modules by default, which is available for most popular models (see the sketch below).
  • DeepSpeed ZeRO-3 init now works properly with the Accelerate launcher + Trainer.
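The sketch below shows the simplified FSDP configuration; the argument values are illustrative, not prescriptive.

from transformers import TrainingArguments

# a minimal sketch; values are illustrative and the run is assumed to be launched via torchrun or accelerate launch.
# fsdp_transformer_layer_cls_to_wrap no longer needs to be passed: the model's _no_split_modules is used by default.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",
)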

Breaking changes

Default optimizer in the Trainer class

The default optimizer in the Trainer class has been updated to adamw_torch rather than our own adamw_hf, as the official Torch optimizer is more robust and fixes some issues.

In order to keep the old behavior, ensure that you pass "adamw_hf" as the optim value in your TrainingArguments.

  • 🚨🚨🚨Change default from adamw_hf to adamw_torch 🚨🚨🚨 by @muellerzr in #25109
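For example, to keep the previous optimizer (output_dir is illustrative):

from transformers import TrainingArguments

# keep the pre-change optimizer by selecting it explicitly
args = TrainingArguments(output_dir="out", optim="adamw_hf")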

ViVit and EfficientNet rescale bugfix

There was an issue with how pixel values were rescaled in ViVit and EfficientNet. This has been fixed, but the fix results in different model outputs for both of these models. To understand the change and see what needs to be done to obtain the previous results, please take a look at the corresponding PR.

Removing softmax for the image classification EfficientNet class

The EfficientNetForImageClassification model class did not follow conventions and applied a softmax to the model logits. The softmax has been removed so that the class respects the convention set by other models.

In order to obtain the previous results, pass the model logits through a softmax, as sketched below.

  • 🚨🚨🚨 Remove softmax for EfficientNetForImageClassification 🚨🚨🚨 by @amyeroberts in #25501
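A minimal sketch; the checkpoint name and the blank input image are illustrative.

import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, EfficientNetForImageClassification

# a minimal sketch; the checkpoint is illustrative
checkpoint = "google/efficientnet-b0"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = EfficientNetForImageClassification.from_pretrained(checkpoint)

image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# the model now returns raw logits; apply a softmax to recover the previous (probability) outputs
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)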

Bug fixes with SPM models

Some SentencePiece-based (SPM) models had issues with their management of added tokens; namely, Llama and T5, among others, were behaving incorrectly. These have been addressed in #25224.

An option to obtain the previous behavior was added through the legacy flag, as explained in the PR linked above.
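As a rough sketch, the flag is passed when loading the tokenizer; the checkpoint name is illustrative.

from transformers import T5Tokenizer

# a minimal sketch; set legacy=True to opt back into the previous behavior
tokenizer = T5Tokenizer.from_pretrained("t5-small", legacy=True)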

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ranchlai
    • Add multi-label text classification support to pytorch example (#24770)
    • override .cuda() to check if model is already quantized (#25166)
    • fix get_keys_to_not_convert() to return correct modules for full precision inference (#25105)
    • add pathname and line number to logging formatter in debug mode (#25203)
    • add repr to the BitsAndBytesConfig class (#25517)
  • @wonhyeongseo
    • 🌐 [i18n-KO] Fixed Korean and English quicktour.md (#24664)
    • 🌐 [i18n-KO] Updated Korean serialization.md (#24686)
  • @Sunmin0520
    • 🌐 [i18n-KO] Translated testing.md to Korean (#24900)
  • @Xrenya
  • @susnato
    • Fix broken link in README_hd.md (#25067)
    • Add Pop2Piano (#21785)
  • @sjrl
    • [T5, MT5, UMT5] Add [T5, MT5, UMT5]ForSequenceClassification (#24726)
  • @Jackmin801
    • Allow trust_remote_code in example scripts (#25248)
  • @mjk0618
    • 🌐 [i18n-KO] Translated add_new_model.md to Korean (#24957)