v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel

@LysandreJik LysandreJik released this 05 Dec 17:45
· 239 commits to main since this release

New models

PaliGemma-2

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLMs) inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context. This lets it perform deeper image analysis, such as captioning images and short videos, detecting objects, and reading text embedded within images.

PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.

I-JEPA

The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

OLMo 2

The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.

The architectural changes from the original OLMo model to this model are:

  • RMSNorm is used instead of standard layer norm.
  • Norm is applied to attention queries and keys.
  • Norm is applied after attention/feedforward layers rather than before.
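
As a rough illustration of the first and third changes, here is a minimal plain-Python sketch of RMSNorm and of a block that normalizes the sublayer output before the residual add. The function names are hypothetical and the code works on plain lists; it is one reading of the bullet points above, not the library's implementation.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale x by the reciprocal of its root mean square.

    Unlike standard layer norm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


def post_norm_block(x, sublayer, weight):
    """Apply the norm after the sublayer (attention or feedforward).

    `sublayer` is a hypothetical callable mapping a list to a list of the
    same length. The normalized output is then added to the residual.
    """
    normed_out = rms_norm(sublayer(x), weight)
    return [residual + out for residual, out in zip(x, normed_out)]
```

With a unit weight, the output of rms_norm has a mean square close to 1 whatever the scale of the input, which is the property the norm is meant to provide.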

Layer-Skip Llama

We add support for Meta's Layer-Skip Llama 3.2 1B model.

The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe (early exit loss and layer dropout), as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding. It is capable of self-speculative decoding: it drafts tokens with its earlier layers and verifies them with the remaining layers.
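
The draft-then-verify loop can be sketched with two stand-in functions. This is a toy, greedy illustration in plain Python: draft_next and full_next are hypothetical callables standing in for the early-exit layers and the full model, and a real implementation would verify all draft tokens in one batched forward pass rather than one call per token.

```python
def self_speculative_decode(draft_next, full_next, prompt, max_new_tokens, num_draft=4):
    """Greedy draft-then-verify decoding.

    draft_next(tokens) -> next token predicted cheaply (stand-in for early layers).
    full_next(tokens)  -> next token predicted by the full model.
    The result always matches plain greedy decoding with full_next.
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft phase: propose a few tokens with the cheap predictor.
        draft = []
        for _ in range(min(num_draft, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify phase: keep the longest prefix the full model agrees with.
        for i, proposed in enumerate(draft):
            expected = full_next(tokens + draft[:i])
            if expected != proposed:
                # Accept the agreed prefix, then take the full model's token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            # Every draft token was accepted.
            tokens.extend(draft)
    return tokens
```

The speed-up in practice comes from the draft being cheap and the verification being batched; the sketch only shows the acceptance logic.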

Tensor Parallel implementation

This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).

The motivation is multi-fold:

  1. To keep modeling code as simple as in the single-worker case:
    all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.

  2. To make tensor parallelism easily accessible to users:
    a new model.tensor_parallel(device_mesh) method lets users turn a single-process model into a parallel model.

This is the first PR of many to simplify and enable Tensor Parallel across models.

  • Simplify Tensor Parallel implementation with PyTorch TP by @kwen2501 in #34184
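
To show the arithmetic that tensor parallelism relies on, here is a toy plain-Python sketch of a two-layer MLP whose first weight matrix is split by columns and whose second is split by rows. It does not use torch.distributed, all function names are hypothetical, and the "workers" are simulated by a loop; it only demonstrates that the sharded computation matches the unsharded one.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Column-wise sharding: each worker keeps a slice of the output features."""
    step = len(w[0]) // parts  # assumes the width is divisible by parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def split_rows(w, parts):
    """Row-wise sharding: each worker keeps a slice of the input features."""
    step = len(w) // parts  # assumes the height is divisible by parts
    return [w[p * step:(p + 1) * step] for p in range(parts)]


def tensor_parallel_mlp(x, w1, w2, parts):
    partials = []
    for w1_shard, w2_shard in zip(split_columns(w1, parts), split_rows(w2, parts)):
        # Each worker computes its slice of the hidden layer locally (ReLU is elementwise).
        hidden = [max(0.0, h) for h in matmul(x, w1_shard)]
        partials.append(matmul(hidden, w2_shard))
    # Summing the partial outputs plays the role of an all-reduce.
    return [sum(values) for values in zip(*partials)]


def reference_mlp(x, w1, w2):
    return matmul([max(0.0, h) for h in matmul(x, w1)], w2)
```

Pairing a column-wise split with a row-wise split means the workers only need to communicate once, at the final sum.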

Farewell, Python 3.8

Python 3.8 has reached end of life, so we have dropped it from our CI.

GGUF improvements

Several improvements have been made to GGUF support in transformers, notably the addition of new architectures to the list of supported architectures.

Fast processors

We continue the work to improve the speed of fast processors as detailed in this roadmap.

We contribute a fast processor to RT-DETR.

New pipelines

A new pipeline has been added to transformers: image-text-to-text!

The pipeline supports the following inputs:

  • unbatched images and text - images=image, text=text
  • batched images and text - images=[image, image], text=[text, text]
  • several images per prompt (only for models supporting the use of an image token) - images=[[image, image], [image]] or images=[image, image, image], text=["... ......", "......"]
  • chat templates (for models supporting them).
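
The first two input shapes can be exercised with a small helper. This is a sketch under assumptions: call_variants is a hypothetical name, and pipe is assumed to be the object returned by transformers.pipeline("image-text-to-text", model=...) with a checkpoint of your choice. The helper itself only needs a callable that accepts images and text keyword arguments.

```python
def call_variants(pipe, image, text):
    """Call an image-text-to-text pipeline with unbatched and batched inputs.

    `pipe` is assumed to accept `images` and `text` keyword arguments,
    as described in the list of supported inputs.
    """
    # Unbatched: one image and one prompt.
    unbatched = pipe(images=image, text=text)
    # Batched: a list of images paired with a list of prompts.
    batched = pipe(images=[image, image], text=[text, text])
    return unbatched, batched
```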

Notable refactors

Separate chat templates into a single file

We have had several issues with chat templates because they're stored as single lines in the JSON config files:

  • Impossible to review diffs
  • Very hard to edit in the web UI (or in general)
  • Differences between processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
  • Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead

The solution:

  • Just move chat templates to a single chat_template.jinja file in the repo
  • If multiple templates are required, then they should still be stored in the JSON file. This is not supported for Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
  • If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.
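
The precedence rule in the last point can be sketched as follows. This is an illustrative helper with a hypothetical name (resolve_chat_template), not the library's loading code; it assumes the template files sit directly in a local repo directory.

```python
import json
from pathlib import Path


def resolve_chat_template(repo_dir):
    """Return the chat template for a repo, preferring chat_template.jinja.

    Falls back to the chat_template entry in tokenizer_config.json,
    and returns None when neither source provides a template.
    """
    repo = Path(repo_dir)
    jinja_file = repo / "chat_template.jinja"
    if jinja_file.is_file():
        # The raw Jinja file overrides whatever the JSON config contains.
        return jinja_file.read_text(encoding="utf-8")
    config_file = repo / "tokenizer_config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        return config.get("chat_template")
    return None
```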

For now, we continue saving in the old format by default. We will probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it is ready for deployment when we make the switch.

Large modular logic refactor

This PR largely reworks the logic used in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding code and then deleting it when it turns out not to be needed, we now do the following:

  • visit the whole modular file (record import/function/class/assignment nodes)
    • create function dependency mapping
  • for each import coming from another model:
    • visit the corresponding file
    • create function dependency mapping
    • update mapping with function/assignment from the modular (updated/new functions)
    • create the class dependency graph based on merged dependencies
  • update dependency graph of the modular with the functions and assignments imported from the other files
  • for each class recorded in the modular:
    • if inheriting from a class in another file:
      • replace call to super
      • find the dependencies after the node was replaced
      • follow (updated with modular defs) dependency mapping to add all nodes
    • else:
      • only add needed imported functions (and their dependencies)
  • determine the needed imports and add them
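
The "function dependency mapping" step can be illustrated with a small sketch. It uses the standard-library ast module purely for illustration and handles only top-level functions; the function names are hypothetical and this is not the converter's actual code.

```python
import ast


def function_dependencies(source):
    """Map each top-level function to the other top-level functions it references."""
    tree = ast.parse(source)
    functions = {node.name: node for node in tree.body if isinstance(node, ast.FunctionDef)}
    mapping = {}
    for name, node in functions.items():
        # Collect every bare name used anywhere inside the function body.
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        mapping[name] = (used & set(functions)) - {name}
    return mapping


def all_dependencies(name, mapping):
    """Follow the mapping transitively to find everything `name` needs."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for dep in mapping.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Following the mapping transitively is what allows only the needed functions, and their own dependencies, to be added to the generated file.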

Community bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library since the last release:

  • @AhmedAlmaghz
    • [i18n-ar] Translated file : docs/source/ar/fast_tokenizers.md into Arabic (#33034)
    • [i18n-ar] Translated file : docs/source/ar/multilingual.md into Arabic (#33048)
    • [i18n-ar] Translated file : docs/source/ar/trainer.md into Arabic (#33080)
    • [i18n-ar] Translated file : docs/source/ar/torchscript.md into Arabic (#33079)
    • [i18n-ar] Translated file : docs/source/ar/benchmarks.md into Arabic (#33023)
  • @maximizemaxwell
    • 🌐 [i18n-KO] Translated perf_train_special.md to Korean (#34590)
    • 🌐 [i18n-KO] Translated bert.md to Korean (#34627)
    • 🌐 [i18n-KO] Translated marian.md to Korean (#34698)
    • 🌐 [i18n-KO] Translated encoder-decoder.md to Korean (#34880)
  • @2015aroras
    • Add OLMo November 2024 (#34551)
    • Rename OLMo November to OLMo2 (#34864)
  • @mgoin
    • Add optimized PixtralImageProcessorFast (#34836)