Which Transformer to Favor:
A Comparative Analysis of Efficiency in Vision Transformers

[Figure: Pareto front of throughput vs. accuracy (first plot from the paper)]

This is the code for the paper Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers, a benchmark of more than 45 efficient vision transformers. We train all models from scratch and track multiple efficiency metrics.

Abstract

Self-attention in Transformers comes with a high computational cost because of its quadratic complexity, but the effectiveness of Transformers in addressing problems in language and vision has sparked extensive research aimed at enhancing their efficiency. However, diverse experimental conditions, spanning multiple input domains, prevent a fair comparison based solely on reported results, posing challenges for model selection. To address this gap in comparability, we perform a large-scale benchmark of more than 45 models for image classification, evaluating key efficiency aspects, including accuracy, speed, and memory usage. Our benchmark provides a standardized baseline for efficiency-oriented transformers. We analyze the results based on the Pareto front -- the boundary of optimal models. Surprisingly, despite claims of other models being more efficient, ViT remains Pareto optimal across multiple metrics. We observe that hybrid attention-CNN models exhibit remarkable inference memory- and parameter-efficiency. Moreover, our benchmark shows that using a larger model is generally more efficient than using higher-resolution images. Thanks to our holistic evaluation, we provide a centralized resource for practitioners and researchers, facilitating informed decisions when selecting or developing efficient transformers.

Updates

Requirements

This project builds heavily on timm and on open-source implementations of the tested models. All requirements are listed in requirements.txt. To install them, run

pip install -r requirements.txt

Usage

After cloning this repository, you can train and evaluate a wide range of models. By default, an srun command is executed to run the code on a Slurm cluster. To run on the local machine instead, append the -local flag to the command.

Dataset Preparation

Supported datasets are CIFAR10, ImageNet-21k, and ImageNet-1k.

The CIFAR10 dataset has to be located in a subfolder of the dataset root directory called CIFAR. This is the standard CIFAR10 dataset from torchvision.datasets.

To speed up the data loading, the ImageNet datasets are read using datadings. The .msgpack files for ImageNet-1k should be located in <dataset_root_folder>/imagenet/msgpack, while the ones for ImageNet-21k should be in <dataset_root_folder>/imagenet-21k. See the datadings documentation for information on how to create those files.
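
As a rough sketch of the expected layout (only the folder names mentioned above are prescribed; the annotations are assumptions based on the description), the dataset root might look like this:

<dataset_root_folder>/
    CIFAR/                 (CIFAR10 via torchvision.datasets)
    imagenet/
        msgpack/           (ImageNet-1k .msgpack files created with datadings)
    imagenet-21k/          (ImageNet-21k .msgpack files created with datadings)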

Training

Pretraining

To pretrain a model on a given dataset, run

./main.py -model <model_name> -epochs <epochs> -dataset_root <dataset_root_folder>/ -results_folder <folder_for_results>/ -logging_folder <logging_folder> -run_name <name_or_description_of_the_run> (-local)

This will save a checkpoint (.tar file) every <save_epochs> epochs (the default is 10), which contains all the model weights, along with the optimizer and scheduler state, and the current training stats. The default pretraining dataset is ImageNet-21k.
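
As an illustration, a local pretraining run of a DeiT-Small model could look like the following; the epoch count, paths, and run name are placeholders, not recommended settings:

./main.py -model deit_small_patch16_LS -epochs 300 -dataset_root /data/datasets/ -results_folder results/ -logging_folder logs/ -run_name deit_small_pretrain -local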

Finetuning

A model (checkpoint) can be finetuned on another dataset using the following command:

./main.py -task fine-tune -model <model_checkpoint_file.tar> -epochs <epochs> -lr <lr> -dataset_root <dataset_root_folder>/ -results_folder <folder_for_results>/ -logging_folder <logging_folder> -run_name <name_or_description_of_the_run> (-local)

This will also save new checkpoints during training. The default finetuning dataset is ImageNet-1k.
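
For example, continuing from the pretraining run above, a checkpoint could be fine-tuned on the default ImageNet-1k dataset like this (the checkpoint path, learning rate, and other values are placeholders):

./main.py -task fine-tune -model results/deit_small_pretrain.tar -epochs 30 -lr 1e-4 -dataset_root /data/datasets/ -results_folder results/ -logging_folder logs/ -run_name deit_small_finetune -local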

Evaluation

It is also possible to evaluate the models. To evaluate the model's accuracy and the efficiency metrics, run

./main.py -task eval -model <model_checkpoint_file.tar> -dataset_root <dataset_root_folder>/ -results_folder <folder_for_results>/ -logging_folder <logging_folder> -run_name <name_or_description_of_the_run> (-local)

The default evaluation dataset is ImageNet-1k.

To only evaluate the efficiency metrics, run

./main.py -task eval-metrics -model <model_checkpoint_file.tar> -dataset_root <dataset_root_folder>/ -results_folder <folder_for_results>/ -logging_folder <logging_folder> -run_name <name_or_description_of_the_run> (-local)

This utilizes the CIFAR10 dataset by default.
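
As a quick example of such an efficiency-only measurement on the default CIFAR10 data (all paths are placeholders):

./main.py -task eval-metrics -model results/deit_small_finetune.tar -dataset_root /data/datasets/ -results_folder results/ -logging_folder logs/ -run_name deit_small_metrics -local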

Further Arguments

Multiple further arguments and flags can be passed to the script. The most important ones are

Arg Description
-model <model> Model name or checkpoint.
-run_name <name for the run> Name or description of this training run.
-dataset <dataset> Specifies a dataset to use.
-task <task> Specifies a task. The default is pre-train.
-local Run on the local machine, not on a slurm cluster.
-dataset_root <dataset root> Root folder of the datasets.
-results_folder <results folder> Folder to save results into.
-logging_folder <logging folder> Folder for saving logfiles.
-epochs <epochs> Epochs to train.
-lr <lr> Learning rate. Default is 3e-3.
-batch_size <bs> Batch size. Default is 2048.
-weight_decay <wd> Weight decay. Default is 0.02.
-imsize <image resolution> Resolution of the images to train with. Default is 224.

For a list of all arguments, run

./main.py --help
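
To illustrate how these flags combine, a hypothetical pretraining run that overrides the default hyperparameters might look like this (all values are arbitrary placeholders, not recommended settings):

./main.py -model swin_tiny_patch4_window7 -epochs 100 -batch_size 1024 -lr 1e-3 -weight_decay 0.05 -imsize 192 -dataset_root /data/datasets/ -results_folder results/ -logging_folder logs/ -run_name swin_tiny_custom -local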

Supported Models

These are the supported models. Links point to the original code sources; if no link is provided, we implemented the architecture from scratch, following the respective paper.

Architecture Versions
AViT avit_tiny_patch16, avit_small_patch16, avit_base_patch16, avit_large_patch16
AFNO afno_ti_p16, afno_s_p16, afno_b_p16, afno_l_p16
CaiT cait_xxs24, cait_xxs36, cait_xs24, cait_s24, cait_s36, cait_m36, cait_m48
CoaT coat_tiny, coat_mini, coat_small, coat_lite_tiny, coat_lite_mini, coat_lite_small, coat_lite_medium
CvT cvt_13, cvt_21, cvt_w24
DeiT deit_tiny_patch16_LS, deit_small_patch16_LS, deit_medium_patch16_LS, deit_base_patch16_LS, deit_large_patch16_LS, deit_huge_patch14_LS, deit_huge_patch14_52_LS, deit_huge_patch14_26x2_LS, deit_Giant_48_patch14_LS, deit_giant_40_patch14_LS, deit_small_patch16_36_LS, deit_small_patch16_36, deit_small_patch16_18x2_LS, deit_small_patch16_18x2, deit_base_patch16_18x2_LS, deit_base_patch16_18x2, deit_base_patch16_36x1_LS, deit_base_patch16_36x1
DynamicViT dynamic_vit_tiny_patch16, dynamic_vit_90_tiny_patch16, dynamic_vit_70_tiny_patch16, dynamic_vit_small_patch16, dynamic_vit_90_small_patch16, dynamic_vit_70_small_patch16, dynamic_vit_base_patch16, dynamic_vit_70_base_patch16, dynamic_vit_90_base_patch16, dynamic_vit_large_patch16, dynamic_vit_90_large_patch16, dynamic_vit_70_large_patch16
EfficientFormerV2 efficientformerv2_s0, efficientformerv2_s1, efficientformerv2_s2, efficientformerv2_l
EfficientMod efficient_mod_xxs, efficient_mod_xs, efficient_mod_s
EfficientViT efficient_vit_b0, efficient_vit_b1, efficient_vit_b2, efficient_vit_b3, efficient_vit_l1, efficient_vit_l2
EViT evit_tiny_patch16, evit_tiny_patch16_fuse, evit_small_patch16, evit_small_patch16_fuse, evit_base_patch16, evit_base_patch16_fuse
Fast-ViT fastvit_t8, fastvit_t12, fastvit_s12, fastvit_sa12, fastvit_sa24, fastvit_sa36, fastvit_m36
FNet fnet_vit_tiny_patch16, fnet_vit_small_patch16, fnet_vit_base_patch16, fnet_vit_large_patch16, fnet_vit_tiny_patch4, fnet_vit_small_patch4, fnet_vit_base_patch4, fnet_vit_large_patch4
FocalNet focalnet_tiny_srf, focalnet_small_srf, focalnet_base_srf, focalnet_tiny_lrf, focalnet_small_lrf, focalnet_base_lrf, focalnet_tiny_iso, focalnet_small_iso, focalnet_base_iso, focalnet_large_fl3, focalnet_large_fl4, focalnet_xlarge_fl3, focalnet_xlarge_fl4, focalnet_huge_fl3, focalnet_huge_fl4
GFNet gfnet_tiny_patch4, gfnet_extra_small_patch4, gfnet_small_patch4, gfnet_base_patch4, gfnet_tiny_patch16, gfnet_extra_small_patch16, gfnet_small_patch16, gfnet_base_patch16
HaloNet halonet_h0, halonet_h1, halonet_h2
HiViT hi_vit_tiny_patch16, hi_vit_small_patch16, hi_vit_base_patch16, hi_vit_large_patch16
Hydra Attention hydra_vit_tiny_patch16, hydra_vit_small_patch16, hydra_vit_base_patch16, hydra_vit_large_patch16
Informer informer_vit_tiny_patch16, informer_vit_small_patch16, informer_vit_base_patch16, informer_vit_large_patch16
Linear Transformer linear_vit_tiny_patch16, linear_vit_small_patch16, linear_vit_base_patch16, linear_vit_large_patch16
Linformer linformer_vit_tiny_patch16, linformer_vit_small_patch16, linformer_vit_base_patch16, linformer_vit_large_patch16
MLP-Mixer mixer_s32, mixer_s16, mixer_b32, mixer_b16, mixer_l32, mixer_l16
Next-ViT nextvit_small, nextvit_base, nextvit_large
Nyströmformer nystrom64_vit_tiny_patch16, nystrom32_vit_tiny_patch16, nystrom64_vit_small_patch16, nystrom32_vit_small_patch16, nystrom64_vit_base_patch16, nystrom32_vit_base_patch16, nystrom64_vit_large_patch16, nystrom32_vit_large_patch16
Performer performer_vit_tiny_patch16, performer_vit_small_patch16, performer_vit_base_patch16, performer_vit_large_patch16
PolySA polysa_vit_tiny_patch16, polysa_vit_small_patch16, polysa_vit_base_patch16, polysa_vit_large_patch16
Reformer reformer_vit_tiny_patch16, reformer_vit_small_patch16, reformer_vit_base_patch16, reformer_vit_large_patch16
ResNet resnet18, resnet34, resnet26, resnet50, resnet101, wide_resnet50_2
ResT rest_lite, rest_small, rest_base, rest_large
Routing Transformer routing_vit_tiny_patch16, routing_vit_small_patch16, routing_vit_base_patch16, routing_vit_large_patch16
Sinkhorn Transformer sinkhorn_cait_tiny_bmax32_patch16, sinkhorn_cait_tiny_bmax64_patch16, sinkhorn_cait_small_bmax32_patch16, sinkhorn_cait_small_bmax64_patch16, sinkhorn_cait_base_bmax32_patch16, sinkhorn_cait_base_bmax64_patch16, sinkhorn_cait_large_bmax32_patch16, sinkhorn_cait_large_bmax64_patch16
SLAB slab_tiny_patch16, slab_small_patch16
STViT stvit_swin_tiny_p4_w7, stvit_swin_base_p4_w7
SwiftFormer swiftformer_xs, swiftformer_s, swiftformer_l1, swiftformer_l3
Swin swin_tiny_patch4_window7, swin_small_patch4_window7, swin_base_patch4_window7, swin_large_patch4_window7
SwinV2 swinv2_tiny_patch4_window7, swinv2_small_patch4_window7, swinv2_base_patch4_window7, swinv2_large_patch4_window7
Switch Transformer switch_8_vit_tiny_patch16, switch_8_vit_small_patch16, switch_8_vit_base_patch16, switch_8_vit_large_patch16
Synthesizer synthesizer_fd_vit_tiny_patch16, synthesizer_fr_vit_tiny_patch16, synthesizer_fd_vit_small_patch16, synthesizer_fr_vit_small_patch16, synthesizer_fd_vit_base_patch16, synthesizer_fr_vit_base_patch16, synthesizer_fd_vit_large_patch16, synthesizer_fr_vit_large_patch16
TokenLearner token_learner_vit_8_50_tiny_patch16, token_learner_vit_8_75_tiny_patch16, token_learner_vit_8_50_small_patch16, token_learner_vit_8_75_small_patch16, token_learner_vit_8_50_base_patch16, token_learner_vit_8_75_base_patch16, token_learner_vit_8_50_large_patch16, token_learner_vit_8_75_large_patch16
ToMe tome_vit_tiny_r8_patch16, tome_vit_tiny_r13_patch16, tome_vit_small_r8_patch16, tome_vit_small_r13_patch16, tome_vit_base_r8_patch16, tome_vit_base_r13_patch16, tome_vit_large_r8_patch16, tome_vit_large_r13_patch16
ViT ViT-{Ti,S,B,L}/<patch_size>
Wave ViT wavevit_s, wavevit_b, wavevit_l
XCiT xcit_nano_12_p16, xcit_tiny_12_p16, xcit_small_12_p16, xcit_tiny_24_p16, xcit_small_24_p16, xcit_medium_24_p16, xcit_large_24_p16, xcit_nano_12_p8, xcit_tiny_12_p8, xcit_small_12_p8, xcit_tiny_24_p8, xcit_small_24_p8, xcit_medium_24_p8, xcit_large_24_p8

License

We release this code under the MIT license.

Citation

If you use this codebase in your project, please cite:

@misc{Nauen2023WTFBenchmark,
      title={Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers}, 
      author={Tobias Christian Nauen and Sebastian Palacio and Andreas Dengel},
      year={2023},
      eprint={2308.09372},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      note={Accepted at WACV 2025}
}
