The official repository of the paper Online Gaussian Test-Time Adaptation of Vision-Language Models.
Authors: Clément Fuchs*, Maxime Zanella*, Christophe De Vleeschouwer.
*Denotes equal contribution
OGA is an online adaptation method that builds a cache of samples with low zero-shot entropy along a data stream. The cache is then used to fit a multivariate Gaussian model of the class-conditional likelihoods of the observed features, and updated predictions are computed with a pseudo-Bayesian Maximum A Posteriori (MAP) estimator. Main results, averaged over 11 datasets, are summarized in the two tables below.
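As a rough illustration of this decision rule (a sketch in our own notation, not the repository's implementation; the function and variable names are ours, and we assume a single covariance matrix shared across classes with zero-shot log-probabilities acting as priors):

```python
import numpy as np

def map_predict(feats, means, inv_cov, log_priors):
    """Hypothetical sketch of a Gaussian MAP decision rule: score each
    feature under per-class Gaussians sharing one covariance matrix,
    add class log-priors, and take the argmax."""
    diffs = feats[:, None, :] - means[None, :, :]          # (N, C, D)
    # Mahalanobis term of log N(x | mu_c, Sigma), up to a constant shared by all classes
    mahal = np.einsum('ncd,de,nce->nc', diffs, inv_cov, diffs)
    return np.argmax(-0.5 * mahal + log_priors[None, :], axis=1)
```

In OGA, the per-class means and the covariance would be estimated from the low-entropy samples accumulated in the cache; see the paper for the actual estimator.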
| W/ standard prompts | ViT-B/16 | ViT-B/32 | ViT-L/14 | ResNet50 | ResNet101 |
|---|---|---|---|---|---|
| Zero-Shot | 65.3 | 61.9 | 72.6 | 58.7 | 59.5 |
| TDA | 67.7 ↑2.4 | 62.3 ↑0.4 | 73.5 ↑0.9 | 59.3 ↑0.6 | 60.6 ↑1.1 |
| DMN | 67.5 ↑2.2 | 61.8 ↓0.1 | 73.7 ↑1.1 | 58.6 ↓0.1 | 61.0 ↑1.5 |
| OGA (ours) | 68.5 ↑3.2 | 62.9 ↑1.0 | 74.3 ↑1.7 | 59.8 ↑1.1 | 61.6 ↑2.1 |
| W/ custom prompts | ViT-B/16 | ViT-B/32 | ViT-L/14 | ResNet50 | ResNet101 |
|---|---|---|---|---|---|
| Zero-Shot | 65.6 | 61.4 | 72.2 | 57.4 | 59.0 |
| TDA | 66.9 ↑1.3 | 62.3 ↑0.9 | 73.9 ↑1.7 | 58.1 ↑0.7 | 59.4 ↑0.4 |
| DMN | 66.4 ↑0.8 | 61.6 ↑0.2 | 74.4 ↑2.2 | 57.2 ↓0.2 | 60.3 ↑1.3 |
| OGA (ours) | 67.3 ↑1.7 | 62.8 ↑1.4 | 74.7 ↑2.5 | 58.4 ↑1.0 | 60.6 ↑1.6 |
Additionally, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. See illustration below.
Figure 1. The presented results are averaged over 100 runs. We propose the Expected Tail Accuracy (ETA), i.e., the average accuracy over the 10% worst runs, shown as a solid red line. Our method, OGA, not only significantly outperforms competitors on average, but its ETA also exceeds their average accuracy on several datasets (e.g., ImageNet and Pets). See our paper: https://arxiv.org/abs/2501.04352
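Given per-run accuracies for a dataset, ETA is straightforward to compute; a minimal sketch (a helper of our own, not part of the repository):

```python
import numpy as np

def expected_tail_accuracy(run_accuracies, tail_frac=0.10):
    """Average accuracy over the worst `tail_frac` fraction of runs
    (ETA, as defined above). `run_accuracies` holds one accuracy per run."""
    accs = np.sort(np.asarray(run_accuracies, dtype=float))
    k = max(1, int(round(tail_frac * accs.size)))   # size of the worst tail
    return float(accs[:k].mean())
```

With 100 runs and `tail_frac=0.10`, this averages the 10 lowest run accuracies, matching the setup of Figure 1.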
The repository also includes a lightweight implementation of TDA and DMN for training-free / zero-shot adaptation without test-time augmentations.
The repository depends on PyTorch and openai-clip.
Please follow DATASETS.md to install the datasets. You will get a structure with the following dataset names:
```
$DATA/
|–– caltech-101/
|–– oxford_pets/
|–– stanford_cars/
|–– oxford_flowers/
|–– food-101/
|–– fgvc_aircraft/
|–– sun397/
|–– dtd/
|–– eurosat/
|–– ucf101/
|–– imagenet/
```
The benchmarks are run using pre-computed features, as none of the available methods update the vision encoder. First, use compute_features.py to compute and store features and labels. Example:

```shell
python compute_features.py --data_root_path "E:/DATA" --backbone "vit_b16" --datasets 'sun397' 'imagenet' 'fgvc_aircraft' 'eurosat' 'food101' 'caltech101' 'oxford_pets' 'oxford_flowers' 'stanford_cars' 'dtd' 'ucf101'
```
/!\ Warning: the above command overwrites any previously stored features for the selected architecture. The features and targets are stored in a "cache" subfolder within each dataset folder. It should look like:

```
$DATA/
|–– caltech-101/
    |––cache/
|–– oxford_pets/
    |––cache/
|–– stanford_cars/
    |––cache/
...
```
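The stored features can then be read back for experimentation. A hedged sketch, assuming the cache holds one features file and one targets file per backbone (the file names below are our assumption, not the repository's actual naming; adjust them to match your cache):

```python
import torch
from pathlib import Path

def load_cached_features(dataset_dir, backbone="vit_b16"):
    """Hypothetical sketch: read the features/targets that
    compute_features.py stores in a dataset's "cache" subfolder.
    File names are illustrative and may differ in the actual repo."""
    cache = Path(dataset_dir) / "cache"
    feats = torch.load(cache / f"features_{backbone}.pt")
    labels = torch.load(cache / f"targets_{backbone}.pt")
    return feats, labels
```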
In our paper, we present results obtained atop TaskRes and CoOp. To reproduce the relevant results, you need to download the pre-computed prototypes from TransCLIP. Go to "Pre-computed prototypes" and download the 'Few_shot' folder from the provided drive. Place it in $DATA/clip_tuned_prompts/. It should look like:

```
$DATA/
|–– clip_tuned_prompts/
    |––Few_shot/
...
```
Results presented in our paper can be reproduced using main.py. Results are stored as a .json file (for quantities such as average batch accuracy per dataset) and a .pickle file (for detailed results such as per-batch accuracy) at $DATA/results/. Randomness is controlled by the --master_seed and --n_runs parameters: for a given (master_seed, n_runs) tuple, the generated runs are always the same. Note that you may still observe slight variations in results depending on your CUDA and PyTorch versions or hardware specifications. Example:

```shell
python main.py --data_root_path "E:/DATA" --adapt_method_name "TDA" --datasets 'sun397' 'imagenet' 'fgvc_aircraft' 'eurosat' 'food101' 'caltech101' 'oxford_pets' 'oxford_flowers' 'stanford_cars' 'dtd' 'ucf101'
```
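To inspect the stored results afterwards, the .json and .pickle files can be collected with the standard library; a minimal sketch (a helper of our own — the exact file names under $DATA/results/ depend on your run configuration):

```python
import json
import pickle
from pathlib import Path

def load_results(results_dir):
    """Hypothetical sketch: gather the summary .json files and the
    detailed .pickle files written under $DATA/results/, keyed by
    file stem. File names depend on the run configuration."""
    results_dir = Path(results_dir)
    summaries = {p.stem: json.loads(p.read_text()) for p in results_dir.glob("*.json")}
    details = {}
    for p in results_dir.glob("*.pickle"):
        with p.open("rb") as f:
            details[p.stem] = pickle.load(f)
    return summaries, details
```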
If you find this repository useful, please consider citing our paper:
```bibtex
@article{fuchs2025online,
  title={Online Gaussian Test-Time Adaptation of Vision-Language Models},
  author={Fuchs, Cl{\'e}ment and Zanella, Maxime and De Vleeschouwer, Christophe},
  journal={arXiv preprint arXiv:2501.04352},
  year={2025}
}
```
For any inquiries, please contact us at [email protected] and [email protected] or feel free to create an issue.