- The code for CLIPN with hand-crafted prompts is released (./hand-crafted).
- The code for CLIPN with learnable prompts is released (./src).
- Thanks to the valuable suggestions from the reviewers of CVPR 2023 and ICCV 2023, our paper has been significantly improved, allowing it to be published at ICCV 2023.
- If you are interested in CLIP-based open-vocabulary tasks, please feel free to visit our other work: "CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks" (github).
- CLIPN attains SoTA performance in zero-shot OOD detection while retaining the in-distribution (ID) classification ability of CLIP.
- CLIPN offers an approach for unsupervised prompt learning using an image-text paired web dataset.
- The main Python libraries of our experimental environment are listed in requirements.txt. You can install CLIPN as follows:
git clone https://github.com/xmed-lab/CLIPN.git
cd CLIPN
conda create -n CLIPN
conda activate CLIPN
pip install -r ./requirements.txt
- Pre-training dataset, CC3M. To download the CC3M dataset in webdataset format, please follow img2dataset.
Once you have downloaded CC3M, set your data root in ./src/run.sh.
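As a rough illustration of the download step, the sketch below calls the Python entry point of img2dataset to write CC3M as webdataset shards. The file name `cc3m.tsv`, the output folder `./data/cc3m`, the column names, and the image size and worker counts are all assumptions for illustration, not values taken from this repository; adjust them to your copy of the CC3M URL list. The helper `cc3m_download_kwargs` is hypothetical.

```python
from pathlib import Path


def cc3m_download_kwargs(url_list: str, output_folder: str) -> dict:
    """Build keyword arguments for img2dataset.download (values are assumptions)."""
    return {
        "url_list": url_list,           # TSV file listing captions and image URLs
        "input_format": "tsv",
        "url_col": "url",               # assumes the TSV header uses these names
        "caption_col": "caption",
        "output_format": "webdataset",  # write .tar shards readable by webdataset
        "output_folder": output_folder,
        "image_size": 256,              # hypothetical resize target
        "processes_count": 8,           # hypothetical worker settings
        "thread_count": 32,
    }


def main() -> None:
    # Imported lazily so the helper above can be inspected without img2dataset installed.
    from img2dataset import download

    kwargs = cc3m_download_kwargs("cc3m.tsv", "./data/cc3m")
    Path(kwargs["output_folder"]).mkdir(parents=True, exist_ok=True)
    download(**kwargs)


# Call main() to start the download; it is not invoked automatically here.
```

The folder passed as `output_folder` is what you would then use as the data root in ./src/run.sh.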
- OOD detection datasets.
- ID dataset, ImageNet-1K: The ImageNet-1k dataset (ILSVRC-2012) can be downloaded here.
- OOD datasets: iNaturalist, SUN, Places, and Texture. Please follow the instructions in the MOS and MCM repositories to download the subsampled datasets, from which classes that semantically overlap with ImageNet-1k have been removed.
Once you have downloaded the above datasets, set your data root in ./src/tuning_util.py.
- Pre-train CLIPN on CC3M. This step teaches CLIP the "no" logic using the web dataset.
- The model of CLIPN is defined in ./src/open_clip/model.py. Here, you can find a group of learnable 'no' token embeddings defined in Line 527.
- The function of loading parameters of CLIP is defined in ./src/open_clip/factory.py.
- The loss functions are defined in ./src/open_clip/loss.py.
- You can pre-train CLIPN on ViT-B-32 and ViT-B-16 by:
cd ./src
sh run.sh
- Zero-shot evaluation of CLIPN on ImageNet-1K.
- Metrics and the pipeline are defined in ./src/zero_shot_infer.py. Here you can find three baseline methods and our two inference algorithms, CTW and ATD (see Lines 91-96).
- Dataset details are defined in ./src/tuning_util.py.
- Inference models are defined in ./src/classification.py, including converting the text encoders into classifiers.
- You can download the models provided in the table below or pre-train them yourself. Then set the path of your models in the main function of ./src/zero_shot_infer.py. Finally, evaluate CLIPN by:
python3 zero_shot_infer.py
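To make the two inference algorithms more concrete, here is a minimal sketch of how yes/no decision rules of this kind could work for a single image. It assumes CTW stands for competing-to-win and ATD for agreeing-to-differ, and the formulas follow one reading of the paper's description; they are an assumption and are not copied from ./src/zero_shot_infer.py, which may differ in detail. The inputs `yes_logits` and `no_logits` are hypothetical per-class similarity logits from the standard text encoder and the "no" text encoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def no_probabilities(yes_logits, no_logits):
    """Per class, probability that the 'no' prompt beats the standard prompt."""
    # exp(n) / (exp(y) + exp(n)) rewritten as a sigmoid of (n - y).
    return [1.0 / (1.0 + math.exp(y - n)) for y, n in zip(yes_logits, no_logits)]


def ctw_is_ood(yes_logits, no_logits):
    """Assumed CTW rule: OOD if 'no' wins for the top-scoring class."""
    probs = softmax(yes_logits)
    p_no = no_probabilities(yes_logits, no_logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return p_no[top] > 0.5


def atd_is_ood(yes_logits, no_logits):
    """Assumed ATD rule: OOD if the implied 'unknown' mass exceeds every class."""
    probs = softmax(yes_logits)
    p_no = no_probabilities(yes_logits, no_logits)
    p_unknown = 1.0 - sum((1.0 - pn) * p for pn, p in zip(p_no, probs))
    return p_unknown > max(probs)


# Confident ID-like input: one class dominates and its 'no' logit is low.
print(ctw_is_ood([5.0, 1.0, 0.0], [0.0, 4.0, 3.0]), atd_is_ood([5.0, 1.0, 0.0], [0.0, 4.0, 3.0]))
# OOD-like input: no class stands out and every 'no' logit is high.
print(ctw_is_ood([1.0, 0.8, 0.9], [3.0, 3.0, 3.0]), atd_is_ood([1.0, 0.8, 0.9], [3.0, 3.0, 3.0]))
```

In this sketch CTW needs no threshold beyond the yes/no comparison for the top class, while ATD reweights every class by its "yes" probability before comparing against an implied unknown class.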
To ensure the reproducibility of the results, we repeated each configuration three times. The table below reports the most recent reproduced results obtained before open-sourcing.
- ImageNet-1K
| Methods | Repeat | iNaturalist AUROC | iNaturalist FPR95 | SUN AUROC | SUN FPR95 | Textures AUROC | Textures FPR95 | Places AUROC | Places FPR95 | Avg AUROC | Avg FPR95 | Model/log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-B-16 | | | | | | | | | | | | |
| CLIPN-CTW | 1 | 93.12 | 26.31 | 88.46 | 37.67 | 79.17 | 57.14 | 86.14 | 43.33 | _ | _ | here |
| | 2 | 93.48 | 21.06 | 89.79 | 30.31 | 83.31 | 46.44 | 88.21 | 33.85 | _ | _ | here |
| | 3 | 91.79 | 25.84 | 89.76 | 31.30 | 76.76 | 59.25 | 87.66 | 36.58 | _ | _ | here |
| | Avg | 92.80 | 24.41 | 89.34 | 33.09 | 79.75 | 54.28 | 87.34 | 37.92 | 87.31 | 37.42 | _ |
| CLIPN-ATD | 1 | 95.65 | 21.73 | 93.22 | 29.51 | 90.35 | 42.89 | 91.25 | 36.98 | _ | _ | here |
| | 2 | 96.67 | 16.71 | 94.77 | 23.41 | 92.46 | 34.73 | 93.39 | 29.24 | _ | _ | here |
| | 3 | 96.29 | 18.90 | 94.55 | 24.15 | 89.61 | 45.12 | 93.23 | 30.11 | _ | _ | here |
| | Avg | 96.20 | 19.11 | 94.18 | 25.69 | 90.81 | 40.91 | 92.62 | 32.11 | 93.45 | 29.46 | _ |
The performance in this table is better than that reported in our paper because we add an average learnable "no" prompt (see Lines 600-616 in ./src/open_clip/model.py).
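For readers unfamiliar with the two reported metrics, the following is a minimal pure-Python sketch of AUROC and FPR95. It assumes the common OOD convention that ID samples are the positive class and that a higher score means "more in-distribution"; the function names and toy scores are hypothetical, and the actual computation in ./src/zero_shot_infer.py may be implemented differently.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores above a random OOD sample."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5  # ties count as half
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when `tpr` of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))  # number of ID samples that must pass
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores (made up for illustration only).
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
print(auroc(id_scores, ood_scores))       # 0.9375
print(fpr_at_tpr(id_scores, ood_scores))  # 0.25
```

The table reports both metrics as percentages, so these fractions would be multiplied by 100. Higher AUROC and lower FPR95 indicate better OOD detection.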
There are several important factors that could affect the performance:
- Class prompt texts. At inference time, prompt texts are used to obtain the classifier weights (see ./src/prompt/prompt.txt). You can try designing higher-performance inference prompts for CLIPN.
- The number of learnable "no" tokens. It is currently set to 16. You can vary it to find an optimal value.
- If you have any ideas for enhancing CLIPN or want to transfer this idea to other topics, feel free to contact me; I am happy to share some ideas with you.
If you find our paper helpful, please consider citing it in your publications.
@inproceedings{wang2023clipn,
title={CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No},
author={Wang, Hualiang and Li, Yi and Yao, Huifeng and Li, Xiaomeng},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={1802--1812},
year={2023}
}
We sincerely thank the authors of these three highly valuable repositories: open_clip, MOS, and MCM.