CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that our method is 22.3% ahead of CLIP on average across 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods.
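As a minimal illustration of the patch self-correlation idea (not the full CLIPtrase recalibration), the sketch below hooks the last transformer block of OpenAI's CLIP ViT-B/16 and computes the cosine similarity between patch tokens; the image path is a placeholder.

```python
# Minimal sketch (not the full CLIPtrase pipeline): compute patch-wise cosine
# self-correlation from the last transformer block of OpenAI's CLIP ViT-B/16.
# "example.jpg" is a placeholder image path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

captured = {}

def grab_tokens(module, inputs, output):
    # Output of the last residual block, shape [num_tokens, batch, dim] (LND).
    captured["tokens"] = output.detach()

handle = model.visual.transformer.resblocks[-1].register_forward_hook(grab_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    model.encode_image(image)
handle.remove()

feats = captured["tokens"].permute(1, 0, 2).float()  # [batch, 197, dim]
patches = feats[:, 1:, :]                            # drop [CLS], keep the 14x14 patch grid
patches = patches / patches.norm(dim=-1, keepdim=True)
corr = patches @ patches.transpose(1, 2)             # [batch, 196, 196] cosine self-correlation
print(corr.shape)
```

Visualizing a row of `corr` as a 14x14 heatmap is one way to inspect the global-patch behavior discussed above.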
Full paper and supplementary materials: arxiv
- base environment: pytorch==1.12.1, torchvision==0.13.1 (CUDA 11.3)
python -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
- Detectron2 version: additionally install detectron2==0.6
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
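A quick sanity check of the installed environment (expected versions are the ones listed above; adjust if you installed different builds):

```python
# Verify the environment: versions below are the ones suggested in this README.
import torch, torchvision
import detectron2

print("torch:", torch.__version__)              # expected 1.12.1+cu113
print("torchvision:", torchvision.__version__)  # expected 0.13.1+cu113
print("detectron2:", detectron2.__version__)    # expected 0.6
print("CUDA available:", torch.cuda.is_available())
```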
- We follow the detectron2 format of the datasets; for the specific processing steps, refer to MaskFormer and SimSeg.
Update `configs/dataset_cfg.py` to your own paths (a hypothetical sketch of the path entries follows the directory tree below):
datasets/
--coco/
----...
----val2017/
----stuffthingmaps_detectron2/
------val2017/
--VOC2012/
----...
----images_detectron2/
------val/
----annotations_detectron2/
------val/
--pcontext/
----...
----val/
------image/
------label/
----pcontext_full/
----...
----val/
------image/
------label/
--ADEChallengeData2016/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
--ADE20K_2021_17_01/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
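The exact variable names inside `configs/dataset_cfg.py` may differ from the hypothetical sketch below; the point is simply that each benchmark entry should point at the corresponding root of the tree above.

```python
# Hypothetical illustration only -- the real names in configs/dataset_cfg.py may differ.
# Each entry maps a benchmark to the local root shown in the directory tree above.
DATASET_ROOT = "datasets"

COCO_PATH     = f"{DATASET_ROOT}/coco"
VOC2012_PATH  = f"{DATASET_ROOT}/VOC2012"
PCONTEXT_PATH = f"{DATASET_ROOT}/pcontext"
ADE150_PATH   = f"{DATASET_ROOT}/ADEChallengeData2016"
ADEFULL_PATH  = f"{DATASET_ROOT}/ADE20K_2021_17_01"
```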
- You can also use your own dataset; make sure it has `image` and `gt` files, and that the value of each pixel in the gt image is its corresponding label (a quick check is sketched below).
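A quick way to check a custom gt file (the path below is a placeholder): pixel values should be label indices.

```python
# Sanity-check a custom gt image: each pixel value should be its label index.
# The path is a placeholder; 255 is commonly used as the "ignore" label.
import numpy as np
from PIL import Image

gt = np.array(Image.open("datasets/my_dataset/gt/example.png"))
print("gt shape:", gt.shape)             # (H, W), single channel
print("labels present:", np.unique(gt))  # e.g. [0 1 5 12 255]
```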
- We provide a demo of the global patch in the notebook `global_patch_demo.ipynb`, where you can visualize the global patch phenomenon mentioned in our paper.
- Running with a single GPU:
python clip_self_correlation.py
- Running with multiple GPUs in the detectron2 version
Update: We provide a detectron2 framework version; the modified CLIP state dict keys can be found here. Download them and put them in the `outputs` folder. Note: the results of the d2 version differ slightly from those in the paper due to differences in preprocessing and resolution.
python -W ignore train_net.py --eval-only --config-file configs/clip_self_correlation.yaml --num-gpus 4 OUTPUT_DIR your_output_path MODEL.WEIGHTS your_model_path
- Results
Single RTX 3090, CLIP ViT-B/16, evaluated in 9 settings on COCO, ADE, PASCAL CONTEXT, and VOC.
Our results do not use any post-processing such as DenseCRF.
coco171, voc20, pc59, pc459, ade150, and adefull are evaluated w/o. background; coco80, voc21, and pc60 are evaluated w. background.

| Resolution | Metrics | coco171 | voc20 | pc59 | pc459 | ade150 | adefull | coco80 | voc21 | pc60 |
|------------|---------|---------|-------|------|-------|--------|---------|--------|-------|------|
| 224 | pAcc  | 38.9  | 89.68 | 58.94 | 44.18 | 38.57 | 25.45 | 50.08 | 78.63 | 52.14 |
| 224 | mAcc  | 44.47 | 91.4  | 57.08 | 21.53 | 39.17 | 18.78 | 62.5  | 84.11 | 56.08 |
| 224 | fwIoU | 26.87 | 82.49 | 45.28 | 35.22 | 27.96 | 18.99 | 38.19 | 67.67 | 37.61 |
| 224 | mIoU  | 22.84 | 80.95 | 33.83 | 9.36  | 16.35 | 6.31  | 43.56 | 50.88 | 29.87 |
| 336 | pAcc  | 40.14 | 89.51 | 60.15 | 45.61 | 39.92 | 26.73 | 50.01 | 79.93 | 53.21 |
| 336 | mAcc  | 45.09 | 91.77 | 57.47 | 21.26 | 37.75 | 17.99 | 62.55 | 85.24 | 56.43 |
| 336 | fwIoU | 27.96 | 82.15 | 46.64 | 36.66 | 29.17 | 20.3  | 38.24 | 69.1  | 38.76 |
| 336 | mIoU  | 24.06 | 81.2  | 34.92 | 9.95  | 17.04 | 5.89  | 44.84 | 53.04 | 30.79 |
- If you find this project useful, please consider citing:
@InProceedings{shao2024explore,
title={Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation},
author={Tong Shao and Zhuotao Tian and Hang Zhao and Jingyong Su},
booktitle={European Conference on Computer Vision},
organization={Springer},
year={2024}
}