CLIPTrase

[ECCV24] Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

1. Introduction

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite this success, applying CLIP to OVSS remains challenging because it was trained for image-level alignment, which hurts its performance on tasks requiring detailed local context. Our study examines the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach yields notable improvements in segmentation accuracy and in the ability to maintain semantic coherence across objects. Experiments show that CLIPtrase is on average 22.3% ahead of CLIP across 9 segmentation benchmarks and outperforms existing state-of-the-art training-free methods.

Full paper and supplementary materials: arxiv

1.1. Global Patch

[Figure: global patch]

1.2. Model Architecture

[Figure: model architecture]

2. Code

2.1. Environments

  • base environment: pytorch==1.12.1, torchvision==0.13.1 (CUDA11.3)
python -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
  • Detectron2 version: install detectron2==0.6 additionally
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
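
After installing, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch using only the Python standard library; the `EXPECTED` table and the `check_versions` helper are illustrative names, not part of this repository, and the distribution names (`torch`, `torchvision`, `detectron2`) are assumed to match what pip registers on your machine.

```python
from importlib import metadata

# Versions copied from the environment list above.
EXPECTED = {"torch": "1.12.1", "torchvision": "0.13.1", "detectron2": "0.6"}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            # Drop local build tags such as "+cu113" before comparing.
            have = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {have or 'not installed'}")
```

An empty result means every listed package is installed at the expected version.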

2.2. Data preparation

  • We follow the detectron2 format of the datasets:

    For the specific preprocessing steps, refer to MaskFormer and SimSeg.

    Update the dataset paths in configs/dataset_cfg.py to point to your own directories.

datasets/
--coco/
----...
----val2017/
----stuffthingmaps_detectron2/
------val2017/

--VOC2012/
----...
----images_detectron2/
------val/
----annotations_detectron2/
------val/

--pcontext/
----...
----val/
------image/
------label/

--pcontext_full/
----...
----val/
------image/
------label/

--ADEChallengeData2016/
----...
----images/
------validation/
----annotations_detectron2/
------validation/

--ADE20K_2021_17_01/
----...
----images/
------validation/
----annotations_detectron2/
------validation/       
  • You can also use your own dataset. Make sure that it has image and gt files, and that the value of each pixel in the gt image is its corresponding label.
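
Before running evaluation, a quick sanity check of the directory layout can save a failed run. The sketch below is not part of this repository: `EXPECTED_DIRS`, `missing_dirs`, and `unpaired_images` are hypothetical helpers. It assumes that `pcontext_full` sits directly under `datasets/`, and that an image and its gt file share the same file stem, which may not hold for your own dataset.

```python
from pathlib import Path

# Sub-directories shown explicitly in the tree above, relative to datasets/.
# Assumption: pcontext_full is a top-level sibling of pcontext.
EXPECTED_DIRS = [
    "coco/val2017",
    "coco/stuffthingmaps_detectron2/val2017",
    "VOC2012/images_detectron2/val",
    "VOC2012/annotations_detectron2/val",
    "pcontext/val/image",
    "pcontext/val/label",
    "pcontext_full/val/image",
    "pcontext_full/val/label",
    "ADEChallengeData2016/images/validation",
    "ADEChallengeData2016/annotations_detectron2/validation",
    "ADE20K_2021_17_01/images/validation",
    "ADE20K_2021_17_01/annotations_detectron2/validation",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


def unpaired_images(image_dir, label_dir):
    """Return stems of images that have no gt file with the same stem."""
    label_stems = {p.stem for p in Path(label_dir).iterdir() if p.is_file()}
    return sorted(
        p.stem
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.stem not in label_stems
    )
```

For example, `missing_dirs("datasets")` lists whatever still needs to be prepared, and `unpaired_images("datasets/pcontext/val/image", "datasets/pcontext/val/label")` lists images without a matching label.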

2.3. Global patch demo

  • We provide a demo of the global patch in the notebook global_patch_demo.ipynb, where you can visualize the global patch phenomenon mentioned in our paper.
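
The notebook is the reference for the actual visualization. As a rough, independent illustration of what it means for one patch to correlate strongly with most other patches, here is a toy sketch in plain Python. The scoring rule used here (mean cosine similarity to all other patches) and the 2-D feature vectors are assumptions made for illustration only; they are not the definition or the data used in the paper.

```python
import math


def cosine_similarity_matrix(features):
    """Return the N x N cosine-similarity matrix for N non-zero feature vectors."""
    normed = []
    for vec in features:
        norm = math.sqrt(sum(v * v for v in vec))
        normed.append([v / norm for v in vec])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]


def mean_similarity_to_others(sim):
    """Average each row of the similarity matrix, excluding the diagonal."""
    n = len(sim)
    return [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(sim)]


# Toy "patch features": two clusters plus one vector lying between them.
patches = [
    [1.0, 0.0],  # cluster A
    [0.9, 0.1],  # cluster A
    [0.0, 1.0],  # cluster B
    [0.1, 0.9],  # cluster B
    [1.0, 1.0],  # in-between vector, similar to every other patch
]

sim = cosine_similarity_matrix(patches)
scores = mean_similarity_to_others(sim)
# Index of the patch that is, on average, most similar to all the others.
most_global = max(range(len(scores)), key=scores.__getitem__)
print(most_global, [round(s, 3) for s in scores])
```

In this toy setup the in-between vector gets the highest score, since it is moderately similar to both clusters while the cluster members are only similar to their own neighbors.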

2.4. Training-free OVSS

  • Running with a single GPU
python clip_self_correlation.py
  • Running with multiple GPUs in the detectron2 version

    Update: We provide a detectron2 framework version. The CLIP weights with modified state keys can be found here; download them and put them in the outputs folder.

    Note: The results of the d2 version are slightly different from those in the paper due to differences in preprocessing and resolution.

python -W ignore train_net.py --eval-only --config-file configs/clip_self_correlation.yaml --num-gpus 4 OUTPUT_DIR your_output_path MODEL.WEIGHTS your_model_path
  • Results

    Evaluated on a single 3090 GPU with CLIP-B/16 in 9 settings across COCO, ADE, PASCAL CONTEXT, and VOC.

    Our results do not use any post-processing such as densecrf.

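Columns coco171 through adefull are evaluated w/o. background; columns coco80, voc21, and pc60 are evaluated w. background.

| Resolution | Metrics | coco171 | voc20 | pc59 | pc459 | ade150 | adefull | coco80 | voc21 | pc60 |
|---|---|---|---|---|---|---|---|---|---|---|
| 224 | pAcc | 38.9 | 89.68 | 58.94 | 44.18 | 38.57 | 25.45 | 50.08 | 78.63 | 52.14 |
| 224 | mAcc | 44.47 | 91.4 | 57.08 | 21.53 | 39.17 | 18.78 | 62.5 | 84.11 | 56.08 |
| 224 | fwIoU | 26.87 | 82.49 | 45.28 | 35.22 | 27.96 | 18.99 | 38.19 | 67.67 | 37.61 |
| 224 | mIoU | 22.84 | 80.95 | 33.83 | 9.36 | 16.35 | 6.31 | 43.56 | 50.88 | 29.87 |
| 336 | pAcc | 40.14 | 89.51 | 60.15 | 45.61 | 39.92 | 26.73 | 50.01 | 79.93 | 53.21 |
| 336 | mAcc | 45.09 | 91.77 | 57.47 | 21.26 | 37.75 | 17.99 | 62.55 | 85.24 | 56.43 |
| 336 | fwIoU | 27.96 | 82.15 | 46.64 | 36.66 | 29.17 | 20.3 | 38.24 | 69.1 | 38.76 |
| 336 | mIoU | 24.06 | 81.2 | 34.92 | 9.95 | 17.04 | 5.89 | 44.84 | 53.04 | 30.79 |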
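
For readers unfamiliar with the four metrics, the sketch below shows how pAcc, mAcc, fwIoU, and mIoU are commonly derived from a class confusion matrix. It follows the usual definitions and is not copied from this repository's evaluator, which may treat ignored labels or absent classes differently. The function name `segmentation_metrics` and the toy matrix are illustrative.

```python
def segmentation_metrics(conf):
    """Compute pAcc, mAcc, fwIoU and mIoU (in percent) from a confusion matrix.

    conf[i][j] is the number of pixels whose ground-truth class is i and
    whose predicted class is j. Classes absent from the ground truth are
    skipped when averaging.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    gt = [sum(conf[i]) for i in range(n)]                          # pixels per gt class
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]   # pixels per predicted class
    tp = [conf[i][i] for i in range(n)]                            # correctly labeled pixels

    valid = [i for i in range(n) if gt[i] > 0]
    acc = [tp[i] / gt[i] for i in valid]
    iou = [tp[i] / (gt[i] + pred[i] - tp[i]) for i in valid]
    weight = [gt[i] / total for i in valid]                        # class frequency in gt

    return {
        "pAcc": 100.0 * sum(tp) / total,
        "mAcc": 100.0 * sum(acc) / len(acc),
        "fwIoU": 100.0 * sum(w * v for w, v in zip(weight, iou)),
        "mIoU": 100.0 * sum(iou) / len(iou),
    }


# Toy two-class example with 10 pixels in total.
metrics = segmentation_metrics([[3, 1], [1, 5]])
print({k: round(v, 2) for k, v in metrics.items()})
```

fwIoU weights each class IoU by how often the class appears in the ground truth, so it is dominated by frequent classes, while mIoU treats all classes equally. That difference is one plausible reason the two diverge most on the large-vocabulary columns such as pc459 and adefull.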

Citation

  • If you find this project useful, please consider citing:
@InProceedings{shao2024explore,
    title={Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation},
    author={Tong Shao and Zhuotao Tian and Hang Zhao and Jingyong Su},
    booktitle={European Conference on Computer Vision},
    organization={Springer},
    year={2024}
}