CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that our method is 22.3% ahead of CLIP on average across 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods.
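As a minimal illustration of the patch self-correlation idea (not the full CLIPtrase recalibration), the sketch below hooks the last transformer block of OpenAI's CLIP ViT-B/16 and computes the cosine similarity between patch tokens; the image path is a placeholder.

```python
# Minimal sketch (not the full CLIPtrase pipeline): compute patch-wise cosine
# self-correlation from the last transformer block of OpenAI's CLIP ViT-B/16.
# "example.jpg" is a placeholder image path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

captured = {}

def grab_tokens(module, inputs, output):
    # Output of the last residual block, shape [num_tokens, batch, dim] (LND).
    captured["tokens"] = output.detach()

handle = model.visual.transformer.resblocks[-1].register_forward_hook(grab_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    model.encode_image(image)
handle.remove()

feats = captured["tokens"].permute(1, 0, 2).float()  # [batch, 197, dim]
patches = feats[:, 1:, :]                            # drop [CLS], keep the 14x14 patch grid
patches = patches / patches.norm(dim=-1, keepdim=True)
corr = patches @ patches.transpose(1, 2)             # [batch, 196, 196] cosine self-correlation
print(corr.shape)
```

Visualizing a row of `corr` as a 14x14 heatmap is one way to inspect the global-patch behavior discussed above.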
Full paper and supplementary materials: arxiv
- base environment: pytorch==1.12.1, torchvision==0.13.1 (CUDA 11.3)
python -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
- Detectron2 version: additionally install detectron2==0.6
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
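A quick sanity check of the installed environment (expected versions are the ones listed above; adjust if you installed different builds):

```python
# Verify the environment: versions below are the ones suggested in this README.
import torch, torchvision
import detectron2

print("torch:", torch.__version__)              # expected 1.12.1+cu113
print("torchvision:", torchvision.__version__)  # expected 0.13.1+cu113
print("detectron2:", detectron2.__version__)    # expected 0.6
print("CUDA available:", torch.cuda.is_available())
```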
- We follow the detectron2 format of the datasets; for the specific processing steps, refer to MaskFormer and SimSeg.
Update `configs/dataset_cfg.py` to your own paths (a hypothetical sketch of the path entries follows the directory tree below):
datasets/
--coco/
----...
----val2017/
----stuffthingmaps_detectron2/
------val2017/
--VOC2012/
----...
----images_detectron2/
------val/
----annotations_detectron2/
------val/
--pcontext/
----...
----val/
------image/
------label/
----pcontext_full/
----...
----val/
------image/
------label/
--ADEChallengeData2016/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
--ADE20K_2021_17_01/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
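The exact variable names inside `configs/dataset_cfg.py` may differ from the hypothetical sketch below; the point is simply that each benchmark entry should point at the corresponding root of the tree above.

```python
# Hypothetical illustration only -- the real names in configs/dataset_cfg.py may differ.
# Each entry maps a benchmark to the local root shown in the directory tree above.
DATASET_ROOT = "datasets"

COCO_PATH     = f"{DATASET_ROOT}/coco"
VOC2012_PATH  = f"{DATASET_ROOT}/VOC2012"
PCONTEXT_PATH = f"{DATASET_ROOT}/pcontext"
ADE150_PATH   = f"{DATASET_ROOT}/ADEChallengeData2016"
ADEFULL_PATH  = f"{DATASET_ROOT}/ADE20K_2021_17_01"
```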
- You can also use your own dataset; make sure it has `image` and `gt` files, and that the value of each pixel in the gt image is its corresponding label (a quick check is sketched below).
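A quick way to check a custom gt file (the path below is a placeholder): pixel values should be label indices.

```python
# Sanity-check a custom gt image: each pixel value should be its label index.
# The path is a placeholder; 255 is commonly used as the "ignore" label.
import numpy as np
from PIL import Image

gt = np.array(Image.open("datasets/my_dataset/gt/example.png"))
print("gt shape:", gt.shape)             # (H, W), single channel
print("labels present:", np.unique(gt))  # e.g. [0 1 5 12 255]
```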
- We provide a demo of the global patch in the notebook `global_patch_demo.ipynb`, where you can visualize the global patch phenomenon mentioned in our paper.
- Running with a single GPU:
python clip_self_correlation.py
- Running with multiple GPUs in the detectron2 version
Update: We provide a detectron2 framework version; the modified CLIP state dict keys can be found here. Download them and put them in the `outputs` folder. Note: the results of the d2 version differ slightly from those in the paper due to differences in preprocessing and resolution.
python -W ignore train_net.py --eval-only --config-file configs/clip_self_correlation.yaml --num-gpus 4 OUTPUT_DIR your_output_path MODEL.WEIGHTS your_model_path
- Results
Single RTX 3090, CLIP ViT-B/16, evaluated in 9 settings on COCO, ADE, PASCAL CONTEXT, and VOC.
Our results do not use any post-processing such as DenseCRF.
coco171, voc20, pc59, pc459, ade150, and adefull are evaluated w/o. background; coco80, voc21, and pc60 are evaluated w. background.

| Resolution | Metrics | coco171 | voc20 | pc59 | pc459 | ade150 | adefull | coco80 | voc21 | pc60 |
|------------|---------|---------|-------|------|-------|--------|---------|--------|-------|------|
| 224 | pAcc  | 38.9  | 89.68 | 58.94 | 44.18 | 38.57 | 25.45 | 50.08 | 78.63 | 52.14 |
| 224 | mAcc  | 44.47 | 91.4  | 57.08 | 21.53 | 39.17 | 18.78 | 62.5  | 84.11 | 56.08 |
| 224 | fwIoU | 26.87 | 82.49 | 45.28 | 35.22 | 27.96 | 18.99 | 38.19 | 67.67 | 37.61 |
| 224 | mIoU  | 22.84 | 80.95 | 33.83 | 9.36  | 16.35 | 6.31  | 43.56 | 50.88 | 29.87 |
| 336 | pAcc  | 40.14 | 89.51 | 60.15 | 45.61 | 39.92 | 26.73 | 50.01 | 79.93 | 53.21 |
| 336 | mAcc  | 45.09 | 91.77 | 57.47 | 21.26 | 37.75 | 17.99 | 62.55 | 85.24 | 56.43 |
| 336 | fwIoU | 27.96 | 82.15 | 46.64 | 36.66 | 29.17 | 20.3  | 38.24 | 69.1  | 38.76 |
| 336 | mIoU  | 24.06 | 81.2  | 34.92 | 9.95  | 17.04 | 5.89  | 44.84 | 53.04 | 30.79 |
- If you find this project useful, please consider citing:
@InProceedings{shao2024explore,
title={Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation},
author={Tong Shao and Zhuotao Tian and Hang Zhao and Jingyong Su},
booktitle={European Conference on Computer Vision},
organization={Springer},
year={2024}
}