CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation (CVPR 2023)

🎉 🎉 🎉 News

2023/12/09 Our new paper TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training has been accepted by AAAI 2024. It generates image-level labels with a frozen CLIP and, combined with CLIP-ES, enables annotation-free semantic segmentation without any training.
2023/2/28 Our paper has been accepted by CVPR 2023.

Requirements

# create conda env
conda create -n clip-es python=3.9
conda activate clip-es

# install packages
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install opencv-python ftfy regex tqdm ttach tensorboard lxml cython

# install pydensecrf from source
git clone https://github.com/lucasb-eyer/pydensecrf
cd pydensecrf
python setup.py install
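
After installation, a quick check that every package can be located by Python may save a failed run later. The sketch below is not part of the repository; the list of import names is an assumption derived from the `pip install` lines above (for example, `opencv-python` is imported as `cv2`).

```python
import importlib.util

# Import names assumed for the packages installed above
# (opencv-python is imported as cv2, cython as Cython).
REQUIRED_MODULES = [
    "torch", "torchvision", "cv2", "ftfy", "regex", "tqdm",
    "ttach", "tensorboard", "lxml", "Cython", "pydensecrf",
]


def find_missing(modules):
    """Return the module names that Python cannot locate in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Note that this only confirms the modules can be found; it does not verify versions or CUDA support.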

Preparing Datasets

PASCAL VOC2012

Download the PASCAL VOC2012 images here and the train_aug ground truth here. The directory /your_home_dir/datasets/VOC2012 should be organized as follows:

---VOC2012/
       --Annotations
       --ImageSets
       --JPEGImages
       --SegmentationClass
       --SegmentationClassAug
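
The layout above can be verified before running any script. The following is a minimal sketch, not a utility shipped with the repository; the helper name `missing_dirs` is hypothetical, and the directory names are taken from the listing above.

```python
from pathlib import Path

# Sub-directories listed in the VOC2012 layout above.
VOC2012_DIRS = [
    "Annotations",
    "ImageSets",
    "JPEGImages",
    "SegmentationClass",
    "SegmentationClassAug",
]


def missing_dirs(root, expected):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Placeholder path from this README; replace with your own location.
    print(missing_dirs("/your_home_dir/datasets/VOC2012", VOC2012_DIRS))
```

An empty list means every expected directory is present. Nested entries such as `JPEGImages/train2014` can be passed in `expected` as well, so the same helper should also cover other dataset layouts.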

MS COCO2014

Download the MS COCO images from the official website and the semantic segmentation annotations for MS COCO here. The directory /your_home_dir/datasets/COCO2014 should be organized as follows:

---COCO2014/
       --Annotations
       --JPEGImages
           -train2014
           -val2014
       --SegmentationClass

Preparing pre-trained model

Download the pre-trained CLIP model [ViT-B/16] here and put it in /your_home_dir/pretrained_models/clip.

Usage

Step 1. Generate CAMs for train (train_aug) set.

# For VOC12
CUDA_VISIBLE_DEVICES=0 python generate_cams_voc12.py --img_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --model /your_home_dir/pretrained_models/clip/ViT-B-16.pt --num_workers 1 --cam_out_dir ./output/voc12/cams

# For COCO14
CUDA_VISIBLE_DEVICES=0 python generate_cams_coco14.py --img_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --model /your_home_dir/pretrained_models/clip/ViT-B-16.pt --num_workers 1 --cam_out_dir ./output/coco14/cams
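
Both commands differ only in the script name, image directory, split file, and output directory. As a convenience, they can be assembled from a single base directory. This is a sketch rather than part of the repository: `build_cam_command` is a hypothetical helper, and it uses only the scripts, flags, and paths shown in the commands above.

```python
def build_cam_command(home_dir, dataset="voc12"):
    """Assemble the Step 1 command for one dataset as an argument list."""
    # (script, image directory under home_dir, split file) per dataset,
    # copied from the commands above.
    settings = {
        "voc12": ("generate_cams_voc12.py",
                  "datasets/VOC2012/JPEGImages",
                  "./voc12/train_aug.txt"),
        "coco14": ("generate_cams_coco14.py",
                   "datasets/COCO2014/JPEGImages/train2014",
                   "./coco14/train.txt"),
    }
    script, img_dir, split_file = settings[dataset]
    return [
        "python", script,
        "--img_root", f"{home_dir}/{img_dir}",
        "--split_file", split_file,
        "--model", f"{home_dir}/pretrained_models/clip/ViT-B-16.pt",
        "--num_workers", "1",
        "--cam_out_dir", f"./output/{dataset}/cams",
    ]


if __name__ == "__main__":
    print(" ".join(build_cam_command("/your_home_dir", "voc12")))
```

The returned list can be passed to `subprocess.run` together with an environment in which `CUDA_VISIBLE_DEVICES` is set to select the GPU.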

Step 2. Evaluate generated CAMs and use CRF to postprocess

# (optional) evaluate generated CAMs
## for VOC12
python eval_cam.py --cam_out_dir ./output/voc12/cams --cam_type attn_highres --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --split_file ./voc12/train.txt
## for COCO14
python eval_cam.py --cam_out_dir ./output/coco14/cams --cam_type attn_highres --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --split_file ./coco14/train.txt

# use CRF post-processing to generate pseudo masks
# (realizes the confidence-guided loss by setting low-confidence pixels to 255)
## for VOC12 
python eval_cam_with_crf.py --cam_out_dir ./output/voc12/cams --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --image_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --pseudo_mask_save_path ./output/voc12/pseudo_masks
## for COCO14
python eval_cam_with_crf.py --cam_out_dir ./output/coco14/cams --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --image_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --pseudo_mask_save_path ./output/coco2014/pseudo_masks

# eval CRF processed pseudo masks
## for VOC12 
python eval_cam_with_crf.py --cam_out_dir ./output/voc12/cams --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --image_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --eval_only
## for COCO14
python eval_cam_with_crf.py --cam_out_dir ./output/coco14/cams --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --image_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --eval_only

The generated pseudo masks of VOC12 and COCO14 can be found at Google Drive.
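
To illustrate how setting low-confidence pixels to 255 realizes a confidence-guided loss, here is a minimal NumPy sketch. It is not the code in eval_cam_with_crf.py: the function name, the `(C, H, W)` probability layout, and the threshold value are assumptions made for the example. Only the ignore value 255 comes from the comment above.

```python
import numpy as np

IGNORE_LABEL = 255  # value used for low-confidence pixels, as stated above


def to_pseudo_mask(probs, threshold=0.95):
    """Turn per-class probabilities of shape (C, H, W) into a mask of shape (H, W).

    Each pixel gets the index of its most probable class. Pixels whose top
    probability is below `threshold` (an illustrative value) are set to
    IGNORE_LABEL instead.
    """
    probs = np.asarray(probs, dtype=np.float32)
    labels = probs.argmax(axis=0).astype(np.uint8)
    confidence = probs.max(axis=0)
    labels[confidence < threshold] = IGNORE_LABEL
    return labels


if __name__ == "__main__":
    # Two classes, one row, two pixels: the first is confident, the second is not.
    example = [[[0.99, 0.6]], [[0.01, 0.4]]]
    print(to_pseudo_mask(example))
```

A segmentation loss configured to skip that label (for example `ignore_index=255` in PyTorch's cross-entropy loss) then receives no gradient from the uncertain pixels.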

Step 3. Train Segmentation Model

To train DeepLab-v2, we refer to deeplab-pytorch. The ImageNet pre-trained model can be found in AdvCAM.

Results

Quality of the generated pseudo masks on the PASCAL VOC2012 train set (mIoU, %).

Method	CAMs	+CRF
CLIP-ES	70.8	75.0
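
Scores of this kind are mean intersection-over-union (mIoU) values in percent. For readers unfamiliar with the metric, the sketch below computes it on flattened label sequences. It is an illustration in plain Python, not the implementation in eval_cam.py; the function name and the choice to average only over classes that occur are assumptions.

```python
def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean IoU over classes that occur in pred or gt.

    pred and gt are flat sequences of integer labels of equal length.
    Pixels whose ground-truth label equals ignore_label are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue
        if p == g:
            intersection[g] += 1
            union[g] += 1
        else:
            union[g] += 1  # missed pixel of class g
            if 0 <= p < num_classes:
                union[p] += 1  # wrongly predicted pixel of class p
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 255], num_classes=2))
```

The function returns a fraction in [0, 1]; multiply by 100 to compare with tables that report percentages.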

Segmentation results on the PASCAL VOC2012 val and test sets (mIoU, %).

Method	Network	Pretrained	val	test
CLIP-ES	DeepLabV2	ImageNet	71.1	71.4
CLIP-ES	DeepLabV2	COCO	73.8	73.9

Segmentation results on the MS COCO2014 val set (mIoU, %).

Method	Network	Pretrained	val
CLIP-ES	DeepLabV2	ImageNet	45.4

Acknowledgement

We borrowed code from CLIP and pytorch_grad_cam. Thanks to the authors for their wonderful work.

Citation

If you find this project helpful for your research, please consider citing the following BibTeX entry.

@InProceedings{Lin_2023_CVPR,
    author    = {Lin, Yuqi and Chen, Minghao and Wang, Wenxiao and Wu, Boxi and Li, Ke and Lin, Binbin and Liu, Haifeng and He, Xiaofei},
    title     = {CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {15305-15314}
}
