📌 This is an official PyTorch implementation of CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
Wuyang Li1, Xinyu Liu1, Jiayi Ma2, Yixuan Yuan1
1 The Chinese University of Hong Kong; 2 Wuhan University
Contact: [email protected]
- [09-26-2024] Code is released
- [08-12-2024] CLIFF is selected as an oral presentation
- [07-14-2024] CLIFF is accepted at ECCV 2024
CLIFF is a probabilistic pipeline that models the distribution transition among the object, CLIP image, and CLIP text subspaces with a continual diffusion process. Our contributions can be summarized as follows:
- Leveraging the diffusion process to continually model the distribution transfer from the object subspace to the CLIP image and text subspaces.
- A simple and lightweight latent diffuser with an MLP architecture, deployed in the object and CLIP embedding spaces (a minimal sketch is given below).
- An efficient diffusion process with only 10 time-steps, incurring no obvious runtime overhead.
- As a byproduct, CLIFF connects VAEs and diffusion models by sampling object-centric noise.
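To make the idea concrete, below is a minimal, illustrative sketch of what a lightweight MLP latent diffuser with a 10-step schedule can look like in a CLIP-sized embedding space. This is not the released implementation; the class and argument names (`LatentDiffuser`, `embed_dim`, `num_steps`) and the noise schedule are our own assumptions for illustration.

```python
# Illustrative sketch only: a tiny MLP diffuser operating on CLIP-sized embeddings
# with a 10-step DDPM-style schedule. Names and hyper-parameters are assumptions.
import torch
import torch.nn as nn

class LatentDiffuser(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=1024, num_steps=10):
        super().__init__()
        self.num_steps = num_steps
        # a learned per-step embedding keeps the module lightweight
        self.time_embed = nn.Embedding(num_steps, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        # linear beta schedule over only 10 steps
        betas = torch.linspace(1e-4, 2e-2, num_steps)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def q_sample(self, x0, t, noise):
        # forward process: corrupt the clean embedding x0 at step t
        a = self.alphas_cumprod[t].unsqueeze(-1)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    def forward(self, x_t, t):
        # predict the noise from the noisy embedding and the time step
        h = torch.cat([x_t, self.time_embed(t)], dim=-1)
        return self.mlp(h)

# usage: corrupt per-RoI object embeddings and predict the injected noise
diffuser = LatentDiffuser()
obj_emb = torch.randn(4, 512)
t = torch.randint(0, diffuser.num_steps, (4,))
noisy = diffuser.q_sample(obj_emb, t, torch.randn_like(obj_emb))
pred_noise = diffuser(noisy, t)
```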
# clone the repo
git clone https://github.com/CUHK-AIM-Group/CLIFF.git
# conda envs
conda create -n cliff python=3.9 -y
conda activate cliff
# [Optional] check your CUDA version and modify accordingly
export CUDA_HOME=/usr/local/cuda-11.3
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
# install pre-built detectron2
python -m pip install detectron2 -f \
https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
# install dependencies
pip install -r requirements.txt
Please follow the steps in DATASETS.md to prepare the dataset.
Then, set the dataset root `_root=$YOUR_DATASET_ROOT` in `ovd/datasets/coco_zeroshot.py`, and update the dataset paths in all YAML config files (`configs/xxx.yaml`) accordingly.
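For reference, the edit in `ovd/datasets/coco_zeroshot.py` is a single assignment roughly of the form below (the exact surrounding code may differ):

```python
# ovd/datasets/coco_zeroshot.py
# Point _root at the dataset directory prepared via DATASETS.md.
_root = "/path/to/your/dataset_root"
```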
We release an MVP version of CLIFF for CIFAR-10 classification in the folder cifar10, which is much more user-friendly and cute. The general idea is to generate a CLIP text embedding with the same diffuser used in CLIFF and measure the Euclidean distance between the generated embedding and the CLIP class embeddings to make the class decision. You can directly transfer this simple version to your project. The code reference is https://github.com/kuangliu/pytorch-cifar.
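As a rough illustration of that class decision (not the actual cifar10 code; the function and variable names here are ours), the rule boils down to a nearest-neighbour lookup in the CLIP text embedding space:

```python
# Illustrative sketch: pick the class whose CLIP text embedding is closest
# (in Euclidean distance) to the embedding generated by the diffuser.
import torch

def classify_by_distance(gen_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """
    gen_emb:   (B, D) embeddings produced by the diffuser, one per image
    text_embs: (C, D) CLIP text embeddings, one per CIFAR-10 class
    returns:   (B,) predicted class indices
    """
    dists = torch.cdist(gen_emb, text_embs, p=2)  # (B, C) pairwise distances
    return dists.argmin(dim=1)                    # smallest distance wins

# usage with random placeholders (D = 512 for a ViT-B/32 CLIP text encoder)
preds = classify_by_distance(torch.randn(8, 512), torch.randn(10, 512))
```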
Feature Extractor | Baseline | Ours |
---|---|---|
ResNet-18 | 93.02% | 95.00% |
Follow the steps below to train and evaluate the MVP code.
cd cifar10
python train_net_diffusion.py \
--lr 0.1 \
--dir ./experiments
It's worth noting that there may be around a 1.0% fluctuation in accuracy when reproducing these results, due to training randomness.
We provide the CLIFF models for COCO OVD trained with different random seeds:

SEED | mAP (novel) | mAP (base) | mAP (all) | Link |
---|---|---|---|---|
9 | 43.07 | 54.53 | 51.54 | OneDrive |
99 | 43.36 | 54.39 | 51.51 | |
999 | 43.44 | 54.44 | 51.51 | |
To evaluate the model with 4 GPUs, use the following commands. You can change `CUDA_VISIBLE_DEVICES` and `--num-gpus` to use a different number of GPUs:
CUDA_VISIBLE_DEVICES=1,2,3,4 python train_net_diffusion.py \
--num-gpus 4 \
--config-file /path/to/config/name.yaml \
--eval-only \
MODEL.WEIGHTS /path/to/weight.pth \
SEED 999
For example:
CUDA_VISIBLE_DEVICES=1,2 python train_net_diffusion.py \
--num-gpus 2 \
--config-file configs/CLIFF_COCO_RCNN-C4_obj2img2txt_stage2.yaml \
--eval-only \
MODEL.WEIGHTS ./cliff_model_coco_ovd.pth \
SEED 999
Since I recently changed my working position, I no longer have access to the original GPU resources. We are currently double-checking the training process on other machines after cleaning up the code, and will release the training code as soon as possible. Nonetheless, we have provided an unverified training script, train.sh.
We greatly appreciate the tremendous effort behind the following projects!
- Release code for CLIFF
- Release the MVP version for CLIFF
- Release the verified training scripts
- Release the code and model for LVIS setting
If you think our work is helpful for your project, we would greatly appreciate it if you could consider citing our work:
@inproceedings{Li2024cliff,
title={CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection},
author={Li, Wuyang and Liu, Xinyu and Ma, Jiayi and Yuan, Yixuan},
booktitle={ECCV},
year={2024}
}