Holistic Features are almost Sufficient for Text-to-Video Retrieval

The official source code of our CVPR 2024 paper "Holistic Features are almost Sufficient for Text-to-Video Retrieval" (TeachCLIP).

Environment

We use Anaconda to set up a deep learning workspace that supports PyTorch. Run the following commands to create the environment and install all required packages.

conda create -n TeachCLIP python==3.8 -y
conda activate TeachCLIP
git clone https://github.com/ruc-aimc-lab/TeachCLIP.git
cd TeachCLIP
pip install -r requirements.txt
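
To verify the installation, you can run a quick sanity check (optional; it just confirms that PyTorch is importable and can see your GPUs):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"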

Data

Data download

  • We provide the annotations of five datasets, together with checkpoints of three teachers (X-CLIP, TS2-Net and XPool) trained on each of them, on Google Drive. Video captions and data splits are provided in the Annotations and VideoSet folders.

  • For raw videos, you can refer to the guide from CLIP4Clip: Data Preparing. Put the videos into the corresponding VideoData folder of each dataset; using symbolic links is recommended, as shown below.
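
For instance, assuming the raw MSR-VTT videos sit in /path/to/msrvtt/videos (a placeholder path), a symbolic link keeps the expected layout without copying files:

ln -s /path/to/msrvtt/videos data/datasets/msrvtt/VideoData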

Data organization

Before running the code, please organize the downloaded data in the following structure. (The Models and FeatureData folders will be generated automatically during training and testing, respectively.)

data
├── datasets
│   ├── msrvtt
│   │   ├── Annotations
│   │   │   ├── MSRVTT_data.json
│   │   │   ├── MSRVTT_JSFUSION_test.csv
│   │   │   └── ...
│   │   ├── FeatureData
│   │   ├── Models
│   │   │   └── msrvtt-7k_xclip+ts2net-as-teacher_vit32
│   │   │       ├── run0
│   │   │       └── ...
│   │   ├── QuerySet
│   │   │   ├── msrvtt1k-test-query.txt
│   │   │   ├── msrvtt3k-test-query.txt
│   │   │   └── ...
│   │   ├── VideoData
│   │   │   ├── video0.mp4
│   │   │   ├── video1.mp4
│   │   │   └── ...
│   │   └── VideoSet
│   │       ├── msrvtt1k-test.txt
│   │       ├── msrvtt1k-train.txt
│   │       └── ...
│   ├── activitynet
│   ├── didemo
│   ├── msvd
│   └── vatex
└── teacher_checkpoints
    ├── xclip
    │   ├── didemo_xclip_model.bin
    │   ├── msrvtt-7k_xclip_model.bin
    │   └── ...
    ├── ts2net
    └── xpool

Code

Training

Prepare a config file before training; we provide a demo config file for each dataset under configs. You can train TeachCLIP on specified GPUs and a specified dataset using the following command (taking msrvtt-9k as an example):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py --config_path configs/msrvtt-9k.yaml
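
The --nproc_per_node value should match the number of visible GPUs. For example, to train on four GPUs instead, only the GPU settings change:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py --config_path configs/msrvtt-9k.yaml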

Inference

Use the following commands to extract video and text features:

bash do_extract_video_feat.sh $test_collection $videoset $model_name
# e.g. bash do_extract_video_feat.sh msrvtt msrvtt1k-test msrvtt/Models/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0

bash do_extract_text_feat.sh $test_collection $queryset $model_name
# e.g. bash do_extract_text_feat.sh msrvtt msrvtt1k-test-query msrvtt/Models/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0

Evaluation

After extracting the text and video features, the evaluation metrics can be computed with the following command:

bash do_eval.sh $test_collection $text_feat_name $video_feat_name $gt_file_name
# e.g. bash do_eval.sh msrvtt msrvtt1k-test-query/msrvtt/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0 msrvtt1k-test/msrvtt/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0 msrvtt1k-gt
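
For reference, text-to-video retrieval metrics are computed by ranking all candidate videos for each query by feature similarity and measuring where the ground-truth video lands. The Python sketch below illustrates the idea; the .npy file names, the feature format, and the one-to-one query-video ground truth are illustrative assumptions, not the actual output format of the scripts above, so use do_eval.sh for official numbers.

import numpy as np

# Hypothetical inputs: L2-normalized text and video features where query i
# matches video i. The real feature files written by the scripts may differ.
text_feat = np.load("text_feat.npy")    # shape: (num_queries, dim)
video_feat = np.load("video_feat.npy")  # shape: (num_videos, dim)

sim = text_feat @ video_feat.T  # cosine similarity for normalized features

# Rank (0 = best) of each query's ground-truth video.
order = np.argsort(-sim, axis=1)
ranks = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1)

for k in (1, 5, 10):
    print(f"R@{k}: {100.0 * np.mean(ranks < k):.1f}")
print(f"Median rank: {np.median(ranks) + 1:.0f}")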

Citation

If you find our method useful in your work, please cite:

@inproceedings{teachclip,
  title     = {Holistic Features are almost Sufficient for Text-to-Video Retrieval},
  author    = {Tian, Kaibin and Zhao, Ruixiang and Xin, Zijie and Lan, Bangxiang and Li, Xirong},
  booktitle = {CVPR},
  year      = {2024}
}

Acknowledgments

The implementation of TeachCLIP relies on resources from CLIP4Clip, X-CLIP and XPool. We thank the original authors for open-sourcing their code.

Contact

If you encounter any issues when running the code, please feel free to reach us, either by creating a new issue on GitHub or by email.
