This is the official source code of TeachCLIP, our CVPR 2024 paper "Holistic Features are almost Sufficient for Text-to-Video Retrieval".
We use Anaconda to set up a deep learning workspace that supports PyTorch. Run the following commands to install all the required packages.
```bash
conda create -n TeachCLIP python==3.8 -y
conda activate TeachCLIP
git clone https://github.com/ruc-aimc-lab/TeachCLIP.git
cd TeachCLIP
pip install -r requirements.txt
```
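Optionally, a quick sanity check (not part of the repo) can confirm that PyTorch and your GPUs are visible inside the new environment before launching distributed training:

```python
# Optional sanity check: confirm PyTorch and CUDA are visible in the
# TeachCLIP environment before launching distributed training.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("Visible GPUs:   ", torch.cuda.device_count())
```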
- We provide annotations of five datasets and checkpoints of three teachers (X-CLIP, TS2-Net and XPool) trained on five datasets on Google Drive. Video captions and data splits are provided in `Annotations` and `VideoSet`.
- For raw videos, you can refer to the guide from CLIP4Clip: Data Preparing. Put the videos into the corresponding `VideoData` folder for each dataset. (It is recommended to use symbolic links; see the sketch after this list.)
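For example, here is a minimal sketch for linking already-downloaded MSR-VTT videos into `VideoData`. The source directory is a placeholder for wherever your raw videos live; a plain `ln -s` works just as well.

```python
# Minimal sketch: symlink raw videos into the VideoData folder expected by
# this repo instead of copying them.
from pathlib import Path

src_dir = Path("/path/to/raw/msrvtt/videos")      # placeholder: your download location
dst_dir = Path("data/datasets/msrvtt/VideoData")  # layout expected by this repo
dst_dir.mkdir(parents=True, exist_ok=True)

for video in src_dir.glob("*.mp4"):
    link = dst_dir / video.name
    if not link.exists():
        link.symlink_to(video.resolve())
```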
Before running the code, please organize the downloaded data in the following format. (The `Models` and `FeatureData` folders will be automatically generated during training and testing, respectively.)
```
data
├── datasets
│   ├── msrvtt
│   │   ├── Annotations
│   │   │   ├── MSRVTT_data.json
│   │   │   ├── MSRVTT_JSFUSION_test.csv
│   │   │   └── ...
│   │   ├── FeatureData
│   │   ├── Models
│   │   │   └── msrvtt-7k_xclip+ts2net-as-teacher_vit32
│   │   │       ├── run0
│   │   │       └── ...
│   │   ├── QuerySet
│   │   │   ├── msrvtt1k-test-query.txt
│   │   │   ├── msrvtt3k-test-query.txt
│   │   │   └── ...
│   │   ├── VideoData
│   │   │   ├── video0.mp4
│   │   │   ├── video1.mp4
│   │   │   └── ...
│   │   └── VideoSet
│   │       ├── msrvtt1k-test.txt
│   │       ├── msrvtt1k-train.txt
│   │       └── ...
│   ├── activitynet
│   ├── didemo
│   ├── msvd
│   └── vatex
└── teacher_checkpoints
    ├── xclip
    │   ├── didemo_xclip_model.bin
    │   ├── msrvtt-7k_xclip_model.bin
    │   └── ...
    ├── ts2net
    └── xpool
```
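A small optional check like the following (not part of the repo) can catch layout mistakes before training; folder names follow the tree above:

```python
# Optional layout check: verify the expected per-dataset folders and the
# teacher checkpoint directories exist before training.
from pathlib import Path

root = Path("data")
dataset = "msrvtt"  # or: activitynet / didemo / msvd / vatex

for folder in ["Annotations", "QuerySet", "VideoData", "VideoSet"]:
    path = root / "datasets" / dataset / folder
    print(f"{path}: {'OK' if path.is_dir() else 'MISSING'}")

for teacher in ["xclip", "ts2net", "xpool"]:
    path = root / "teacher_checkpoints" / teacher
    print(f"{path}: {'OK' if path.is_dir() else 'MISSING'}")
```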
Write the config file before training; we provide a demo config file for each dataset. You can train TeachCLIP on the specified GPUs and dataset with the following command (taking msrvtt-9k as an example):
```bash
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py --config_path configs/msrvtt-9k.yaml
```
Use the following commands to extract video and text features:
```bash
bash do_extract_video_feat.sh $test_collection $videoset $model_name
# e.g. bash do_extract_video_feat.sh msrvtt msrvtt1k-test msrvtt/Models/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0

bash do_extract_text_feat.sh $test_collection $queryset $model_name
# e.g. bash do_extract_text_feat.sh msrvtt msrvtt1k-test-query msrvtt/Models/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0
```
After obtaining the text and video features, the evaluation metrics can be calculated with the following command:
```bash
bash do_eval.sh $test_collection $text_feat_name $video_feat_name $gt_file_name
# e.g. bash do_eval.sh msrvtt msrvtt1k-test-query/msrvtt/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0 msrvtt1k-test/msrvtt/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0 msrvtt1k-gt
```
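For reference, the reported numbers are the standard text-to-video retrieval metrics computed from a text-video similarity matrix. The sketch below illustrates that computation conceptually; it is not the repo's actual `do_eval.sh` implementation and assumes the i-th text and i-th video form a matching pair.

```python
# Conceptual sketch of text-to-video retrieval metrics (R@K, median rank).
# Assumes L2-normalized feature matrices whose i-th rows are a matching pair.
import numpy as np

def retrieval_metrics(text_feats: np.ndarray, video_feats: np.ndarray) -> dict:
    sims = text_feats @ video_feats.T           # cosine similarity matrix (texts x videos)
    order = np.argsort(-sims, axis=1)           # videos ranked per text query
    gt = np.arange(sims.shape[0])[:, None]      # ground-truth video index per query
    ranks = np.argmax(order == gt, axis=1) + 1  # rank of the correct video (1-based)
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
    }
```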
If you find our method useful in your work, please cite:
```bibtex
@inproceedings{teachclip,
  title     = {Holistic Features are almost Sufficient for Text-to-Video Retrieval},
  author    = {Tian, Kaibin and Zhao, Ruixiang and Xin, Zijie and Lan, Bangxiang and Li, Xirong},
  year      = {2024},
  booktitle = {CVPR}
}
```
The implementation of TeachCLIP relies on resources from CLIP4Clip, X-CLIP and XPool. We thank the original authors for open-sourcing their code.
If you encounter any issues when running the code, please feel free to reach us either by creating a new issue on GitHub or by emailing
- Ruixiang Zhao ([email protected])