This is the official source code of TeachCLIP, our CVPR 2024 paper "Holistic Features are almost Sufficient for Text-to-Video Retrieval".
We use Anaconda to set up a deep learning workspace that supports PyTorch. Run the following commands to install all the required packages.
```bash
conda create -n TeachCLIP python==3.8 -y
conda activate TeachCLIP
git clone https://github.com/ruc-aimc-lab/TeachCLIP.git
cd TeachCLIP
pip install -r requirements.txt
```
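Optionally, a quick sanity check (not part of the repo) can confirm that PyTorch and your GPUs are visible inside the new environment before launching distributed training:

```python
# Optional sanity check: confirm PyTorch and CUDA are visible in the
# TeachCLIP environment before launching distributed training.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("Visible GPUs:   ", torch.cuda.device_count())
```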
- We provide annotations of five datasets and checkpoints of three teachers (X-CLIP, TS2-Net and XPool) trained on five datasets on Google Drive. Video captions and data splits are provided in `Annotations` and `VideoSet`.
- For raw videos, you can refer to the guide from CLIP4Clip: Data Preparing. Put the videos into the corresponding `VideoData` folder for each dataset. (It is recommended to use symbolic links; see the sketch after this list.)
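For example, here is a minimal sketch for linking already-downloaded MSR-VTT videos into `VideoData`. The source directory is a placeholder for wherever your raw videos live; a plain `ln -s` works just as well.

```python
# Minimal sketch: symlink raw videos into the VideoData folder expected by
# this repo instead of copying them.
from pathlib import Path

src_dir = Path("/path/to/raw/msrvtt/videos")      # placeholder: your download location
dst_dir = Path("data/datasets/msrvtt/VideoData")  # layout expected by this repo
dst_dir.mkdir(parents=True, exist_ok=True)

for video in src_dir.glob("*.mp4"):
    link = dst_dir / video.name
    if not link.exists():
        link.symlink_to(video.resolve())
```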
Before running the code, please organize the downloaded data in the following format. (The `Models` and `FeatureData` folders will be automatically generated during training and testing, respectively.)
```
data
├── datasets
│   ├── msrvtt
│   │   ├── Annotations
│   │   │   ├── MSRVTT_data.json
│   │   │   ├── MSRVTT_JSFUSION_test.csv
│   │   │   └── ...
│   │   ├── FeatureData
│   │   ├── Models
│   │   │   └── msrvtt-7k_xclip+ts2net-as-teacher_vit32
│   │   │       ├── run0
│   │   │       └── ...
│   │   ├── QuerySet
│   │   │   ├── msrvtt1k-test-query.txt
│   │   │   ├── msrvtt3k-test-query.txt
│   │   │   └── ...
│   │   ├── VideoData
│   │   │   ├── video0.mp4
│   │   │   ├── video1.mp4
│   │   │   └── ...
│   │   └── VideoSet
│   │       ├── msrvtt1k-test.txt
│   │       ├── msrvtt1k-train.txt
│   │       └── ...
│   ├── activitynet
│   ├── didemo
│   ├── msvd
│   └── vatex
└── teacher_checkpoints
    ├── xclip
    │   ├── didemo_xclip_model.bin
    │   ├── msrvtt-7k_xclip_model.bin
    │   └── ...
    ├── ts2net
    └── xpool
```
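A small optional check like the following (not part of the repo) can catch layout mistakes before training; folder names follow the tree above:

```python
# Optional layout check: verify the expected per-dataset folders and the
# teacher checkpoint directories exist before training.
from pathlib import Path

root = Path("data")
dataset = "msrvtt"  # or: activitynet / didemo / msvd / vatex

for folder in ["Annotations", "QuerySet", "VideoData", "VideoSet"]:
    path = root / "datasets" / dataset / folder
    print(f"{path}: {'OK' if path.is_dir() else 'MISSING'}")

for teacher in ["xclip", "ts2net", "xpool"]:
    path = root / "teacher_checkpoints" / teacher
    print(f"{path}: {'OK' if path.is_dir() else 'MISSING'}")
```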
Write the config file before training; we provide a demo config file for each dataset. You can train TeachCLIP on the specified GPUs and dataset with the following command (taking msrvtt-9k as an example):
```bash
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py --config_path configs/msrvtt-9k.yaml
```
Use the following commands to extract video and text features:
```bash
bash do_extract_video_feat.sh $test_collection $videoset $model_name
# e.g. bash do_extract_video_feat.sh msrvtt msrvtt1k-test msrvtt/Models/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0

bash do_extract_text_feat.sh $test_collection $queryset $model_name
# e.g. bash do_extract_text_feat.sh msrvtt msrvtt1k-test-query msrvtt/Models/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0
```
After obtaining the text and video features, the evaluation metrics can be calculated with the following command:
```bash
bash do_eval.sh $test_collection $text_feat_name $video_feat_name $gt_file_name
# e.g. bash do_eval.sh msrvtt msrvtt1k-test-query/msrvtt/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0 msrvtt1k-test/msrvtt/msrvtt-9k_xclip+ts2net-as-teacher_vit32/run0 msrvtt1k-gt
```
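For reference, the reported numbers are the standard text-to-video retrieval metrics computed from a text-video similarity matrix. The sketch below illustrates that computation conceptually; it is not the repo's actual `do_eval.sh` implementation and assumes the i-th text and i-th video form a matching pair.

```python
# Conceptual sketch of text-to-video retrieval metrics (R@K, median rank).
# Assumes L2-normalized feature matrices whose i-th rows are a matching pair.
import numpy as np

def retrieval_metrics(text_feats: np.ndarray, video_feats: np.ndarray) -> dict:
    sims = text_feats @ video_feats.T           # cosine similarity matrix (texts x videos)
    order = np.argsort(-sims, axis=1)           # videos ranked per text query
    gt = np.arange(sims.shape[0])[:, None]      # ground-truth video index per query
    ranks = np.argmax(order == gt, axis=1) + 1  # rank of the correct video (1-based)
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
    }
```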
If you find our method useful in your work, please cite:
```bibtex
@inproceedings{teachclip,
  title     = {Holistic Features are almost Sufficient for Text-to-Video Retrieval},
  author    = {Tian, Kaibin and Zhao, Ruixiang and Xin, Zijie and Lan, Bangxiang and Li, Xirong},
  year      = {2024},
  booktitle = {CVPR}
}
```
The implementation of TeachCLIP relies on resources from CLIP4Clip, X-CLIP and XPool. We thank the original authors for open-sourcing their code.
If you encounter any issues when running the code, please feel free to reach us either by creating a new issue on GitHub or by emailing
- Ruixiang Zhao ([email protected])