Mug-STAN

Official PyTorch implementation of the papers "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring" and "Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding".

The original code was built on mmcv1.4. Because all of its data processing pipelines depend on private I/O, the original training code cannot be open-sourced. We have therefore reproduced the results using mmcv2.0.

Pretrained Weights:

Getting Started

Installation

Clone the repository, then create and activate a Python environment with the following commands:

git clone https://github.com/farewellthree/STAN.git
cd STAN
conda create --name stan python=3.10
conda activate stan
bash install.sh
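After installation, it can help to confirm which versions ended up in the environment. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and the package list (`torch`, `mmcv`) is an assumption based on the dependencies named in this README rather than on the contents of install.sh.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust it to whatever install.sh actually installs.
    for name, version in installed_versions(["torch", "mmcv"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated `stan` environment; a package reported as not installed suggests that install.sh did not complete.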

Prepare Datasets

Follow CLIP4Clip to obtain the videos and annotations.

Once the dataset is ready, set its path in each config. For stan-b/32 on MSRVTT, for instance, set the video path at Line 25 of the corresponding config.

Because a dataset may have multiple annotation versions, our code may not be compatible with your annotations. In that case, modify the corresponding dataset class in video_text_dataset.py so that it outputs the paths of all videos along with their corresponding captions.
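As an illustration of the output such a dataset class needs to produce, here is a minimal standalone sketch. It does not reproduce the classes in video_text_dataset.py: the function name `load_video_caption_pairs`, the `.mp4` extension, and the annotation layout (a JSON list of records with `video_id` and `caption` keys) are all assumptions chosen for the example.

```python
import json
from pathlib import Path


def load_video_caption_pairs(annotation_file, video_root, video_ext=".mp4"):
    """Return a list of (video_path, caption) tuples.

    Assumes a hypothetical annotation layout: a JSON list of
    {"video_id": ..., "caption": ...} records.
    """
    with open(annotation_file, encoding="utf-8") as f:
        records = json.load(f)

    root = Path(video_root)
    pairs = []
    for record in records:
        # Build the video path from the assumed "video_id" field.
        video_path = root / f"{record['video_id']}{video_ext}"
        pairs.append((str(video_path), record["caption"]))
    return pairs
```

If your annotations use different keys or a different file format, only the parsing step needs to change, as long as the result is still one video path per caption.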

Training

STAN

To train stan-b/32 on MSRVTT, run

torchrun --nproc_per_node=8 --master_port=20001 tools/train.py configs/exp/stan/stan_msrvtt_b32_hf.py --launcher pytorch

The same command pattern applies to other datasets and model scales: substitute the corresponding config file.
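Since only the config path and the GPU count vary between runs, the command above can be assembled programmatically. The sketch below is not part of the repository; `build_train_command` is a hypothetical helper that uses only the flags shown in this README, with defaults matching the stan-b/32 example.

```python
import shlex


def build_train_command(config, num_gpus=8, master_port=20001):
    """Assemble the torchrun invocation for a given config file."""
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        f"--master_port={master_port}",
        "tools/train.py",
        config,
        "--launcher",
        "pytorch",
    ]


if __name__ == "__main__":
    command = build_train_command("configs/exp/stan/stan_msrvtt_b32_hf.py")
    # Print the command as a single shell-ready string.
    print(shlex.join(command))
```

Returning the command as a list means it can be passed directly to `subprocess.run` without shell quoting. Changing `num_gpus` is an assumption about your hardware; other settings in the config may need adjusting to match.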

Mug-STAN

To train mug-stan-b/32 on MSRVTT, run

torchrun --nproc_per_node=8 --master_port=20001 tools/train.py configs/exp/stan/mugstan_msrvt_b32_hf.py --launcher pytorch

The same command pattern applies to other datasets and model scales: substitute the corresponding config file.

Post-Pretraining

To post-pretrain mug-stan-b/32 on WebVid10M, run

torchrun --nproc_per_node=16 --master_port=20001 tools/train.py configs/exp/stan/mugstan_webvid10m_b32_pretrain.py --launcher pytorch

Citation

If you find the code useful for your research, please consider citing our paper:

@article{liu2023revisiting,
  title={Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring},
  author={Liu, Ruyang and Huang, Jingjia and Li, Ge and Feng, Jiashi and Wu, Xinglong and Li, Thomas H},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
