Official PyTorch implementation of STTS, from the following paper:
Efficient Video Transformers with Spatial-Temporal Token Selection, ECCV 2022.
Junke Wang\*, Xitong Yang\*, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang.
Fudan University, University of Maryland, BirenTech Research
We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
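At a high level, token selection scores every token and keeps only the top-scoring ones before they reach later transformer blocks. The sketch below is purely illustrative and is not the repository's implementation: it uses a toy squared-L2-norm salience score and a hard top-k, whereas STTS learns a lightweight scorer and selects tokens with a differentiable mechanism.

```python
# Illustrative sketch only: a simplified score-and-select step in the
# spirit of STTS. A "token" here is a plain feature vector (list of
# floats); the scorer is a toy squared-L2-norm salience score.

def score_tokens(tokens):
    """Assign each token a salience score (toy: squared L2 norm)."""
    return [sum(x * x for x in tok) for tok in tokens]

def select_top_k(tokens, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    scores = score_tokens(tokens)
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore temporal/spatial order after selection
    return [tokens[i] for i in keep], keep

tokens = [[0.1, 0.2], [2.0, 1.0], [0.0, 0.0], [1.5, 1.5]]
selected, kept_idx = select_top_k(tokens, k=2)
print(kept_idx)  # indices of the two most salient tokens -> [1, 3]
```

Dropping low-scoring tokens this way is what reduces FLOPs roughly in proportion to the selection ratios (the `0.4`–`0.9` values in the model names below).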
## MViT with STTS on Kinetics-400
| name | acc@1 (%) | FLOPs (G) | model |
| --- | --- | --- | --- |
| MViT-T0-0.9-S4-0.9 | 78.1 | 56.4 | model |
| MViT-T0-0.8-S4-0.9 | 77.9 | 47.2 | model |
| MViT-T0-0.6-S4-0.9 | 77.5 | 38.1 | model |
| MViT-T0-0.5-S4-0.7 | 76.6 | 23.3 | model |
| MViT-T0-0.4-S4-0.6 | 75.6 | 12.1 | model |
## VideoSwin with STTS on Kinetics-400
| name | acc@1 (%) | FLOPs (G) | model |
| --- | --- | --- | --- |
| VideoSwin-T0-0.9 | 81.9 | 252.5 | model |
| VideoSwin-T0-0.8 | 81.6 | 223.4 | model |
| VideoSwin-T0-0.6 | 81.4 | 181.4 | model |
| VideoSwin-T0-0.5 | 81.1 | 121.6 | model |
| VideoSwin-T0-0.4 | 80.7 | 91.4 | model |
Please check MViT and VideoSwin for installation instructions and data preparation.
For both training and evaluation with MViT as the backbone, you can use:

```shell
cd MViT
python tools/run_net.py --cfg path_to_your_config
```

For example, to evaluate MViT-T0-0.6-S4-0.9, run:

```shell
python tools/run_net.py --cfg configs/Kinetics/t0_0.6_s4_0.9.yaml
```
For training with VideoSwin as the backbone, you can use:

```shell
cd VideoSwin
bash tools/dist_train.sh path_to_your_config $NUM_GPUS --checkpoint path_to_your_checkpoint --validate --test-last
```

while for evaluation, you can use:

```shell
bash tools/dist_test.sh path_to_your_config path_to_your_checkpoint $NUM_GPUS --eval top_k_accuracy
```
For example, to evaluate VideoSwin-T0-0.9 on a single node with 8 GPUs, run:

```shell
cd VideoSwin
bash tools/dist_test.sh configs/Kinetics/t0_0.875.py ./checkpoints/t0_0.875.pth 8 --eval top_k_accuracy
```
This project is released under the MIT license. Please see the LICENSE file for more information.
If you find this repository helpful, please consider citing:
```bibtex
@inproceedings{wang2021efficient,
  title={Efficient video transformers with spatial-temporal token selection},
  author={Wang, Junke and Yang, Xitong and Li, Hengduo and Liu, Li and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={ECCV},
  year={2022}
}
```