Official PyTorch implementation of STTS, from the following paper:
Efficient Video Transformers with Spatial-Temporal Token Selection, ECCV 2022.
Junke Wang\*, Xitong Yang\*, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang.
Fudan University, University of Maryland, BirenTech Research
We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
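At a high level, token selection scores every token and keeps only the top-scoring ones before they reach later transformer blocks. The sketch below is purely illustrative and is not the repository's implementation: it uses a toy squared-L2-norm salience score and a hard top-k, whereas STTS learns a lightweight scorer and selects tokens with a differentiable mechanism.

```python
# Illustrative sketch only: a simplified score-and-select step in the
# spirit of STTS. A "token" here is a plain feature vector (list of
# floats); the scorer is a toy squared-L2-norm salience score.

def score_tokens(tokens):
    """Assign each token a salience score (toy: squared L2 norm)."""
    return [sum(x * x for x in tok) for tok in tokens]

def select_top_k(tokens, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    scores = score_tokens(tokens)
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore temporal/spatial order after selection
    return [tokens[i] for i in keep], keep

tokens = [[0.1, 0.2], [2.0, 1.0], [0.0, 0.0], [1.5, 1.5]]
selected, kept_idx = select_top_k(tokens, k=2)
print(kept_idx)  # indices of the two most salient tokens -> [1, 3]
```

Dropping low-scoring tokens this way is what reduces FLOPs roughly in proportion to the selection ratios (the `0.4`–`0.9` values in the model names below).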
## MViT with STTS on Kinetics-400
| name | acc@1 (%) | FLOPs (G) | model |
| --- | --- | --- | --- |
| MViT-T0-0.9-S4-0.9 | 78.1 | 56.4 | model |
| MViT-T0-0.8-S4-0.9 | 77.9 | 47.2 | model |
| MViT-T0-0.6-S4-0.9 | 77.5 | 38.1 | model |
| MViT-T0-0.5-S4-0.7 | 76.6 | 23.3 | model |
| MViT-T0-0.4-S4-0.6 | 75.6 | 12.1 | model |
## VideoSwin with STTS on Kinetics-400
| name | acc@1 (%) | FLOPs (G) | model |
| --- | --- | --- | --- |
| VideoSwin-T0-0.9 | 81.9 | 252.5 | model |
| VideoSwin-T0-0.8 | 81.6 | 223.4 | model |
| VideoSwin-T0-0.6 | 81.4 | 181.4 | model |
| VideoSwin-T0-0.5 | 81.1 | 121.6 | model |
| VideoSwin-T0-0.4 | 80.7 | 91.4 | model |
Please check MViT and VideoSwin for installation instructions and data preparation.
For both training and evaluation with MViT as the backbone, you can use:

```shell
cd MViT
python tools/run_net.py --cfg path_to_your_config
```

For example, to evaluate MViT-T0-0.6-S4-0.9, run:

```shell
python tools/run_net.py --cfg configs/Kinetics/t0_0.6_s4_0.9.yaml
```
For training with VideoSwin as the backbone, you can use:

```shell
cd VideoSwin
bash tools/dist_train.sh path_to_your_config $NUM_GPUS --checkpoint path_to_your_checkpoint --validate --test-last
```

while for evaluation, you can use:

```shell
bash tools/dist_test.sh path_to_your_config path_to_your_checkpoint $NUM_GPUS --eval top_k_accuracy
```
For example, to evaluate VideoSwin-T0-0.9 on a single node with 8 GPUs, run:

```shell
cd VideoSwin
bash tools/dist_test.sh configs/Kinetics/t0_0.875.py ./checkpoints/t0_0.875.pth 8 --eval top_k_accuracy
```
This project is released under the MIT license. Please see the LICENSE file for more information.
If you find this repository helpful, please consider citing:
```bibtex
@inproceedings{wang2021efficient,
  title={Efficient video transformers with spatial-temporal token selection},
  author={Wang, Junke and Yang, Xitong and Li, Hengduo and Liu, Li and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={ECCV},
  year={2022}
}
```