No. | Model Name | Title | Links | Pub. | Organization | Release Time |
---|---|---|---|---|---|---|
1 | TimeSformer | Is Space-Time Attention All You Need for Video Understanding? | paper code | arXiv | Facebook AI | 24 Feb 2021 |
2 | Video Transformer | Video Transformer Network | paper | arXiv | Theator | 1 Feb 2021 |
3 | ViViT | ViViT: A Video Vision Transformer | paper | arXiv | Google AI | 29 Mar 2021 |
4 | VideoGPT | VideoGPT: Video Generation using VQ-VAE and Transformers | paper code | arXiv | UC Berkeley | 20 Apr 2021 |
5 | VIMPAC | VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | paper code | arXiv | UNC | 21 Jun 2021 |
6 | - | Self-supervised Video Representation Learning by Context and Motion Decoupling | paper | CVPR 2021 | Alibaba | 2 Apr 2021 |
7 | VideoLightFormer | VideoLightFormer: Lightweight Action Recognition using Transformers | paper | arXiv | University of Sheffield | 1 Jul 2021 |
8 | Video Swin Transformer | Video Swin Transformer | paper code | arXiv | MSRA | 24 Jun 2021 |
9 | ST Swin | Long-Short Temporal Contrastive Learning of Video Transformers | paper | arXiv | Facebook AI | 17 Jun 2021 |
10 | X-ViT | Space-time Mixing Attention for Video Transformer | paper | arXiv | Samsung AI Cambridge | 11 Jun 2021 |
11 | OCVT | Generative Video Transformer: Can Objects be the Words? | paper | ICML 2021 | Rutgers University | 20 Jul 2021 |
12 | - | An Image is Worth 16x16 Words, What is a Video Worth? | paper code | arXiv | Alibaba | 27 May 2021 |
13 | SCT | Shifted Chunk Transformer for Spatio-Temporal Representational Learning | paper | arXiv | Kuaishou Technology | 26 Aug 2021 |
14 | - | Evaluating Transformers for Lightweight Action Recognition | paper | arXiv | University of Sheffield | 18 Nov 2021 |
15 | DualFormer | DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition | paper | arXiv | Sea AI Lab | 9 Dec 2021 |
16 | BEVT | BEVT: BERT Pretraining of Video Transformers | paper | arXiv | Shanghai Key Lab of Intelligent Information Processing | 2 Dec 2021 |
17 | - | Efficient Video Transformers with Spatial-Temporal Token Selection | paper | arXiv | Shanghai Key Lab of Intelligent Information Processing | 23 Nov 2021 |
18 | - | Lite Vision Transformer with Enhanced Self-Attention | paper code | arXiv | Johns Hopkins University | 20 Dec 2021 |
19 | MViT | Multiscale Vision Transformers | paper code | ICCV 2021 | - | 22 Apr 2021 |
20 | UniFormer | UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning | paper code | arXiv | Chinese Academy of Sciences | 12 Jan 2022 |
21 | MaskFeat | Masked Feature Prediction for Self-Supervised Visual Pre-Training | paper | arXiv | Facebook AI | 16 Dec 2021 |
22 | MTV | Multiview Transformers for Video Recognition | paper | arXiv | - | 20 Jan 2022 |
23 | MeMViT | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | paper | arXiv | Facebook AI Research | 20 Jan 2022 |