Transformer-in-Vision

A list of recent Transformer-based computer vision (CV) works. Comments and contributions are welcome!

Updating.

Resource

Attention Is All You Need, [Paper]
OpenAI CLIP [Page], [Paper], [Code]
OpenAI DALL·E [Page]
huggingface/transformers
Kyubyong/transformer, TF
jadore801120/attention-is-all-you-need-pytorch, Torch
krasserm/fairseq-image-captioning
PyTorch Transformers Tutorials
ictnlp/awesome-transformer
basicv8vc/awesome-transformer
dk-liang/Awesome-Visual-Transformer
yuewang-cuhk/awesome-vision-language-pretraining-papers

Survey:

(arXiv 2020.9) Efficient Transformers: A Survey, PDF
(arXiv 2021.01) Transformers in Vision: A Survey, PDF

Recent Papers

(ICLR'21) UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers, [Paper], [Code]
(ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]
(ICLR'21) LambdaNetworks: Modeling Long-Range Interactions Without Attention, [Paper], [Code]
(ICLR'21) Support-set Bottlenecks for Video-text Representation Learning, [Paper]
(ICLR'21) Colorization Transformer, [Paper], [Code]
(ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]
(ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]
(ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]
(CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]
(CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]
(CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]
(ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]
(EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
(arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]
(arXiv 2021.02) End-to-End Audio-Visual Speech Recognition with Conformers, [Paper]
(arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper]
(arXiv 2021.02) Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]
(arXiv 2021.02) Video Transformer Network, [Paper]
(arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
(arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]
(arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
(arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]
(arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]
(arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]
(arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]
(arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]
(arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]
(arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]
(arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]
(arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]
(arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]
(arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]
(arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]
(arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
(arXiv 2021.01) Addressing Some Limitations of Transformers with Feedback Memory, [Paper]
(arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]
(arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]
(arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]
(arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]
(arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]
(arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]
(arXiv 2020.12) DETR for Pedestrian Detection, [Paper]
(arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]
(arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
(arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
(arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]
(arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]
(arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]
(arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
(arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper]
(arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]
(arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
(arXiv 2020.12) Point Transformer, [Paper]
(arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
(arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]
(arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]
(arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]
(arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]
(arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]
(arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]
(arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]
(arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper](https://arxiv.org/pdf/2011.14027)
(arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]
(arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]
(arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]
(arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]
(arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]

TODO

V-L representation learning
