A curated list of up-to-date papers on token + {reduction, pruning, merging, compression, sparsification} in transformers, large language models, large vision-language models, diffusion models, etc.
- Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
- Contextual Reinforcement in Multimodal Token Compression for Large Language Models
- Dynamic Token Reduction during Generation for Vision Language Models
- LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
- AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture
- VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification
- Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
- LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
- TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
- FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
- What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
- FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
- Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
- Token Pruning for Caching Better: 9× Acceleration on Stable Diffusion for Free
- [ICASSP 2025] Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
- [AAAI 2025] Training-Free and Hardware-Friendly Acceleration for Diffusion Models via Similarity-based Token Pruning
- [TPAMI 2024] Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
- [AAAI 2025] ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
- Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
- [AAAI 2025] ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models
- Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
- FastVLM: Efficient Vision Encoding for Vision Language Models
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
- Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training
- SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
- AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing
- OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference
- [AAAI 2025] Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
- Memory Efficient Matting with Adaptive Token Routing
- [NeurIPS 2024] Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers
- FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality
- SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
- B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
- Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
- LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
- TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
- RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models
- SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
- Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
- iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
- p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
- A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
- 3D Representation in 512-Byte: Variational Tokenizer is the Key for Autoregressive 3D Generation
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- Negative Token Merging: Image-based Adversarial Feature Guidance
- Token Cropr: Faster ViTs for Quite a Few Tasks
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- Training Noise Token Pruning
- Efficient Multi-modal Large Language Models via Visual Token Grouping
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
- Attamba: Attending To Multi-Token States
- [NeurIPSW 2024] ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- Importance-based Token Merging for Diffusion Models
- Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
- freePruner: A Training-free Approach for Large Multimodal Model Acceleration
- Efficient Online Inference of Vision Transformers by Training-Free Tokenization
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
- LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
- FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification
- Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
- FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
- FoPru: Focal Pruning for Efficient Large Vision-Language Models
- Principles of Visual Tokens for Efficient Video Understanding
- LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
- Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
- [NeurIPS 2024] Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
- [NeurIPS 2024 Spotlight] Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
- Inference Optimal VLMs Need Only One Visual Token but Larger Models
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
- [NeurIPS 2024] EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching
- [NeurIPS 2024] Video Token Merging for Long-form Video Understanding
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
- Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
- xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
- CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation
- Is Less More? Exploring Token Condensation as Training-free Adaptation for CLIP
- [EMNLP 2024] Rethinking Token Reduction for State Space Models
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
- Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
- VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
- big.LITTLE Vision Transformer for Efficient Visual Recognition
- [NeurIPSW 2024] Token Pruning using a Lightweight Background Aware Vision Transformer
- ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
- PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
- Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
- Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
- [EMNLP 2024 Findings] From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression
- [ICLR 2025] Dynamic Diffusion Transformer
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
- [EMNLP 2024] FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
- Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
- Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads
- VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
- [NeurIPS 2024] Exploring Token Pruning in Vision State Space Models
- Token Caching for Diffusion Transformer Acceleration
- Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
- [WACV 2025] Patch Ranking: Token Pruning as Ranking Prediction for Efficient CLIP
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
- [ECCV 2024] Agglomerative Token Clustering
- Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving
- [COLING 2025] Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
- CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
- [AAAI 2025] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
- [ECCVW 2024] Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion
- TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
- [WACV 2025] VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
- Enhancing Long Video Understanding via Hierarchical Event-Based Memory
- Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture
- [AAAI 2025] Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
- Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based on Image Text Interaction
- [ICLR 2025] TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
- [ECCV 2024] Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
- Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
- TReX: Reusing Vision Transformer's Attention for Efficient Xbar-based Computing
- AT-SNN: Adaptive Tokens for Vision Transformer on Spiking Neural Network
- Practical token pruning for foundation models in few-shot conversational virtual assistant systems
- [AAAI 2025] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
- Dynamic and Compressive Adaptation of Transformers From Images to Videos
- [ECCV 2024] Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
- [ACMMM 2024] ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack
- A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
- Exploring The Neural Burden In Pruned Models: An Insight Inspired By Neuroscience
- [NeurIPS 2024] Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
- Efficient Visual Transformer by Learnable Token Merging
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
- Pose-guided Multi-task Video Transformer for Driver Action Recognition
- [ECCV 2024] LookupViT: Compressing Visual Information to a Limited Number of Tokens
- [TPAMI 2024] TCFormer: Visual Recognition via Token Clustering Transformer
- [ECCV 2024] Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
- [ECCV 2024] GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
- [Interspeech 2024] LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
- [ICMLW 2024] Characterizing Prompt Compression Methods for Long Context Inference
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
- Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
- PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
- ALPINE: An Adaptive Language-Agnostic Pruning Method for Language Models for Code
- TokenPacker: Efficient Visual Projector for Multimodal LLM
- [ECCV 2024] LPViT: Low-Power Semi-structured Pruning for Vision Transformers
- [ACL 2024 Findings] Concise and Precise Context Compression for Tool-Using Language Models
- DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models
- [ICASSP 2023] Papez: Resource-Efficient Speech Separation with Auditory Working Memory
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression
- [AAAI 2025] DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
- [CVPR 2024] ScanFormer: Referring Expression Comprehension by Iteratively Scanning
- [ICLR 2025] Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models
- [EMNLP 2024] Bridging Local Details and Global Context in Text-Attributed Graphs
- [EMNLP 2024] Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
- Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities
- [CVPR 2024] ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
- SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
- [NeurIPS 2024] COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
- [CVPRW 2024] ToSA: Token Selective Attention for Efficient Vision Transformers
- [Interspeech 2024] FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
- An Image is Worth 32 Tokens for Reconstruction and Generation
- [ICML 2024] LoCoCo: Dropping In Convolutions for Long Context Compression
- REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning
- Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study
- DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
- [ICML 2024] MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization
- [EMNLP 2023 Findings] Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
- DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
- [NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
- Efficient Time Series Processing for Transformers and State-Space Models through Token Merging
- Matryoshka Multimodal Models
- [NeurIPS 2024] Accelerating Transformers with Spectrum-Preserving Token Merging
- Efficient Point Transformer with Dynamic Token Aggregating for LiDAR Point Cloud Processing
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
- [MIPR 2024] Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
- [CVPRW 2024] Block Selective Reprogramming for On-device Training of Vision Transformers
- [IJCAI 2024] LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
- [CVPR 2024] Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- [EMNLP 2024] TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
- [ICLR 2024] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs
- Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
- CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
- [CVPR 2024] HRVDA: High-Resolution Visual Document Assistant
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
- [CVPR 2024] MLP Can Be A Good Transformer Learner
- [CVPR 2024] Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
- Training LLMs over Neurally Compressed Text
- [ECCV 2024] LongVLM: Efficient Long Video Understanding via Large Language Models
- [CVPR 2024] Learning to Rank Patches for Unbiased Image Redundancy Reduction
- [CVPR 2024] A General and Efficient Training for Transformer via Token Expansion
- [CVPR 2024] Dense Vision Transformer Compression with Few Samples
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
- [ECCV 2024] PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
- FIT-RAG: Black-Box RAG with Factual Information and Token Reduction
- [FCCM 2024] Accelerating ViT Inference on FPGA through Static and Dynamic Pruning
- [CVPR 2024] vid-TLDR: Training Free Token Merging for Light-weight Video Transformer
- [ACM TOIS 2024] An Analysis on Matching Mechanisms and Token Pruning for Late-interaction Models
- HCPM: Hierarchical Candidates Pruning for Efficient Detector-Free Matching
- [NeurIPS 2024] Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation
- [CVPR 2024] Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
- [ECCV 2024] PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
- Learnable Community-Aware Transformer for Brain Connectome Analysis with Token Clustering
- [ECCV 2024 Oral] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference (see the attention-score pruning sketch after this list)
- [EMNLP 2024 Findings] Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
- [ECCV 2024] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- [CVPR 2024] MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
- Motion Guided Token Compression for Efficient Masked Video Modeling
- [IJCAI 2024] ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
- Rethinking Optimization and Architecture for Tiny Language Models
- DeSparsify: Adversarial Attack Against Token Sparsification Mechanisms in Vision Transformers
- A Deep Hierarchical Feature Sparse Framework for Occluded Person Re-Identification
- [ECCV 2024] Object-Centric Diffusion for Efficient Video Editing
- [INFOCOM 2024] OTAS: An Elastic Transformer Serving System via Token Adaptation
- HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition
- [WACV 2024] TPC-ViT: Token Propagation Controller for Efficient Vision Transformer
- [ECCV 2024] Removing Rows and Columns of Tokens in Vision Transformer Enables Faster Dense Prediction without Retraining
- [ACL 2024] Accelerating Transformers by Sparsifying Information Flows
- [ACL 2024 Findings] What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization
- [EMNLP 2024 Findings] Vanessa: Visual Connotation and Aesthetic Attributes Understanding Network for Multimodal Aspect-based Sentiment Analysis
- [NeurIPS 2024] MG-ViT: A Multi-Granularity Method for Compact and Efficient Vision Transformers
- LVP: Language-guided Visual Projector for Efficient Multimodal LLM
- [ECCV 2024] IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models
- [AAAI 2024] TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities
- Connectivity-based Token Condensation for Efficient Vision Transformer
- [CVPRW 2024] Efficient Transformer Adaptation with Soft Token Merging
- CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models
- [ICLRW 2024] Energy Minimizing-based Token Merging for Accelerating Transformers
- RanMerFormer: Randomized Vision Transformer with Token Merging for Brain Tumor Classification
- [NeurIPSW 2024] M2M-TAG: Training-Free Many-to-Many Token Aggregation for Vision Transformer Acceleration
- [ECCV 2024] Efficient Vision Transformers with Partial Attention
- [CVPR 2023] VidToMe: Video Token Merging for Zero-Shot Video Editing
- [ECCV 2024] Agent Attention: On the Integration of Softmax and Linear Attention
- [CVPR 2024] Honeybee: Locality-enhanced Projector for Multimodal LLM
- [AAAI 2024] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
- [WACV 2024] Token Fusion: Bridging the Gap between Token Pruning and Token Merging
- [ECCV 2024] LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- [CVPR 2024] Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
- [WACV 2025] TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration
- [CVPR 2024] Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
- [ICASSP 2023] SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer
- [CVPR 2024 Highlight] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- [WACV 2024 Oral] GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation
- [NeurIPS 2023] AiluRus: A Scalable ViT Framework for Dense Prediction
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
- Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning
- [EMNLP 2023 Findings] Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules
- [EMNLP 2023 Findings] TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
- NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding
- SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning
- [AJCAI 2023] No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling
- [ACL 2024] Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
- PPT: Token Pruning and Pooling for Efficient Vision Transformers
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
- [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- [ICCV 2023] Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
- [ICCV 2023] SG-Former: Self-guided Transformer with Evolving Token Reallocation
- [ICCV 2023] Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation
- [ICCVW 2023] Which Tokens to Use? Investigating Token Reduction in Vision Transformers
- [ICCV 2023] Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
- DiT: Efficient Vision Transformers with Dynamic Token Routing
- [WACV 2023] Dynamic Token-Pass Transformers for Semantic Segmentation
- [ICCV 2023] Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- [TMLR 2023] Learned Thresholds Token Merging and Pruning for Vision Transformers
- [ICCVW 2023] MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
- [EMNLP 2024] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
- [Interspeech 2023] Accelerating Transducers through Adjacent Token Merging
- [KDD 2023] Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference
- Vision Transformer with Attention Map Hallucination and FFN Compaction
- [WACV 2023] Revisiting Token Pruning for Object Detection and Instance Segmentation
- Multi-Scale And Token Mergence: Make Your ViT More Efficient
- [CVPR 2023] Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers
- [ICCV 2023] DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
- [ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
- [ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
- [CVPR 2024] Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
- Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification
- [LREC 2024] SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
- Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model
- [IJCAI 2023] Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
- [IJCAI 2023] TG-VQA: Ternary Game of Video Question Answering
- [CVPR 2023] Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
- [CVPR 2023] SViTT: Temporal Learning of Sparse Video-Text Transformers
- [ICCV 2023] Efficient Video Action Detection with Token Dropout and Context Refinement
- Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation
- [NeurIPS 2023] Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
- [TMM 2023] Attention Map Guided Transformer Pruning for Edge Device
- [CVPR 2023] SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformers
- [CVPRW 2023] Token Merging for Fast Stable Diffusion
- [CVPR 2023] Selective Structured State-Spaces for Long-Form Video Understanding
- [CVPR 2023 Highlight] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
- [CVPR 2023] Making Vision Transformers Efficient from A Token Sparsification View
- [IPMI 2023] Token Sparsification for Faster Medical Image Segmentation
- [ICMLW 2024] Training-Free Visual Token Compression via Delayed Spatial Merging
- [ECCV 2024] The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
- [ICLR 2023] A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
- [ICML 2023] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Image Compression Is an Effective Objective for Visual Representation Learning
- [EMNLP 2023 Findings] Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning
- [EMNLP 2023] Leap-of-Thought: Accelerating Transformers via Dynamic Token Routing
- [ICCV 2023] Building Vision Transformers with Hierarchy Aware Feature Aggregation
- [CVPR 2023] Dynamic Inference with Grounding Based Vision and Language Models
- [ICLR 2023] Sparse Token Transformer With Attention Back Tracking
- [ICLR 2023] Progressively Compressed Auto-Encoder for Self-supervised Representation Learning
- SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
- [CVPR 2023] Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
- [ICCV 2023] TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
- [HPCA 2023] HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers
- [ICCV 2023] FcaFormer: Forward Cross Attention in Hybrid Vision Transformer
- [ICASSP 2023] ProContEXT: Exploring Progressive Context Transformer for Tracking
- [ICLR 2023 Oral] Token Merging: Your ViT But Faster (see the bipartite soft matching sketch after this list)
- SaiT: Sparse Vision Transformers through Adaptive Token Pruning
- [IJCAI 2023] Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention
- [ECCV 2022] PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation
- [TPAMI 2022] Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
- [ACL 2022] Transkimmer: Transformer Learns to Layer-wise Skim
- [CVPR 2022 Oral] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
- ITTR: Unpaired Image-to-Image Translation with Transformers
- [AAAI 2023] CF-ViT: A General Coarse-to-Fine Method for Vision Transformer
- [Neural Networks 2022] Multi-Tailed Vision Transformer for Efficient Inference
- [ICLR 2022 Spotlight] Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
- [CVPR 2022] Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space
- [COLING 2022] Token and Head Adaptive Transformers for Efficient Natural Language Processing
- HFSP: A Hardware-friendly Soft Pruning Framework for Vision Transformers
- [ECCV 2022] SPViT: Enabling Faster Vision Transformers via Latency-aware Soft Token Pruning
- [CVPR 2022 Oral] AdaViT: Adaptive Tokens for Efficient Vision Transformer
- A Study on Token Pruning for ColBERT
- [ECCV 2022 Oral] Adaptive Token Sampling For Efficient Vision Transformers
- [ECCV 2022] Self-slimmed Vision Transformer
- [ECCV 2022] Efficient Video Transformers with Spatial-Temporal Token Selection
- Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning
- [WACV 2023] Token Pooling in Vision Transformers
- [AAAI 2022] Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
- [KDD 2022] Learned Token Pruning for Transformers
- [NeurIPS 2021] IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers
- [NeurIPS 2021] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
- [NeurIPS 2021] Chasing Sparsity in Vision Transformers: An End-to-End Exploration
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- [CVPR 2022] Patch Slimming for Efficient Vision Transformers
- [NeurIPS 2021] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
- [AAAI 2022] Less is More: Pay Less Attention in Vision Transformers
- [NAACL 2021] TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference
- [ICML 2021] Training data-efficient image transformers & distillation through attention
- [HPCA 2021] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
- [ICML 2020] PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
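Many of the merging entries above build on the bipartite soft matching introduced in Token Merging: Your ViT But Faster (ICLR 2023). The sketch below is a minimal, illustrative re-implementation of that matching step only, not the authors' code: the official version matches on attention keys, protects the [CLS] token, and tracks merged-token sizes for proportional attention, all of which are omitted here.

```python
# A minimal sketch of ToMe-style bipartite soft matching (assumptions noted above).
import torch
import torch.nn.functional as F


def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs in x of shape (B, N, C)."""
    # Split tokens alternately into sets A (even indices) and B (odd indices).
    # Note: the real method protects the [CLS] token; this sketch does not.
    a, b = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity between every A token and every B token: (B, Na, Nb).
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)

    # Best match in B for each A token; merge the r highest-scoring pairs.
    node_max, node_idx = scores.max(dim=-1)            # (B, Na)
    order = node_max.argsort(dim=-1, descending=True)  # (B, Na)
    merged_idx, kept_idx = order[:, :r], order[:, r:]

    c = x.shape[-1]
    a_kept = a.gather(1, kept_idx.unsqueeze(-1).expand(-1, -1, c))
    a_merged = a.gather(1, merged_idx.unsqueeze(-1).expand(-1, -1, c))
    dst_idx = node_idx.gather(1, merged_idx)           # (B, r) targets in B

    # Average each merged A token into its destination B token.
    b = b.scatter_reduce(1, dst_idx.unsqueeze(-1).expand(-1, -1, c),
                         a_merged, reduce="mean", include_self=True)
    return torch.cat([a_kept, b], dim=1)               # (B, N - r, C)


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)                  # ViT-B/16-sized input
    print(bipartite_soft_matching(tokens, r=16).shape)  # torch.Size([2, 181, 768])
```

In the paper this step is applied between the attention and MLP blocks of every layer, so reducing by r tokens per layer compounds across depth without any retraining.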
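On the pruning side, a recurring recipe across entries such as FastV-style plug-and-play acceleration and DynamicViT-style sparsification is to score each token by the attention it receives and keep only the top-scoring ones. The sketch below is a generic, hedged illustration of that criterion, not any specific paper's method: the function name and the keep_ratio parameter are placeholders, and real methods differ in which layers, heads, and attention rows they use and in how ratios are scheduled.

```python
# A generic sketch of attention-score-based token pruning (assumptions noted above).
import torch


def prune_by_attention(x: torch.Tensor, attn: torch.Tensor,
                       keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the tokens that receive the most attention.

    x:    (B, N, C) token features from some layer.
    attn: (B, H, N, N) post-softmax attention weights of that layer.
    """
    # Importance of token j = attention it receives, averaged over
    # heads and query positions: (B, N).
    importance = attn.mean(dim=1).mean(dim=1)

    n_keep = max(1, int(x.shape[1] * keep_ratio))
    keep_idx = importance.topk(n_keep, dim=-1).indices.sort(dim=-1).values

    # Gather the surviving tokens, preserving their original order.
    return x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))


if __name__ == "__main__":
    b, h, n, c = 2, 12, 197, 768
    x = torch.randn(b, n, c)
    attn = torch.softmax(torch.randn(b, h, n, n), dim=-1)
    print(prune_by_attention(x, attn, keep_ratio=0.5).shape)  # (2, 98, 768)
```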
Contributions are welcome! If you find relevant papers or have suggestions, feel free to:
- Submit a pull request
- Open an issue
- Contact me at [email protected]
This work is licensed under the CC0 1.0 Universal License. To the extent possible under law, Sangmin Woo has waived all copyright and related or neighboring rights to this work.