🧩 Awesome Token Reduction

A curated list of up-to-date papers on token + {reduction, pruning, merging, compression, sparsification} in transformers, large language models, large vision-language models, diffusion models, etc.
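As a quick orientation, the sketch below illustrates the idea shared by many of the pruning papers listed here: rank tokens by how much attention they receive and keep only the top fraction. This is a minimal, self-contained NumPy example under assumed names (`prune_tokens`, `keep_ratio` are illustrative), not the method of any specific paper in this list.

```python
# Minimal sketch of attention-score-based token pruning (illustrative only).
import numpy as np

def prune_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """tokens: (num_tokens, dim); token 0 acts as a [CLS]-like query."""
    cls, rest = tokens[0], tokens[1:]
    # Scaled dot-product scores of the [CLS] query against every other token.
    scores = rest @ cls / np.sqrt(tokens.shape[1])
    # Softmax for interpretability (the ranking alone would suffice).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Keep the most-attended tokens; dropped tokens could instead be merged.
    k = max(1, int(len(rest) * keep_ratio))
    keep_idx = np.sort(np.argsort(weights)[-k:])   # preserve original order
    return np.concatenate([tokens[:1], rest[keep_idx]], axis=0)

if __name__ == "__main__":
    x = np.random.randn(197, 64)                   # ViT-style: [CLS] + 196 patches
    print(prune_tokens(x, keep_ratio=0.25).shape)  # -> (50, 64)
```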

2025

  • Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
  • Contextual Reinforcement in Multimodal Token Compression for Large Language Models
  • Dynamic Token Reduction during Generation for Vision Language Models
  • LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
  • InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
  • AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture
  • VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification
  • Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
  • LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
  • TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
  • LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  • FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
  • What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
  • FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
  • Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model
  • VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
  • Token Pruning for Caching Better: 9× Acceleration on Stable Diffusion for Free
  • [ICASSP 2025] Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
  • [AAAI 2025] Training-Free and Hardware-Friendly Acceleration for Diffusion Models via Similarity-based Token Pruning

2024

  • [TPAMI 2024] Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
  • ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
  • [AAAI 2025] ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
  • Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
  • [AAAI 2025] ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
  • PruneVid: Visual Token Pruning for Efficient Video Large Language Models
  • Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
  • FastVLM: Efficient Vision Encoding for Vision Language Models
  • Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
  • Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training
  • SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
  • AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
  • FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing
  • OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference
  • [AAAI 2025] Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
  • Memory Efficient Matting with Adaptive Token Routing
  • [NeurIPS 2024] Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers
  • FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality
  • SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
  • B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
  • PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
  • Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
  • Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
  • LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
  • TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
  • RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models
  • SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
  • Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
  • iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
  • [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
  • VisionZip: Longer is Better but Not Necessary in Vision Language Models
  • p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
  • A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
  • AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
  • 3D Representation in 512-Byte: Variational Tokenizer is the Key for Autoregressive 3D Generation
  • [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
  • Negative Token Merging: Image-based Adversarial Feature Guidance
  • Token Cropr: Faster ViTs for Quite a Few Tasks
  • Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
  • ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
  • Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
  • TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
  • Training Noise Token Pruning
  • Efficient Multi-modal Large Language Models via Visual Token Grouping
  • Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
  • Attamba: Attending To Multi-Token States
  • [NeurIPSW 2024] ShowUI: One Vision-Language-Action Model for GUI Visual Agent
  • Importance-based Token Merging for Diffusion Models
  • Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
  • freePruner: A Training-free Approach for Large Multimodal Model Acceleration
  • Efficient Online Inference of Vision Transformers by Training-Free Tokenization
  • DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
  • LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
  • FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification
  • Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
  • FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
  • FoPru: Focal Pruning for Efficient Large Vision-Language Models
  • Principles of Visual Tokens for Efficient Video Understanding
  • LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
  • TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
  • Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
  • Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
  • [NeurIPS 2024] Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
  • [NeurIPS 2024 Spotlight] Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
  • Inference Optimal VLMs Need Only One Visual Token but Larger Models
  • PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
  • [NeurIPS 2024] EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching
  • [NeurIPS 2024] Video Token Merging for Long-form Video Understanding
  • MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
  • Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
  • LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
  • PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
  • xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
  • CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation
  • Is Less More? Exploring Token Condensation as Training-free Adaptation for CLIP
  • [EMNLP 2024] Rethinking Token Reduction for State Space Models
  • Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
  • Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
  • VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
  • big.LITTLE Vision Transformer for Efficient Visual Recognition
  • [NeurIPSW 2024] Token Pruning using a Lightweight Background Aware Vision Transformer
  • ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
  • PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
  • Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
  • Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
  • TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
  • SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
  • [EMNLP 2024 Findings] From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression
  • [ICLR 2025] Dynamic Diffusion Transformer
  • AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
  • [EMNLP 2024] FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model
  • AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
  • Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
  • Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads
  • VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
  • [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
  • [NeurIPS 2024] Exploring Token Pruning in Vision State Space Models
  • Token Caching for Diffusion Transformer Acceleration
  • Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
  • [WACV 2025] Patch Ranking: Token Pruning as Ranking Prediction for Efficient CLIP
  • Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
  • [ECCV 2024] Agglomerative Token Clustering
  • Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving
  • [COLING 2025] Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
  • CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
  • [AAAI 2025] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
  • [ECCVW 2024] Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion
  • TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
  • [WACV 2025] VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
  • Enhancing Long Video Understanding via Hierarchical Event-Based Memory
  • Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task
  • mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
  • TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration
  • LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture
  • [AAAI 2025] Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
  • Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based on Image Text Interaction
  • [ICLR 2025] TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
  • [ECCV 2024] Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
  • Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
  • TReX- Reusing Vision Transformer's Attention for Efficient Xbar-based Computing
  • AT-SNN: Adaptive Tokens for Vision Transformer on Spiking Neural Network
  • Practical token pruning for foundation models in few-shot conversational virtual assistant systems
  • [AAAI 2025] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
  • Dynamic and Compressive Adaptation of Transformers From Images to Videos
  • [ECCV 2024] Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
  • [ACMMM 2024] ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack
  • A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
  • Exploring The Neural Burden In Pruned Models: An Insight Inspired By Neuroscience
  • [NeurIPS 2024] Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
  • SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
  • Efficient Visual Transformer by Learnable Token Merging
  • Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
  • LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
  • Pose-guided Multi-task Video Transformer for Driver Action Recognition
  • [ECCV 2024] LookupViT: Compressing Visual Information to a Limited Number of Tokens
  • [TPAMI 2024] TCFormer: Visual Recognition via Token Clustering Transformer
  • [ECCV 2024] Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
  • [ECCV 2024] GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
  • [Interspeech] LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
  • [ICMLW 2024] Characterizing Prompt Compression Methods for Long Context Inference
  • HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
  • Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
  • PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
  • ALPINE: An Adaptive Language-Agnostic Pruning Method for Language Models for Code
  • TokenPacker: Efficient Visual Projector for Multimodal LLM
  • [ECCV 2024] LPViT: Low-Power Semi-structured Pruning for Vision Transformers
  • [ACL 2024 Findings] Concise and Precise Context Compression for Tool-Using Language Models
  • DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models
  • [ICASSP 2023] Papez: Resource-Efficient Speech Separation with Auditory Working Memory
  • [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression
  • [AAAI 2025] DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
  • [CVPR 2024] ScanFormer: Referring Expression Comprehension by Iteratively Scanning
  • [ICLR 2025] Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models
  • [EMNLP 2024] Bridging Local Details and Global Context in Text-Attributed Graphs
  • [EMNLP 2024] Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
  • VoCo-LLaMA: Towards Vision Compression with Large Language Models
  • Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities
  • [CVPR 2024] ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
  • SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
  • [NeurIPS 2024] COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
  • [CVPRW 2024] ToSA: Token Selective Attention for Efficient Vision Transformers
  • [Interspeech 2024] FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
  • An Image is Worth 32 Tokens for Reconstruction and Generation
  • [ICML 2024] LoCoCo: Dropping In Convolutions for Long Context Compression
  • REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning
  • Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study
  • DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
  • [ICML 2024] MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization
  • [EMNLP 2023 Findings] Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
  • DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
  • [NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
  • Efficient Time Series Processing for Transformers and State-Space Models through Token Merging
  • Matryoshka Multimodal Models
  • [NeurIPS 2024] Accelerating Transformers with Spectrum-Preserving Token Merging
  • Efficient Point Transformer with Dynamic Token Aggregating for LiDAR Point Cloud Processing
  • Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
  • [MIPR 2024] Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation
  • [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
  • [CVPRW 2024] Block Selective Reprogramming for On-device Training of Vision Transformers
  • [IJCAI 2024] LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation
  • Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
  • [CVPR 2024] Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
  • [EMNLP 2024] TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
  • [ICLR 2024] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs
  • Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
  • TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
  • CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
  • [CVPR 2024] HRVDA: High-Resolution Visual Document Assistant
  • InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
  • [CVPR 2024] MLP Can Be A Good Transformer Learner
  • [CVPR 2024] Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
  • Training LLMs over Neurally Compressed Text
  • [ECCV 2024] LongVLM: Efficient Long Video Understanding via Large Language Models
  • [CVPR 2024] Learning to Rank Patches for Unbiased Image Redundancy Reduction
  • [CVPR 2024] A General and Efficient Training for Transformer via Token Expansion
  • [CVPR 2024] Dense Vision Transformer Compression with Few Samples
  • [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
  • [ECCV 2024] PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
  • LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
  • FIT-RAG: Black-Box RAG with Factual Information and Token Reduction
  • [FCCM 2024] Accelerating ViT Inference on FPGA through Static and Dynamic Pruning
  • [CVPR 2024] vid-TLDR: Training Free Token Merging for Light-weight Video Transformer
  • [ACM TOIS 2024] An Analysis on Matching Mechanisms and Token Pruning for Late-interaction Models
  • HCPM: Hierarchical Candidates Pruning for Efficient Detector-Free Matching
  • [NeurIPS 2024] Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation
  • [CVPR 2024] Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
  • [ECCV 2024] PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
  • Learnable Community-Aware Transformer for Brain Connectome Analysis with Token Clustering
  • [ECCV 2024 Oral] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
  • [EMNLP 2024 Findings] Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
  • [ECCV 2024] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
  • [CVPR 2024] MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
  • Motion Guided Token Compression for Efficient Masked Video Modeling
  • [IJCAI 2024] ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
  • MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  • Rethinking Optimization and Architecture for Tiny Language Models
  • DeSparsify: Adversarial Attack Against Token Sparsification Mechanisms in Vision Transformers
  • A Deep Hierarchical Feature Sparse Framework for Occluded Person Re-Identification
  • [ECCV 2024] Object-Centric Diffusion for Efficient Video Editing
  • [INFOCOM 2024] OTAS: An Elastic Transformer Serving System via Token Adaptation
  • HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition
  • [WACV 2024] TPC-ViT: Token Propagation Controller for Efficient Vision Transformer
  • [ECCV 2024] Removing Rows and Columns of Tokens in Vision Transformer Enables Faster Dense Prediction without Retraining
  • [ACL 2024] Accelerating Transformers by Sparsifying Information Flows
  • [ACL 2024 Findings] What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization
  • [EMNLP 2024 Findings] Vanessa: Visual Connotation and Aesthetic Attributes Understanding Network for Multimodal Aspect-based Sentiment Analysis
  • [NeurIPS 2024] MG-ViT: A Multi-Granularity Method for Compact and Efficient Vision Transformers
  • LVP: Language-guided Visual Projector for Efficient Multimodal LLM
  • [ECCV 2024] IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models
  • [AAAI 2024] TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities
  • Connectivity-based Token Condensation for Efficient Vision Transformer
  • [CVPRW 2024] Efficient Transformer Adaptation with Soft Token Merging
  • CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models
  • [ICLRW 2024] Energy Minimizing-based Token Merging for Accelerating Transformers
  • RanMerFormer: Randomized Vision Transformer with Token Merging for Brain Tumor Classification
  • [NeurIPSW 2024] M2M-TAG: Training-Free Many-to-Many Token Aggregation for Vision Transformer Acceleration
  • [ECCV 2024] Efficient Vision Transformers with Partial Attention

2023

  • [CVPR 2024] VidToMe: Video Token Merging for Zero-Shot Video Editing
  • [ECCV 2024] Agent Attention: On the Integration of Softmax and Linear Attention
  • [CVPR 2024] Honeybee: Locality-enhanced Projector for Multimodal LLM
  • [AAAI 2024] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
  • [WACV 2024] Token Fusion: Bridging the Gap between Token Pruning and Token Merging
  • [ECCV 2024] LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
  • [CVPR 2024] Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
  • [WACV 2025] TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration
  • [CVPR 2024] Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
  • [ICASSP 2023] SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer
  • [CVPR 2024 Highlight] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
  • [WACV 2024 Oral] GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation
  • [NeurIPS 2023] AiluRus: A Scalable ViT Framework for Dense Prediction
  • [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
  • Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning
  • [EMNLP 2023 Findings] Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules
  • [EMNLP 2023 Findings] TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
  • NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding
  • SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning
  • [Applied Informatics 2023] No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling
  • [ACL 2024] Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
  • PPT: Token Pruning and Pooling for Efficient Vision Transformers
  • ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
  • CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
  • [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
  • [ICCV 2023] Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
  • [ICCV 2023] SG-Former: Self-guided Transformer with Evolving Token Reallocation
  • [ICCV 2023] Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation
  • [ICCVW 2023] Which Tokens to Use? Investigating Token Reduction in Vision Transformers
  • [ICCV 2023] Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
  • DiT: Efficient Vision Transformers with Dynamic Token Routing
  • [WACV 2023] Dynamic Token-Pass Transformers for Semantic Segmentation
  • [ICCV 2023] Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
  • [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
  • [TMLR 2023] Learned Thresholds Token Merging and Pruning for Vision Transformers
  • [ICCVW 2023] MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
  • [EMNLP 2024] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
  • [Interspeech 2023] Accelerating Transducers through Adjacent Token Merging
  • [KDD 2023] Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference
  • Vision Transformer with Attention Map Hallucination and FFN Compaction
  • [WACV 2023] Revisiting Token Pruning for Object Detection and Instance Segmentation
  • Multi-Scale And Token Mergence: Make Your ViT More Efficient
  • [CVPR 2023] Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers
  • [ICCV 2023] DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
  • [ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
  • [ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
  • [CVPR 2024] Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
  • Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification
  • [LREC 2024] SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
  • Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model
  • [IJCAI 2023] Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
  • [IJCAI 2023] TG-VQA: Ternary Game of Video Question Answering
  • [CVPR 2023] Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
  • [CVPR 2023] SViTT: Temporal Learning of Sparse Video-Text Transformers
  • [ICCV 2023] Efficient Video Action Detection with Token Dropout and Context Refinement
  • Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation
  • [NeurIPS 2023] Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
  • [TMM 2023] Attention Map Guided Transformer Pruning for Edge Device
  • [CVPR 2023] SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformers
  • [CVPRW 2023] Token Merging for Fast Stable Diffusion
  • [CVPR 2023] Selective Structured State-Spaces for Long-Form Video Understanding
  • [CVPR 2023 Highlight] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
  • [CVPR 2023] Making Vision Transformers Efficient from A Token Sparsification View
  • [IPMI 2023] Token Sparsification for Faster Medical Image Segmentation
  • [ICMLW 2024] Training-Free Visual Token Compression via Delayed Spatial Merging
  • [ECCV 2024] The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
  • [ICLR 2023] A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
  • [ICML 2023] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  • Image Compression Is an Effective Objective for Visual Representation Learning
  • [EMNLP 2023 Findings] Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning
  • [EMNLP 2023] Leap-of-Thought: Accelerating Transformers via Dynamic Token Routing
  • [ICCV 2023] Building Vision Transformers with Hierarchy Aware Feature Aggregation
  • [CVPR 2023] Dynamic Inference with Grounding Based Vision and Language Models
  • [ICLR 2023] Sparse Token Transformer With Attention Back Tracking
  • [ICLR 2023] Progressively Compressed Auto-Encoder for Self-supervised Representation Learning

2022

  • SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
  • [CVPR 2023] Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
  • [ICCV 2023] TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
  • [HPCA 2023] HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers
  • [ICCV 2023] Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
  • [ICASSP 2023] ProContEXT: Exploring Progressive Context Transformer for Tracking
  • [ICLR 2023 Oral] Token Merging: Your ViT But Faster
  • SaiT: Sparse Vision Transformers through Adaptive Token Pruning
  • [IJCAI 2023] Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention
  • [ECCV 2022] PPT: Token-Pruned Pose Transformer for Monocular and Multi-View Human Pose Estimation
  • [TPAMI 2022] Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
  • [ACL 2022] Transkimmer: Transformer Learns to Layer-wise Skim
  • [CVPR 2022 Oral] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
  • ITTR: Unpaired Image-to-Image Translation with Transformers
  • [AAAI 2023] CF-ViT: A General Coarse-to-Fine Method for Vision Transformer
  • [Neural Networks 2022] Multi-Tailed Vision Transformer for Efficient Inference
  • [ICLR 2022 Spotlight] Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
  • [CVPR 2022] Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space
  • [COLING 2022] Token and Head Adaptive Transformers for Efficient Natural Language Processing
  • HFSP: A Hardware-friendly Soft Pruning Framework for Vision Transformers

2021

  • [ECCV 2022] SPViT: Enabling Faster Vision Transformers via Latency-aware Soft Token Pruning
  • [CVPR 2022 Oral] AdaViT: Adaptive Tokens for Efficient Vision Transformer
  • A Study on Token Pruning for ColBERT
  • [ECCV 2022 Oral] Adaptive Token Sampling For Efficient Vision Transformers
  • [ECCV 2022] Self-slimmed Vision Transformer
  • [ECCV 2022] Efficient Video Transformers with Spatial-Temporal Token Selection
  • Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning
  • [WACV 2023] Token Pooling in Vision Transformers
  • [AAAI 2022] Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
  • [KDD 2022] Learned Token Pruning for Transformers
  • [NeurIPS 2021] IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers
  • [NeurIPS 2021] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
  • [NeurIPS 2021] Chasing Sparsity in Vision Transformers: An End-to-End Exploration
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
  • [CVPR 2022] Patch Slimming for Efficient Vision Transformers
  • [NeurIPS 2021] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
  • [AAAI 2022] Less is More: Pay Less Attention in Vision Transformers
  • [NAACL 2021] TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

2020

  • [ICML 2021] Training data-efficient image transformers & distillation through attention
  • [HPCA 2021] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
  • [ICML 2020] PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

🤝 Contributing

Contributions are welcome! If you find relevant papers or have suggestions, feel free to open an issue or submit a pull request.

📜 License

CC0

This work is licensed under the CC0 1.0 Universal License. To the extent possible under law, Sangmin Woo has waived all copyright and related or neighboring rights to this work.
