
🔎 Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration 🚀

Xuyang Liu1, Ziming Wang2, Yuhang Han3, Yingyao Wang2, Jiale Yuan2, Jun Song2✉, Bo Zheng2,
Linfeng Zhang4, Siteng Huang5, Honggang Chen1✉

1Sichuan University, 2Taobao & Tmall Group of Alibaba,
3Northeast Forestry University, 4Shanghai Jiaotong University, 5Zhejiang University

🔥 News

  • 2025.01.10 🤗🤗 We release our latest work GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution MLLMs. Code is available!
  • 2024.11.17 🤗🤗 We release our work FiCoCo, which proposes a unified paradigm to demystify popular approaches and guide future designs of training-free token reduction for MLLMs. Code is available!

✨ Overview

TLDR: We present GlobalCom2, a novel token compression method for high-resolution MLLMs that uses thumbnail tokens to guide crop compression.

Abstract: Multimodal large language models (MLLMs) have attracted considerable attention due to their exceptional performance in visual content understanding and reasoning. However, their inference efficiency has been a notable concern, as the increasing length of multimodal contexts leads to quadratic complexity. Token compression techniques, which reduce the number of visual tokens, have demonstrated their effectiveness in reducing computational costs. Yet, these approaches have struggled to keep pace with the rapid advancements in MLLMs, especially the AnyRes strategy in the context of high-resolution image understanding. In this paper, we propose a novel token compression method, GlobalCom2, tailored for high-resolution MLLMs that receive both the thumbnail and multiple crops. GlobalCom2 treats the tokens derived from the thumbnail as the “commander” of the entire token compression process, directing the allocation of retention ratios and the specific compression for each crop. In this way, redundant tokens are eliminated while important local details are adaptively preserved to the highest extent feasible. Empirical results across 10 benchmarks reveal that GlobalCom2 achieves an optimal balance between performance and efficiency, and consistently outperforms state-of-the-art token compression methods with LLaVA-NeXT-7B/13B models.
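To make the "commander" idea concrete, the minimal sketch below (illustrative only, not the repository's implementation; the function name, shapes, and pooling strategy are assumptions) shows how the thumbnail's [CLS] attention could be pooled over each crop's region and turned into per-crop retention budgets:

```python
# Illustrative sketch of "global-to-local" budget allocation (not the actual code):
# the thumbnail's [CLS] attention estimates how important each crop is, and the
# global token budget is split across crops accordingly.
import torch

def allocate_crop_budgets(thumb_cls_attn, crop_grid, retention_ratio, tokens_per_crop):
    """
    thumb_cls_attn:  (H, W) [CLS]->patch attention over the thumbnail grid.
    crop_grid:       (rows, cols) layout of high-resolution crops (AnyRes).
    retention_ratio: global fraction of visual tokens to keep (e.g. 0.25).
    tokens_per_crop: number of visual tokens each crop produces.
    Returns the number of tokens to keep for every crop.
    """
    rows, cols = crop_grid
    # Pool thumbnail attention over the region covered by each crop.
    region_scores = torch.nn.functional.adaptive_avg_pool2d(
        thumb_cls_attn[None, None], (rows, cols)).flatten()
    weights = region_scores / region_scores.sum()
    # Split the global budget proportionally to each crop's global importance.
    total_budget = int(retention_ratio * tokens_per_crop * rows * cols)
    budgets = (weights * total_budget).round().long().clamp(max=tokens_per_crop)
    return budgets.tolist()
```

In this view, crops that the global thumbnail deems informative keep more of their tokens, while background-dominated crops are compressed more aggressively.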

💥 Core Codes

The two key functions in llava/model/llava_arch.py implement our global-guided local compression: (a) generate_scale_for_crop_features, which allocates optimal retention ratios based on each crop's global importance, and (b) interpolate_and_split_cls_attn_scores, which performs the per-crop token compression guided by importance scores from the global (thumbnail) perspective.
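The sketch below (hypothetical shapes and names, not the actual code in llava/model/llava_arch.py) illustrates the role of the second step: the thumbnail's attention map is interpolated to a crop's token grid and used to keep the locally most informative tokens within that crop's budget:

```python
# Illustrative sketch of per-crop compression guided by the global view
# (hypothetical shapes; the real logic lives in llava/model/llava_arch.py).
import torch
import torch.nn.functional as F

def compress_crop(crop_tokens, thumb_cls_attn_region, keep_k):
    """
    crop_tokens:           (h*w, d) visual tokens of one high-resolution crop.
    thumb_cls_attn_region: (h0, w0) thumbnail attention over this crop's region.
    keep_k:                number of tokens to retain for this crop.
    """
    hw, d = crop_tokens.shape
    h = w = int(hw ** 0.5)
    # Upsample the coarse thumbnail attention to the crop's token grid.
    scores = F.interpolate(thumb_cls_attn_region[None, None], size=(h, w),
                           mode="bilinear", align_corners=False).flatten()
    # Keep the tokens that the global (thumbnail) view considers most informative,
    # preserving their original spatial order.
    keep_idx = scores.topk(keep_k).indices.sort().values
    return crop_tokens[keep_idx]
```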

🛠 Preparation

  1. Clone this repository.
git clone https://github.com/xuyang-liu16/GlobalCom2.git
cd GlobalCom2
  2. Environment Setup
 conda create -n GlobalCom2 python=3.10 -y
 conda activate GlobalCom2
 pip install -e .
  3. Download Multimodal Benchmark

Please follow the detailed instructions in LLaVA-Evaluation.

  4. Download LLaVA-NeXT-7B and LLaVA-NeXT-13B, and put them under ./liuhaotian/llava-next-7b and ./liuhaotian/llava-next-13b, respectively.

For users with limited access to Hugging Face (e.g., from mainland China), you can refer to this alternative guide and use the following commands, with LLaVA-NeXT-7B as an example:

pip install -U huggingface_hub hf_transfer -i https://mirrors.aliyun.com/pypi/simple/
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download liuhaotian/llava-v1.6-vicuna-7b --local-dir ./liuhaotian/llava-next-7b

🚀 Evaluation

👉 The only hyper-parameter is retention_ratio, defined at line 101 of llava/model/llava_arch.py. Different acceleration levels can be achieved by adjusting retention_ratio (default: retention_ratio = 0.25).
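As a back-of-the-envelope illustration of what retention_ratio means in token counts (the numbers below assume LLaVA-NeXT's 576 tokens per 336x336 view and a 2x2 crop grid; the exact counts depend on the AnyRes configuration and on how the thumbnail itself is handled):

```python
tokens_per_view = 576       # 24x24 patches per 336x336 view in LLaVA-NeXT
views = 1 + 4               # thumbnail + 2x2 high-resolution crops (AnyRes)
retention_ratio = 0.25      # default value in llava/model/llava_arch.py

total_tokens = tokens_per_view * views              # 2880 visual tokens before compression
kept_tokens = int(retention_ratio * total_tokens)   # ~720 visual tokens fed to the LLM
# (whether the thumbnail tokens are also compressed depends on the implementation)
print(total_tokens, kept_tokens)
```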

Example for evaluating TextVQA results:

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh

Example for evaluating MME results:

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh

To calculate the theoretical computational efficiency reported above, we recommend the methodology of LLM-Viewer. We deeply appreciate their outstanding contribution to this field.
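For intuition only (this is a generic estimate, not LLM-Viewer's methodology, and the 7B-scale hyper-parameters below are approximate), a standard prefill FLOPs formula shows why pruning visual tokens helps: attention cost grows quadratically with sequence length, while the projections and FFN grow linearly:

```python
def prefill_flops(n_tokens, d_model=4096, n_layers=32, ffn_mult=4):
    """Rough per-forward-pass FLOPs estimate for the prefill stage of a decoder
    (standard back-of-the-envelope formula, counting a multiply-add as 2 FLOPs)."""
    attn_proj = 8 * n_tokens * d_model ** 2                  # Q, K, V, O projections
    attn_score = 4 * n_tokens ** 2 * d_model                 # QK^T and attention * V
    ffn = 4 * n_tokens * d_model * (ffn_mult * d_model)      # up + down projection
    return n_layers * (attn_proj + attn_score + ffn)

# Example: keeping 25% of ~2880 visual tokens, plus a ~100-token text prompt.
full = prefill_flops(2880 + 100)
compressed = prefill_flops(int(0.25 * 2880) + 100)
print(f"compressed/full ≈ {compressed / full:.2f}")
```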

🩻 Visualization

To visualize the compression results shown above, we recommend the visualization utilities provided in tools/, which include mask visualization and attention-score visualization. We hope these tools help in understanding the compression mechanism.
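For a quick, self-contained picture of what mask visualization looks like, the hypothetical sketch below (all names are illustrative, not the tools/ API) overlays a token-retention mask on the input image by dimming pruned patches:

```python
# Hypothetical mask-visualization sketch: dim the image patches whose tokens were pruned.
import numpy as np
import matplotlib.pyplot as plt

def show_token_mask(image, keep_mask, patch=14):
    """
    image:     (H, W, 3) array in [0, 1], with H and W divisible by `patch`.
    keep_mask: (H // patch, W // patch) boolean grid, True = token retained.
    """
    mask = np.kron(keep_mask, np.ones((patch, patch)))[..., None]  # upsample mask to pixels
    overlay = image * (0.35 + 0.65 * mask)                         # darken pruned patches
    plt.imshow(overlay)
    plt.axis("off")
    plt.show()
```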

📌 Citation

If our findings help your research, please consider citing our paper in your publications.

@article{Liu2025:GlobalCom,
    title={Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration}, 
    author={Xuyang Liu and Ziming Wang and Yuhang Han and Yingyao Wang and Jiale Yuan and Jun Song and Bo Zheng and Linfeng Zhang and Siteng Huang and Honggang Chen},
    year={2025},
    eprint={2501.05179},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

💻 Related Works

  • Awesome Token Reduction for Model Compression: An open-source repository that curates a collection of recent awesome papers on token reduction for model compression.
  • FiCoCo: A systematic study that proposes a unified "filter-correlate-compress" paradigm for training-free token reduction in MLLMs, achieving up to 82.4% FLOPs reduction while maintaining model performance.

👍 Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA and LLM-Viewer.

📩 Contact

For any question about our paper or code, please email [email protected].
