Tongtian Yue1,3*,
Jie Cheng2,3*,
Longteng Guo1,3*,
Xingyuan Dai2,3,
Zijia Zhao1,3,
Xingjian He1,3,
Gang Xiong2,3,
Yisheng Lv2,3,
Jing Liu1,3†
1Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA
2State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
3School of Artificial Intelligence, University of Chinese Academy of Sciences
CVPR, 2024
Create a conda environment and install dependencies:
conda create -n sc_tune python=3.10
conda activate sc_tune
pip install -r requirements.txt
Download the Qwen-VL-Chat checkpoint (10 *.bin files in total) into Qwen-VL-Chat/, and download the Object365 images.
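Since the checkpoint is split across 10 *.bin files, an interrupted download is easy to miss. The following is a minimal sketch of a sanity check, not part of this repo: it only counts the *.bin files directly inside the checkpoint directory. The helper names `find_bin_files` and `checkpoint_looks_complete` are hypothetical, and the check assumes the files sit at the top level of Qwen-VL-Chat/.

```python
from pathlib import Path

# The README states the checkpoint consists of 10 *.bin files in total.
EXPECTED_BIN_FILES = 10


def find_bin_files(checkpoint_dir):
    """Return the sorted *.bin files directly inside checkpoint_dir.

    Returns an empty list if the directory does not exist.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob("*.bin") if p.is_file())


def checkpoint_looks_complete(checkpoint_dir, expected=EXPECTED_BIN_FILES):
    """True if the number of *.bin files matches the expected count.

    This checks the file count only; it does not verify file contents.
    """
    return len(find_bin_files(checkpoint_dir)) == expected


if __name__ == "__main__":
    ckpt = "Qwen-VL-Chat"  # path used in this README
    found = find_bin_files(ckpt)
    status = "ok" if len(found) == EXPECTED_BIN_FILES else "incomplete"
    print(f"{ckpt}: {len(found)}/{EXPECTED_BIN_FILES} *.bin files ({status})")
```

A matching count does not guarantee that the files are intact, so a partially written file would still pass; treat this as a quick first check before launching training.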
Note
We have modified the code in Qwen-VL-Chat/visual.py. Please replace the original file with the one in this repo if necessary.
Set the path to the Object365 images in scripts/finetune_ds.sh. Other hyperparameters can also be found in this file. Then run the script to start fine-tuning:
sh scripts/finetune_ds.sh
The main code implementing the SC-Tune method is in transformers/trainer.py and transformers/trainer_utils.py.
This repo benefits from Qwen-VL, TRL, and MOSS. Thanks for their wonderful work.
@article{yue2024sc,
title={SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models},
author={Yue, Tongtian and Cheng, Jie and Guo, Longteng and Dai, Xingyuan and Zhao, Zijia and He, Xingjian and Xiong, Gang and Lv, Yisheng and Liu, Jing},
journal={arXiv preprint arXiv:2403.13263},
year={2024}
}