This is the official implementation of the paper "Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild".
Our dataset is available at CapQA.
The training code in this repo may be out of date; you can use the latest code from LLaVA. Please note that you need to replace the original `llava/mm_utils` with the new `llava/mm_utils` from this repository, as it contains several newly added functions implementing our "Socratic Questioning" methodology.
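A minimal sketch of the replacement step, assuming the upstream LLaVA checkout lives at `../LLaVA` and that `mm_utils` refers to `llava/mm_utils.py` (both paths are assumptions, adjust to your setup):

```bash
# Back up the upstream file, then copy in the version from this repository.
# Paths are placeholders; point them at your own checkouts.
cp ../LLaVA/llava/mm_utils.py ../LLaVA/llava/mm_utils.py.bak
cp ./llava/mm_utils.py ../LLaVA/llava/mm_utils.py
```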
The evaluation procedure closely follows that of LLaVA. Please download eval.zip and unzip it under `./playground/data/`.
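For example, assuming `eval.zip` has been downloaded to the repository root:

```bash
# Unpack the evaluation data under ./playground/data/
mkdir -p ./playground/data
unzip eval.zip -d ./playground/data/
```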
| Model Type | Model Name |
|---|---|
| base model | vicuna-7b-v1.5 |
| lora model | sq-llava-v1.5-7b-lora |
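If you fetch the weights from Hugging Face, a sketch along the following lines should work. The base-model repo id `lmsys/vicuna-7b-v1.5` is the standard one; the LoRA repo id is a placeholder for wherever the released checkpoint is hosted:

```bash
# Base model (real repo id) and LoRA weights (placeholder repo id).
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir ./checkpoints/vicuna-7b-v1.5
huggingface-cli download <org>/sq-llava-v1.5-7b-lora --local-dir ./checkpoints/sq-llava-v1.5-7b-lora
```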
You may need to modify `data_path`, `vision_tower`, `pretrain_mm_mlp_adapter`, and `output_dir` to your own local paths; a sketch of the relevant fields is shown below.
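The flags below mirror LLaVA's standard LoRA fine-tuning arguments; the exact paths and the surrounding hyperparameters in `scripts/v1_5/finetune_lora_capQA.sh` are assumptions, so treat this only as a sketch of where the four fields live:

```bash
# Hypothetical excerpt of scripts/v1_5/finetune_lora_capQA.sh;
# replace the paths with your own local ones. The real script passes
# additional training flags (LoRA rank, learning rate, etc.).
deepspeed llava/train/train_mem.py \
    --lora_enable True \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./checkpoints/vicuna-7b-v1.5 \
    --data_path ./playground/data/capqa_train.json \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-pretrain/mm_projector.bin \
    --output_dir ./checkpoints/sq-llava-v1.5-7b-lora
```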
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_lora_capQA.sh
- Under `./playground/data/eval/capqa`, download CapQA from Hugging Face (a hedged download sketch follows this list).
- Single-GPU inference and evaluation, using the commands below.
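A sketch of the data download, assuming the CapQA evaluation split is hosted as a Hugging Face dataset (the repo id below is a placeholder; substitute the actual one):

```bash
# Placeholder repo id; replace <org>/CapQA with the real dataset id.
huggingface-cli download <org>/CapQA --repo-type dataset \
    --local-dir ./playground/data/eval/capqa
```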
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/capqa.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/capqa_3turn.sh
- Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo (a hedged download sketch follows this list).
- Single-GPU inference and evaluation, using the commands below.
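A sketch of one way to fetch these files, assuming you clone the ScienceQA repo (the clone location is arbitrary; the test images are distributed separately, per that repo's instructions):

```bash
# Copy the ScienceQA metadata into the eval folder.
git clone https://github.com/lupantech/ScienceQA.git /tmp/ScienceQA
mkdir -p ./playground/data/eval/scienceqa
cp /tmp/ScienceQA/data/scienceqa/pid_splits.json ./playground/data/eval/scienceqa/
cp /tmp/ScienceQA/data/scienceqa/problems.json ./playground/data/eval/scienceqa/
# Download the `images` folder following the ScienceQA repo's instructions
# and place it under ./playground/data/eval/scienceqa/images as well.
```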
Evaluating 1-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa_3turn.sh
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa` (a hedged download sketch follows this list).
- Single-GPU inference and evaluation, using the commands below.
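A download sketch; the URLs below are the ones LLaVA's evaluation guide points to, so double-check them against the official TextVQA site:

```bash
# Annotations and images for the TextVQA validation split.
mkdir -p ./playground/data/eval/textvqa
wget -P ./playground/data/eval/textvqa \
    https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip -d ./playground/data/eval/textvqa
```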
Evaluating 1-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa_3turn.sh
- Extract `mm-vet.zip` to `./playground/data/eval/mmvet` (an extraction sketch follows this list).
- Single-GPU inference, using the commands below.
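For example, assuming `mm-vet.zip` has already been downloaded from the MM-Vet release page to the repository root:

```bash
# Unpack the MM-Vet evaluation data.
mkdir -p ./playground/data/eval/mmvet
unzip mm-vet.zip -d ./playground/data/eval/mmvet
```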
Evaluating 1-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet_3turn.sh
- Evaluate the predictions in `./playground/data/eval/mmvet/results` with the official Jupyter notebook, following the instructions in the MM-Vet repo.
@article{SocraticQuestioning2025,
title={Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild},
author={Wanpeng Hu and Haodi Liu and Lin Chen and Feng Zhou and Changming Xiao and Qi Yang and Changshui Zhang},
journal={arXiv preprint arXiv:2501.02964},
year={2025},
url={https://arxiv.org/abs/2501.02964}
}