
Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild


This is the official implementation of the paper "Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild".

Our dataset is available at CapQA.

How to use

The training code in this repo may be out of date; you can use the latest code from LLaVA instead.

Please note that you need to replace the original llava/mm_utils with the version from this repository, as it contains several newly added functions that implement our "Socratic Questioning" methodology.
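If you are working from an upstream LLaVA checkout, the replacement is a single copy. The sketch below assumes the two repositories sit side by side; adjust the paths to your own layout:

# Copy this repo's modified mm_utils over the upstream LLaVA copy.
# Paths are illustrative; the glob covers mm_utils being either a single
# module file or a package directory.
cp -r SocraticQuestioning/llava/mm_utils* LLaVA/llava/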

The evaluation procedure closely follows that of LLaVA. Please download eval.zip and unzip it under ./playground/data/.
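Assuming eval.zip has already been downloaded into the repository root, a minimal setup looks like:

mkdir -p ./playground/data
unzip eval.zip -d ./playground/data/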

Weights

| Model Type | Model Name             |
| ---------- | ---------------------- |
| base model | vicuna-7b-v1.5         |
| LoRA model | sq-llava-v1.5-7b-lora  |
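If you prefer a standalone checkpoint rather than loading base + LoRA at inference time, you can merge the LoRA weights into the base model. The sketch below uses LLaVA's scripts/merge_lora_weights.py; the local checkpoint paths are assumptions:

# Merge the LoRA adapter into the Vicuna base model (paths are placeholders).
python scripts/merge_lora_weights.py \
    --model-path ./checkpoints/sq-llava-v1.5-7b-lora \
    --model-base lmsys/vicuna-7b-v1.5 \
    --save-model-path ./checkpoints/sq-llava-v1.5-7b-merged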

Finetune

You may need to modify data_path, vision_tower, pretrain_mm_mlp_adapter, and output_dir to your own local paths.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_lora_capQA.sh
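For reference, these options appear as plain arguments inside scripts/v1_5/finetune_lora_capQA.sh; the excerpt below only shows where they sit, with placeholder values rather than the ones shipped in the script:

# Excerpt from scripts/v1_5/finetune_lora_capQA.sh (values are placeholders):
    --data_path ./playground/data/capqa_train.json \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-pretrain/mm_projector.bin \
    --output_dir ./checkpoints/sq-llava-v1.5-7b-lora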

Evaluation

CapQA

  1. Download CapQA from Hugging Face and place it under ./playground/data/eval/capqa (a download sketch follows the command below).
  2. Run single-GPU inference and evaluation.

Evaluating 1-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/capqa.sh
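For step 1, one way to fetch the dataset is to clone it from Hugging Face; <ORG>/CapQA below is a placeholder for the actual dataset id linked above:

mkdir -p ./playground/data/eval
# Replace <ORG>/CapQA with the real Hugging Face dataset id.
git clone https://huggingface.co/datasets/<ORG>/CapQA ./playground/data/eval/capqa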

Evaluating 3-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/capqa_3turn.sh

ScienceQA

  1. Under ./playground/data/eval/scienceqa, download images, pid_splits.json, and problems.json from the data/scienceqa folder of the ScienceQA repo (see the sketch after this list).
  2. Run single-GPU inference and evaluation.
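For step 1, the two JSON files can be copied out of a clone of the ScienceQA repo. The test images are distributed separately, so the last step is only a reminder, and the images/test layout is an assumption borrowed from LLaVA's evaluation setup:

mkdir -p ./playground/data/eval/scienceqa
git clone https://github.com/lupantech/ScienceQA.git /tmp/ScienceQA
cp /tmp/ScienceQA/data/scienceqa/pid_splits.json ./playground/data/eval/scienceqa/
cp /tmp/ScienceQA/data/scienceqa/problems.json ./playground/data/eval/scienceqa/
# Download the ScienceQA images separately (see the ScienceQA repo) and place
# them under ./playground/data/eval/scienceqa/images/test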

Evaluating 1-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh

Evaluating 3-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa_3turn.sh

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, then extract them to ./playground/data/eval/textvqa (see the sketch after this list).
  2. Run single-GPU inference and evaluation.
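The URLs below are the ones used in LLaVA's evaluation guide and are an assumption here; if they have moved, fetch the files from the TextVQA website instead:

mkdir -p ./playground/data/eval/textvqa
wget -P ./playground/data/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget -P ./playground/data/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip ./playground/data/eval/textvqa/train_val_images.zip -d ./playground/data/eval/textvqa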

Evaluating 1-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh

Evaluating 3-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa_3turn.sh

MM-Vet

  1. Extract mm-vet.zip to ./playground/data/eval/mmvet (see the sketch after this list).
  2. Run single-GPU inference.
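Assuming mm-vet.zip has already been downloaded from the MM-Vet project page, the setup is:

mkdir -p ./playground/data/eval/mmvet
unzip mm-vet.zip -d ./playground/data/eval/mmvet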

Evaluating 1-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh

Evaluating 3-Turn Inference

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet_3turn.sh
  3. Evaluate the predictions in ./playground/data/eval/mmvet/results with the official Jupyter notebook, following the instructions in the MM-Vet repo.

Citation

@article{SocraticQuestioning2025,
  title={Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild},
  author={Wanpeng Hu and Haodi Liu and Lin Chen and Feng Zhou and Changming Xiao and Qi Yang and Changshui Zhang},
  journal={arXiv preprint arXiv:2501.02964},
  year={2025},
  url={https://arxiv.org/abs/2501.02964}
}
