This is the official implementation of the paper "Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild".
Our dataset is available at CapQA.
The training code in this repo may be out of date; you can use the latest code from LLaVA. Please note that you need to replace the original `llava/mm_utils` with the new `llava/mm_utils` from this repository, as it contains several newly added functions implementing our "Socratic Questioning" methodology.
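A minimal sketch of the replacement step, assuming the upstream LLaVA checkout lives at `../LLaVA` and that `mm_utils` refers to `llava/mm_utils.py` (both paths are assumptions, adjust to your setup):

```bash
# Back up the upstream file, then copy in the version from this repository.
# Paths are placeholders; point them at your own checkouts.
cp ../LLaVA/llava/mm_utils.py ../LLaVA/llava/mm_utils.py.bak
cp ./llava/mm_utils.py ../LLaVA/llava/mm_utils.py
```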
The evaluation procedure closely follows that of LLaVA. Please download eval.zip and unzip it under `./playground/data/`.
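For example, assuming `eval.zip` has been downloaded to the repository root:

```bash
# Unpack the evaluation data under ./playground/data/
mkdir -p ./playground/data
unzip eval.zip -d ./playground/data/
```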
| Model Type | Model Name |
|---|---|
| base model | vicuna-7b-v1.5 |
| lora model | sq-llava-v1.5-7b-lora |
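If you fetch the weights from Hugging Face, a sketch along the following lines should work. The base-model repo id `lmsys/vicuna-7b-v1.5` is the standard one; the LoRA repo id is a placeholder for wherever the released checkpoint is hosted:

```bash
# Base model (real repo id) and LoRA weights (placeholder repo id).
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir ./checkpoints/vicuna-7b-v1.5
huggingface-cli download <org>/sq-llava-v1.5-7b-lora --local-dir ./checkpoints/sq-llava-v1.5-7b-lora
```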
You may need to modify `data_path`, `vision_tower`, `pretrain_mm_mlp_adapter`, and `output_dir` to your own local paths; a sketch of the relevant fields is shown below.
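The flags below mirror LLaVA's standard LoRA fine-tuning arguments; the exact paths and the surrounding hyperparameters in `scripts/v1_5/finetune_lora_capQA.sh` are assumptions, so treat this only as a sketch of where the four fields live:

```bash
# Hypothetical excerpt of scripts/v1_5/finetune_lora_capQA.sh;
# replace the paths with your own local ones. The real script passes
# additional training flags (LoRA rank, learning rate, etc.).
deepspeed llava/train/train_mem.py \
    --lora_enable True \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./checkpoints/vicuna-7b-v1.5 \
    --data_path ./playground/data/capqa_train.json \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-pretrain/mm_projector.bin \
    --output_dir ./checkpoints/sq-llava-v1.5-7b-lora
```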
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_lora_capQA.sh
- Under `./playground/data/eval/capqa`, download CapQA from Hugging Face (a hedged download sketch follows this list).
- Single-GPU inference and evaluation, using the commands below.
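A sketch of the data download, assuming the CapQA evaluation split is hosted as a Hugging Face dataset (the repo id below is a placeholder; substitute the actual one):

```bash
# Placeholder repo id; replace <org>/CapQA with the real dataset id.
huggingface-cli download <org>/CapQA --repo-type dataset \
    --local-dir ./playground/data/eval/capqa
```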
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/capqa.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/capqa_3turn.sh
- Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo (a hedged download sketch follows this list).
- Single-GPU inference and evaluation, using the commands below.
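A sketch of one way to fetch these files, assuming you clone the ScienceQA repo (the clone location is arbitrary; the test images are distributed separately, per that repo's instructions):

```bash
# Copy the ScienceQA metadata into the eval folder.
git clone https://github.com/lupantech/ScienceQA.git /tmp/ScienceQA
mkdir -p ./playground/data/eval/scienceqa
cp /tmp/ScienceQA/data/scienceqa/pid_splits.json ./playground/data/eval/scienceqa/
cp /tmp/ScienceQA/data/scienceqa/problems.json ./playground/data/eval/scienceqa/
# Download the `images` folder following the ScienceQA repo's instructions
# and place it under ./playground/data/eval/scienceqa/images as well.
```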
Evaluating 1-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa_3turn.sh
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa` (a hedged download sketch follows this list).
- Single-GPU inference and evaluation, using the commands below.
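A download sketch; the URLs below are the ones LLaVA's evaluation guide points to, so double-check them against the official TextVQA site:

```bash
# Annotations and images for the TextVQA validation split.
mkdir -p ./playground/data/eval/textvqa
wget -P ./playground/data/eval/textvqa \
    https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip -d ./playground/data/eval/textvqa
```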
Evaluating 1-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa_3turn.sh
- Extract `mm-vet.zip` to `./playground/data/eval/mmvet` (an extraction sketch follows this list).
- Single-GPU inference, using the commands below.
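For example, assuming `mm-vet.zip` has already been downloaded from the MM-Vet release page to the repository root:

```bash
# Unpack the MM-Vet evaluation data.
mkdir -p ./playground/data/eval/mmvet
unzip mm-vet.zip -d ./playground/data/eval/mmvet
```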
Evaluating 1-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
Evaluating 3-Turn Inference
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet_3turn.sh
- Evaluate the predictions in `./playground/data/eval/mmvet/results` with the official Jupyter notebook, following the instructions in the MM-Vet repo.
@article{SocraticQuestioning2025,
title={Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild},
author={Wanpeng Hu and Haodi Liu and Lin Chen and Feng Zhou and Changming Xiao and Qi Yang and Changshui Zhang},
journal={arXiv preprint arXiv:2501.02964},
year={2025},
url={https://arxiv.org/abs/2501.02964}
}