We benchmark the image subsets of the ScienceQA validation and test sets and report Top-1 accuracy. All results are obtained with zero-shot inference. During evaluation, GPT-3.5-Turbo-0613 serves as the choice extractor for all VLMs whenever the answer choice cannot be extracted via heuristic matching.
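The extraction fallback works roughly as sketched below; this is a minimal illustration, not the exact implementation behind the leaderboard. Heuristic matching is tried first, and only responses it cannot resolve are sent to GPT-3.5-Turbo-0613. The helper `llm_extract_choice` is a hypothetical placeholder for that API call.

```python
import re
from typing import List, Optional


def extract_choice_heuristic(prediction: str, choices: List[str]) -> Optional[str]:
    """Try to recover the option letter (A, B, C, ...) from a free-form VLM response."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]

    # Case 1: the response begins with a bare option letter, e.g. "B." or "(B) ...".
    m = re.match(r"\W*([A-Z])\b", prediction.strip())
    if m and m.group(1) in letters:
        return m.group(1)

    # Case 2: exactly one option's text appears verbatim in the response.
    hits = [letter for letter, option in zip(letters, choices)
            if option.lower() in prediction.lower()]
    if len(hits) == 1:
        return hits[0]

    return None  # ambiguous or no match


def llm_extract_choice(prediction: str, choices: List[str]) -> str:
    """Hypothetical placeholder: prompt GPT-3.5-Turbo-0613 to map the free-form
    response onto one of the option letters and return that letter."""
    raise NotImplementedError


def extract_choice(prediction: str, choices: List[str]) -> str:
    """Heuristic matching first; fall back to the LLM extractor only when needed."""
    choice = extract_choice_heuristic(prediction, choices)
    return choice if choice is not None else llm_extract_choice(prediction, choices)
```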
Model | ScienceQA-Image Val | ScienceQA-Image Test |
---|---|---|
InternLM-XComposer-VL | 88.0 | 89.8 |
Human Performance | N/A | 87.5 |
ShareCaptioner | 81.0 | 82.3 |
GPT-4V (detail: low) | 84.6 | 82.1 |
GeminiProVision | 80.1 | 81.4 |
LLaVA-InternLM2-20B (QLoRA) | 72.7 | 73.7 |
Monkey | 68.2 | 72.1 |
LLaVA-v1.5-13B | 69.2 | 72.0 |
TransCore-M | 68.8 | 71.2 |
LLaVA-v1.5-13B (QLoRA) | 68.9 | 70.3 |
mPLUG-Owl2 | 69.5 | 69.5 |
ShareGPT4V-7B | 68.1 | 69.4 |
LLaVA-v1.5-7B | 66.6 | 68.9 |
Qwen-VL-Chat | 65.5 | 68.8 |
LLaVA-v1.5-7B (QLoRA) | 68.8 | 68.7 |
LLaVA-InternLM-7B (QLoRA) | 65.3 | 68.4 |
EMU2-Chat | 65.3 | 68.2 |
CogVLM-17B-Chat | 65.6 | 66.2 |
PandaGPT-13B | 60.9 | 63.2 |
IDEFICS-80B-Instruct | 59.9 | 61.8 |
Qwen-VL | 57.7 | 61.1 |
LLaVA-v1-7B | 59.9 | 60.5 |
InstructBLIP-13B | 53.3 | 58.3 |
VisualGLM | 53.4 | 56.1 |
MiniGPT-4-v2 | 54.1 | 54.7 |
InstructBLIP-7B | 54.7 | 54.1 |
IDEFICS-9B-Instruct | 51.6 | 53.5 |
MiniGPT-4-v1-13B | 44.3 | 46.0 |
OpenFlamingo v2 | 45.7 | 44.8 |
MiniGPT-4-v1-7B | 39.0 | 39.6 |