WildVision-Bench

  • Inspired by Alpaca-Eval, Arena-Hard, and WildBench, which evaluate models in the text-only setting, we propose WildVision-Bench to evaluate vision-language models in the wild with human preferences.
  • Specifically, we collect real-world QA pairs from the 🤗 WildVision-Arena collections and curate 500 high-quality examples for evaluation.
  • We use GPT-4o as the judge model and Claude-3-Sonnet as the baseline model. The judge compares each model's response with the baseline's response and decides which one is better.
  • WildVision-Bench achieves a 0.94 correlation with the WildVision-Arena Elo ranking, making it a good indicator of a model's performance in the wild.

Installation

WildVision-Bench is part of the WildVision-Arena environment; install the dependencies as follows:

conda create -n visionbench python=3.9
conda activate visionbench
pip install -e .

WVBench Image-Instruct Pairs

🤗 WildVision-Bench

You can get the image-instruct pairs of WVBench-500 to generate your model's answers by loading the bench data as below:

from datasets import load_dataset

wildbench_data = load_dataset('WildVision/wildvision-bench', name='vision_bench_0617', split='test')
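
As a quick sanity check, a minimal sketch (the field names depend on the dataset card, so inspect them rather than assuming a particular schema):

print(len(wildbench_data))        # expect 500 examples
print(wildbench_data[0].keys())   # inspect the available fields before writing your inference loop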

We provide two versions of the WildVision-Bench data:

  • vision_bench_0617: the 500 selected examples that best simulate the WildVision-Arena Elo ranking; this is the same data used in the paper.
  • vision_bench_0701: 500 examples further refined by NSFW filtering and manual selection. The leaderboard for this version is still in preparation.

Note: For now, if you want to evaluate your model, please use the vision_bench_0617 version so that your results are directly comparable with the other models on the leaderboard below. We are preparing the leaderboard for vision_bench_0701 and will update it soon.

Evaluation

1. Generate model answers

An example model answers file is shown in data/vision_bench_0617/example_model_answers.jsonl. You need to fill in the following fields:

  • output: the model's output
  • model: the model name
  • token_len: the number of tokens in the output, counted with tiktoken's gpt-3.5-turbo tokenizer (see the sketch below)

Then put the file in data/vision_bench_0617/model_answers/{model_name}.jsonl.
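
A minimal sketch of writing such a file, assuming your generations are already collected as (question_id, output) pairs; the question_id field name and any other required fields are best copied from example_model_answers.jsonl rather than from this sketch:

import json
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")    # tokenizer used for token_len
model_name = "my-vlm"                                      # hypothetical model name

with open(f"data/vision_bench_0617/model_answers/{model_name}.jsonl", "w") as f:
    for question_id, output in generations:                # `generations` holds your own inference results
        record = {
            "question_id": question_id,                    # assumed field; check example_model_answers.jsonl
            "model": model_name,
            "output": output,
            "token_len": len(encoding.encode(output)),
        }
        f.write(json.dumps(record) + "\n")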

We provide example inference scripts, gen_answers.py and run_vllm.py, for generating model answers. For example:

python run_vllm.py --tokenizer_mode "auto" --max_model_len 65536 --num_gpu 1 --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct"

2. Get judgements

First, open config/judge_config.yaml and add the models you want to evaluate to the model_list field. For example:

# Add your model below for evaluation
model_list:
  - liuhaotian/llava-v1.6-vicuna-7b
  - openbmb/MiniCPM-V
  - deepseek-ai/deepseek-vl-7b-chat
  - BAAI/Bunny-v1_0-3B
  - Salesforce/instructblip-vicuna-7b

Then, run the following command:

python get_judgement.py

Results will be saved in data/release_bench_0617/model_judgements/judge_gpt-4o_reference_claude-3-sonnet-20240229/{model_name}.jsonl

3. Show the results

python show_results.py

You will then see the results formatted like the leaderboard below.

Using lmms-eval to evaluate

lmms-eval is a Python package that integrates inference and evaluation for many multimodal LLMs, and WildVision-Bench is one of its supported benchmarks. To evaluate your model on WildVision-Bench with lmms-eval:

First, install lmms-eval:

pip install lmms-eval

Then, run the following command:

model_type=llava_hf
pretrained=llava-hf/llava-1.5-7b-hf
model_name=llava-1.5-7b-hf
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model $model_type \
    --model_args "pretrained=$pretrained" \
    --tasks wildvision_0617 \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $model_name \
    --output_path ./logs/

Then find the lmms-eval log directory for this evaluation run, and run the following commands to get the leaderboard:

python format_lmmseval_answers.py --lmmseval_log_dir {lmmseval_log_dir} --model_name {model_name}
python show_results.py

Leaderboard (vision_bench_0617)

| Model | Score | 95% CI | Win Rate | Reward | Much Better | Better | Tie | Worse | Much Worse | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4o | 89.15 | (-1.9, 1.5) | 80.6% | 56.4 | 255.0 | 148.0 | 14.0 | 72.0 | 11.0 | 142 |
| gpt-4-vision-preview | 79.78 | (-2.9, 2.2) | 71.8% | 39.4 | 182.0 | 177.0 | 22.0 | 91.0 | 28.0 | 138 |
| aria | 74.1 | (-2.9, 3.2) | 68.0% | 28.8 | 88.0 | 252.0 | 47.0 | 86.0 | 27.0 | 185 |
| Reka-Flash | 64.65 | (-2.6, 2.7) | 58.8% | 18.9 | 135.0 | 159.0 | 28.0 | 116.0 | 62.0 | 168 |
| claude-3-opus-20240229 | 62.03 | (-3.7, 2.8) | 53.0% | 13.5 | 103.0 | 162.0 | 48.0 | 141.0 | 46.0 | 105 |
| yi-vl-plus | 55.05 | (-3.4, 2.3) | 52.8% | 7.2 | 98.0 | 166.0 | 29.0 | 124.0 | 83.0 | 140 |
| liuhaotian/llava-v1.6-34b | 51.89 | (-3.4, 3.8) | 49.2% | 2.5 | 90.0 | 156.0 | 26.0 | 145.0 | 83.0 | 153 |
| claude-3-sonnet-20240229 | 50.0 | (0.0, 0.0) | 0.2% | 0.1 | 0.0 | 1.0 | 499.0 | 0.0 | 0.0 | 114 |
| claude-3-haiku-20240307 | 37.83 | (-2.6, 2.8) | 30.6% | -16.5 | 54.0 | 99.0 | 47.0 | 228.0 | 72.0 | 89 |
| gemini-pro-vision | 35.57 | (-3.0, 3.2) | 32.6% | -21.0 | 80.0 | 83.0 | 27.0 | 167.0 | 143.0 | 68 |
| liuhaotian/llava-v1.6-vicuna-13b | 33.87 | (-2.9, 3.3) | 33.8% | -21.4 | 62.0 | 107.0 | 25.0 | 167.0 | 139.0 | 136 |
| deepseek-ai/deepseek-vl-7b-chat | 33.61 | (-3.3, 3.0) | 35.6% | -21.2 | 59.0 | 119.0 | 17.0 | 161.0 | 144.0 | 116 |
| THUDM/cogvlm-chat-hf | 32.01 | (-2.2, 3.0) | 30.6% | -26.4 | 75.0 | 78.0 | 15.0 | 172.0 | 160.0 | 61 |
| liuhaotian/llava-v1.6-vicuna-7b | 26.41 | (-3.3, 3.1) | 27.0% | -31.4 | 45.0 | 90.0 | 36.0 | 164.0 | 165.0 | 130 |
| idefics2-8b-chatty | 23.96 | (-2.2, 2.4) | 26.4% | -35.8 | 44.0 | 88.0 | 19.0 | 164.0 | 185.0 | 135 |
| Qwen/Qwen-VL-Chat | 18.08 | (-1.9, 2.2) | 19.6% | -47.9 | 42.0 | 56.0 | 15.0 | 155.0 | 232.0 | 69 |
| llava-1.5-7b-hf | 15.5 | (-2.4, 2.4) | 18.0% | -47.8 | 28.0 | 62.0 | 25.0 | 174.0 | 211.0 | 185 |
| liuhaotian/llava-v1.5-13b | 14.43 | (-1.7, 1.6) | 16.8% | -52.5 | 28.0 | 56.0 | 19.0 | 157.0 | 240.0 | 91 |
| BAAI/Bunny-v1_0-3B | 12.98 | (-2.0, 2.1) | 16.6% | -54.4 | 23.0 | 60.0 | 10.0 | 164.0 | 243.0 | 72 |
| openbmb/MiniCPM-V | 11.95 | (-2.4, 2.1) | 13.6% | -57.5 | 25.0 | 43.0 | 16.0 | 164.0 | 252.0 | 86 |
| bczhou/tiny-llava-v1-hf | 8.3 | (-1.6, 1.2) | 11.0% | -66.2 | 16.0 | 39.0 | 15.0 | 127.0 | 303.0 | 72 |
| unum-cloud/uform-gen2-qwen-500m | 7.81 | (-1.3, 1.7) | 10.8% | -68.5 | 16.0 | 38.0 | 11.0 | 115.0 | 320.0 | 92 |
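
For reference, the Win Rate and Reward columns follow directly from the five judgement counts. The weights below are inferred from the table (they reproduce the gpt-4o row exactly); show_results.py remains the authoritative implementation, and the Score column is computed there by a separate procedure not shown here. A minimal sketch:

def win_rate_and_reward(much_better, better, tie, worse, much_worse):
    total = much_better + better + tie + worse + much_worse
    win_rate = (much_better + better) / total
    # assumed weights: much better = +1, better = +0.5, tie = 0, worse = -0.5, much worse = -1
    reward = (much_better + 0.5 * better - 0.5 * worse - much_worse) / total
    return win_rate, reward

print(win_rate_and_reward(255, 148, 14, 72, 11))   # gpt-4o row: (0.806, 0.564)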

Contributing to the leaderboard

If you want to contribute to the leaderboard, please follow the steps below:

  1. Fork this repository.
  2. Add your model's answers to the data/vision_bench_0617/model_answers/ directory.
  3. Add your model's judgements to the data/release_bench_0617/model_judgements/ directory.
  4. Run python show_results.py to generate the leaderboard.
  5. Copy the elo_leaderboard.md file and paste it into the above "Leaderboard" section.
  6. Create a pull request.

Acknowledgment

We thank LMSYS for their great work on https://chat.lmsys.org/. Our code base is adapted from https://github.com/lm-sys/arena-hard-auto.

Thanks to lmms-eval for integrating WildVision-Bench into their evaluation platform.

Citation

If you find this repository useful, please consider citing our paper and resources:

@article{lu2024wildvision,
  title={WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences},
  author={Lu, Yujie and Jiang, Dongfu and Chen, Wenhu and Wang, William Yang and Choi, Yejin and Lin, Bill Yuchen},
  publisher={NeurIPS},
  year={2024}
}
@misc{yujie2024wildvisionarena,
    title={WildVision Arena: Benchmarking Multimodal LLMs in the Wild},
    url={https://huggingface.co/spaces/WildVision/vision-arena/},
    author={Lu, Yujie and Jiang, Dongfu and Chen, Hui and Ma, Yingzi and Gu, Jing and Xiao, Chaowei and Chen, Wenhu and Wang, William and Choi, Yejin and Lin, Bill Yuchen},
    year={2024}
}
@misc{yujie2024wildvisionv2,
    title={WildVision Data and Model},
    url={https://huggingface.co/WildVision},
    author={Lu, Yujie* and Jiang, Dongfu* and Chen, Hui* and Fu, Xingyu and Ma, Yingzi and Gu, Jing and Saxon, Michael and Xiao, Chaowei and Chen, Wenhu and Choi, Yejin and Lin, Bill Yuchen and Eckstein, Miguel and Wang, William},
    year={2024}
}
