Haoqin Tu*, Chenhang Cui*, Zijun Wang*, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie (*equal contribution)
Our paper is online now: https://arxiv.org/abs/2311.16101
For different VLLMs, please refer to their specific environments for installation.
- LLaVA: https://github.com/haotian-liu/LLaVA
- MiniGPT4: https://github.com/Vision-CAIR/MiniGPT-4
- InstructBLIP: https://huggingface.co/Salesforce/instructblip-vicuna-7b
- LLaMA-Adapter: https://github.com/OpenGVLab/LLaMA-Adapter
- mPLUG-Owl1&2: https://github.com/X-PLUG/mPLUG-Owl
- PandaGPT: https://github.com/yxuansu/PandaGPT
- Qwen-VL-Chat: https://huggingface.co/Qwen/Qwen-VL-Chat
- CogVLM: https://github.com/THUDM/CogVLM
- InternLM-Xcomposer: https://huggingface.co/internlm/internlm-xcomposer-7b
- Fuyu: https://huggingface.co/adept/fuyu-8b
We host our datasets here, containing both OOD and redteaming attack datasets. The full dataset should look like this:
.
├── ./safety_evaluation_benchmark_datasets//
    ├── gpt4v_challenging_set # Contains the challenging test data for GPT4V
        ├── attack_images
        ├── sketchy_images
        ├── oodcv_images
        ├── misleading-attack.json
        ├── sketchy-vqa-challenging.json
        └── oodcv-vqa-counterfactual.json
    ├── redteaming # Contains the test data for redteaming tasks
        ├── misleading_attack
            ├── gaussian_noise
            ├── mixattack_eps32
            ├── mixattack_eps64
            ├── sinattack_eps64_dog
            ├── sinattack_eps64_coconut
            ├── sinattack_eps64_spaceship
            └── annotation.json
        ├── jailbreak_vit # adversarial images for jailbreaking VLLM through ViT
        └── jailbreak_llm # adversarial suffixes for jailbreaking VLLM through LLM
    └── ood # Contains the test data for OOD scenarios
        ├── sketchy-vqa
            ├── sketchy-vqa.json
            ├── sketchy-challenging.json
        └── oodcv-vqa
            ├── oodcv-vqa.json
            └── oodcv-counterfactual.json
For ood/oodcv-vqa
.
For ood/sketchy-vqa/skechydata/
, then unzip it.
For the proposed misleading attack, the full datasets and all trained adversarial examples are in redteaming/misleading_attack, including images with Gaussian noise, Sin.Attack, and MixAttack with two perturbation budgets.
For jailbreaking methods, please refer to their respective repositories for more dataset details: Jailbreak through ViT, Jailbreak through LLM.
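After downloading, the layout listed above can be sanity-checked with a short script. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the relative paths reflect one reading of the nesting in the directory listing (for example, that `annotation.json` sits inside `redteaming/misleading_attack`), so adjust them if your copy differs.

```python
import sys
from pathlib import Path

# Relative paths read off the directory listing in this README.
# The nesting is an assumption; edit this list if your download differs.
EXPECTED_PATHS = [
    "gpt4v_challenging_set/attack_images",
    "gpt4v_challenging_set/sketchy_images",
    "gpt4v_challenging_set/oodcv_images",
    "gpt4v_challenging_set/misleading-attack.json",
    "gpt4v_challenging_set/sketchy-vqa-challenging.json",
    "gpt4v_challenging_set/oodcv-vqa-counterfactual.json",
    "redteaming/misleading_attack/gaussian_noise",
    "redteaming/misleading_attack/mixattack_eps32",
    "redteaming/misleading_attack/mixattack_eps64",
    "redteaming/misleading_attack/sinattack_eps64_dog",
    "redteaming/misleading_attack/sinattack_eps64_coconut",
    "redteaming/misleading_attack/sinattack_eps64_spaceship",
    "redteaming/misleading_attack/annotation.json",
    "redteaming/jailbreak_vit",
    "redteaming/jailbreak_llm",
    "ood/sketchy-vqa/sketchy-vqa.json",
    "ood/sketchy-vqa/sketchy-challenging.json",
    "ood/oodcv-vqa/oodcv-vqa.json",
    "ood/oodcv-vqa/oodcv-counterfactual.json",
]


def missing_paths(data_dir):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    # Pass the dataset root (the folder you use as DATA_DIR) as the first argument.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel in missing_paths(data_dir):
        print("missing:", rel)
```

An empty output means every listed file and folder was found under the given root.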
Before you start, make sure you have set CACHE_DIR (where you store all your model weights) and DATA_DIR (where you store the benchmark data) in baselines/config.json according to your local environment.
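A quick check of the two settings before launching a long run can save time. The sketch below assumes that baselines/config.json is a flat JSON object with string-valued `CACHE_DIR` and `DATA_DIR` keys; the README does not spell out the file's schema, and the helper name `check_config` is hypothetical.

```python
import json
from pathlib import Path

# Keys named in this README; the flat-object layout is an assumption.
REQUIRED_KEYS = ("CACHE_DIR", "DATA_DIR")


def check_config(config_path):
    """Return a list of human-readable problems found in the config file."""
    path = Path(config_path)
    if not path.is_file():
        return [f"config file not found: {path}"]
    with path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key} is missing or empty")
        elif not Path(value).expanduser().is_dir():
            problems.append(f"{key} does not point to an existing directory: {value}")
    return problems


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    for problem in check_config("baselines/config.json"):
        print("config problem:", problem)
```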
cd baselines
python ../model_testing_zoo.py --model_name LLaVA
Choose --model_name from ["LlamaAdapterV2", "MiniGPT4", "MiniGPT4v2", "LLaVA", "mPLUGOwl", "mPLUGOwl2", "PandaGPT", "InstructBLIP2", "Flamingo", "LLaVAv1.5", "LLaVAv1.5-13B", "LLaVA_llama2-13B", "MiniGPT4_llama2", "Qwen-VL-Chat", "MiniGPT4_13B", "InstructBLIP2-FlanT5-xl", "InstructBLIP2-FlanT5-xxl", "InstructBLIP2-13B", "CogVLM", "Fuyu", "InternLM"].
For OODCV-VQA:
cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_oodcv
For the counterfactual version:
cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_oodcv_cf
For Sketchy-VQA:
cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_sketch
For the challenging version:
cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_sketch_challenging
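To run all four OOD evaluations for one model, the commands above can be generated in a loop. This is a minimal sketch under the assumption that each flag is passed in a separate invocation, exactly as in the commands shown; the helper name `ood_commands` is hypothetical and not part of the repository.

```python
import shlex

# Script path and flags copied from the commands in this README.
EVAL_SCRIPT = "../safety_evaluations/ood_scenarios/evaluation.py"
OOD_FLAGS = (
    "--eval_oodcv",
    "--eval_oodcv_cf",
    "--eval_sketch",
    "--eval_sketch_challenging",
)


def ood_commands(model_name):
    """Build one argv list per OOD evaluation flag for the given model."""
    return [
        ["python", EVAL_SCRIPT, "--model_name", model_name, flag]
        for flag in OOD_FLAGS
    ]


if __name__ == "__main__":
    # Print the commands; nothing is executed here.
    for argv in ood_commands("LLaVA"):
        print(shlex.join(argv))
```

To execute rather than print, each list can be passed to `subprocess.run(argv, cwd="baselines", check=True)`, since the relative script path assumes the working directory is baselines.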
For training the misleading adversarial images:
cd safety_evaluations/redteaming/misleading_vision_attack
python misleading_vis_attack.py --lr 1e-3 --misleading_obj dog --input_folder path/to/attack-bard/NIPS2017 --output_folder ./misleading_adversarial_attack
Change --input_folder to the path of the adversarial examples you want to test. To use MixAttack, add the --mix_obj argument to the command.
For testing the VLLMs:
cd baselines
python ../safety_evaluations/redteaming/misleading_vision_attack/test_misleading.py --image_folder redteaming/misleading_attack/mixattack_eps64 --output_name misleading_attack_eps64 --human_annot_path redteaming/misleading_attack/annotation.json
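The same test can be repeated for every attack folder shipped in redteaming/misleading_attack. The sketch below only builds the command lines; the helper name `misleading_test_commands` and the `misleading_<folder>` naming scheme for `--output_name` are assumptions made for illustration, so substitute whatever output names you prefer.

```python
# Folder names taken from the dataset listing in this README.
ATTACK_FOLDERS = (
    "gaussian_noise",
    "mixattack_eps32",
    "mixattack_eps64",
    "sinattack_eps64_dog",
    "sinattack_eps64_coconut",
    "sinattack_eps64_spaceship",
)
TEST_SCRIPT = "../safety_evaluations/redteaming/misleading_vision_attack/test_misleading.py"
ANNOTATION = "redteaming/misleading_attack/annotation.json"


def misleading_test_commands(folders=ATTACK_FOLDERS):
    """Build one argv list per attack folder; output names are hypothetical."""
    commands = []
    for folder in folders:
        commands.append([
            "python", TEST_SCRIPT,
            "--image_folder", f"redteaming/misleading_attack/{folder}",
            "--output_name", f"misleading_{folder}",  # assumed naming scheme
            "--human_annot_path", ANNOTATION,
        ])
    return commands


if __name__ == "__main__":
    for argv in misleading_test_commands():
        print(" ".join(argv))
```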
Please refer to these two repositories for detailed attack settings: Jailbreak through ViT, Jailbreak through LLM. We provide our trained adversarial images and suffixes for jailbreaking ViTs and LLMs in redteaming/jailbreak_vit and redteaming/jailbreak_llm in the data folder.
The data, code, and checkpoint are intended and licensed for research use only. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
If you find our work useful for your research and applications, please consider citing the paper and starring the repo :)
@article{tu2023how,
title={How Many Unicorns Are In This Image? A Safety Evaluation Benchmark For Vision LLMs},
author={Tu, Haoqin and Cui, Chenhang and Wang, Zijun and Zhou, Yiyang and Zhao, Bingchen and Han, Junlin and Zhou, Wangchunshu and Yao, Huaxiu and Xie, Cihang},
journal={arXiv preprint arXiv:2311.16101},
year={2023}
}
This work is partially supported by a gift from Open Philanthropy. We thank the Center for AI Safety and Google Cloud for supporting our computing needs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.