Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

by Bang An*, Sicheng Zhu* , Ruiyi Zhang , Michael-Andrei Panaitescu-Liess , Yuancheng Xu , Furong Huang

About

Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs.

Dataset

Please find the PHTest dataset on Hugging Face

Code

PHTest is generated based on a controllable text-generation technique called AutoDAN. Our method offers a tool for automatic model-targeted false refusal red-teaming.

Code is coming soon!

Citing

If you find our work helpful, please cite it with:

@inproceedings{
an2024automatic,
title={Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models},
author={Bang An and Sicheng Zhu and Ruiyi Zhang and Michael-Andrei Panaitescu-Liess and Yuancheng Xu and Furong Huang},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=ljFgX6A8NL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
README.md		README.md
php_examples.png		php_examples.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

About

Dataset

Code

Citing

About

Releases

Packages

umd-huang-lab/FalseRefusal

Folders and files

Latest commit

History

Repository files navigation

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

About

Dataset

Code

Citing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages