This repository contains code and data for the paper "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?".
"Safetywashing" refers to the practice of misrepresenting capabilities improvements as safety advancements in AI systems. This project provides tools to evaluate AI models on various safety and capabilities benchmarks, and analyzes the correlations between these benchmarks. In our paper, we empirically investigate whether common AI safety benchmarks actually measure distinct safety properties or are primarily determined by upstream model capabilities.
- `analysis.py`: Main script for running correlation analyses
- `data/`: Contains benchmark datasets and model results
  - `benchmarks_base_models.csv`: Benchmark matrix for base language models
  - `benchmarks_chat_models.csv`: Benchmark matrix for chat/instruction-tuned models
  - `benchmarks_info.csv`: Metadata about benchmarks
To add a new benchmark, add a column to the appropriate benchmark matrix (`benchmarks_base_models.csv` or `benchmarks_chat_models.csv`), then add a corresponding row with its metadata to `benchmarks_info.csv`.
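With pandas, the update might look like the following sketch. The benchmark name, scores, and metadata columns here are hypothetical; check `benchmarks_info.csv` for the actual schema before adding a row.

```python
# Hypothetical example of adding a benchmark called "NewBench".
import pandas as pd

matrix = pd.read_csv("data/benchmarks_base_models.csv", index_col=0)
info = pd.read_csv("data/benchmarks_info.csv")

# Add one score column to the benchmark matrix (one value per model row).
matrix["NewBench"] = [0.42, 0.57, 0.61]  # placeholder scores

# Add a matching metadata row; these column names are assumptions.
new_row = pd.DataFrame([{"benchmark": "NewBench", "category": "safety"}])
info = pd.concat([info, new_row], ignore_index=True)

matrix.to_csv("data/benchmarks_base_models.csv")
info.to_csv("data/benchmarks_info.csv", index=False)
```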
If you use this code or data, please cite:

```bibtex
@misc{ren2024safetywashing,
      title={Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?},
      author={Richard Ren and Steven Basart and Adam Khoja and Alice Gatti and Long Phan and Xuwang Yin and Mantas Mazeika and Alexander Pan and Gabriel Mukobi and Ryan H. Kim and Stephen Fitz and Dan Hendrycks},
      year={2024},
      eprint={2407.21792},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.21792},
}
```