Skip to content

Measuring correlations between safety benchmarks and general AI capabilities benchmarks.

License

Notifications You must be signed in to change notification settings

centerforaisafety/safetywashing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

arXiv

This repository contains code and data for the paper "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?".

"Safetywashing" refers to the practice of misrepresenting capabilities improvements as safety advancements in AI systems. This project provides tools to evaluate AI models on various safety and capabilities benchmarks, and analyzes the correlations between these benchmarks. In our paper, we empirically investigate whether common AI safety benchmarks actually measure distinct safety properties or are primarily determined by upstream model capabilities.

Repository Structure

  • analysis.py: Main script for running correlation analyses
  • data/: Contains benchmark datasets and model results
    • benchmarks_base_models.csv: Benchmark matrix for base language models
    • benchmarks_chat_models.csv: benchmark matrix for chat/instruction-tuned models
    • benchmarks_info.csv: Metadata about benchmarks

To add a new benchmark, add a column in the benchmark matrix, and then add a row in benchmarks_info to add metadata.

Citation

@misc{ren2024safetywashing,
      title={Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?}, 
      author={Richard Ren and Steven Basart and Adam Khoja and Alice Gatti and Long Phan and Xuwang Yin and Mantas Mazeika and Alexander Pan and Gabriel Mukobi and Ryan H. Kim and Stephen Fitz and Dan Hendrycks},
      year={2024},
      eprint={2407.21792},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.21792}, 
}

About

Measuring correlations between safety benchmarks and general AI capabilities benchmarks.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Languages