This repository contains the implementation of several experiments on predicting sensitive concepts (protected attributes such as ethnicity or gender) in language models, with the goal of enhancing the models' interpretability. It includes the code to reproduce the following papers:
- Sarah Schröder, Alexander Schulz and Barbara Hammer. "Evaluating Concept Discovery Methods for Sensitive Attributes in Language Models". Accepted at ESANN 2025.
TODO
- BIOS
- TwitterAAE
- Jigsaw Unintended Bias
- CrowS-Pairs
- Huggingface Models (using this Wrapper)
- OpenAI Embedding Models
- Concept Activation Vectors (CAV)
- Concept Bottleneck Models (CBM)
- Bias Subspaces (referring to semantic bias scores [1][2]; our implementation is based on [1])
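As a quick illustration of the first method above, here is a minimal sketch of computing a Concept Activation Vector (CAV): a linear classifier is trained to separate activations of concept examples from random examples, and its normalized weight vector is the CAV. This uses synthetic numpy arrays as stand-in activations and scikit-learn's logistic regression; the repository's actual pipeline and interfaces may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16

# Synthetic "activations": concept examples are shifted along a hidden
# direction, random examples are plain Gaussian noise. In practice these
# would be hidden states extracted from a language model.
hidden_direction = rng.normal(size=dim)
hidden_direction /= np.linalg.norm(hidden_direction)
concept_acts = rng.normal(size=(100, dim)) + 2.0 * hidden_direction
random_acts = rng.normal(size=(100, dim))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)

# Train a linear probe; the normalized weight vector is the CAV.
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# On this synthetic data, the CAV should align with the hidden direction.
alignment = abs(float(cav @ hidden_direction))
print(alignment > 0.8)
```

Sensitivity of a model output to the concept can then be estimated by projecting activations (or their gradients) onto `cav`.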
TODO refer to branch/ other readme
TODO
[1] "The SAME score: Improved cosine based bias score for word embeddings" (arXiv preprint; IEEE IJCNN paper)
[2] "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (arXiv preprint; NIPS paper)