Refusal in LLMs

This project investigates how adversarial training (AT) and latent adversarial training (LAT) affect Llama 2 7B's ability to refuse harmful requests after a refusal-direction ablation attack. We also examine how the model's internal representations change after AT and LAT fine-tuning.
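For readers unfamiliar with the attack, here is a minimal sketch of directional ablation on toy vectors. It assumes the commonly used formulation in which the refusal direction is the normalized difference between mean activations on harmful and harmless prompts; this README does not confirm that the project estimates the direction this way. All function names and the toy data are hypothetical and are not taken from this repository.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless).

    Assumption: difference-of-means estimator; the project may use another.
    """
    diff = [h - b for h, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy 3-dimensional "activations"; real residual-stream vectors are far larger.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], r_hat)
```

After ablation, the vector has no component along `r_hat`, so any downstream computation that reads refusal from that direction no longer sees it. In a real attack the same projection would be applied to model activations at inference time rather than to hand-written lists.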

Repository contents:

data_store
datasets
experiments
.gitignore
README.md

nlpet/apart-lab-refusal-lat-project

Folders and files

Latest commit

History

Repository files navigation

Refusal in LLMs

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages