This project investigates how adversarial training (AT) and latent adversarial training (LAT) affect Llama 2 7B's ability to refuse harmful requests after a refusal direction ablation attack. We also examine how the model's internal representations change after AT and LAT fine-tuning.
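To make the attack concrete, here is a minimal sketch of directional ablation on toy vectors. It assumes the refusal direction is estimated as a difference of mean activations between harmful and harmless prompts, which is a common approach but is not confirmed by this repository; the function names `difference_of_means` and `ablate` and the toy activations are hypothetical and are not taken from the project's code.

```python
import math

def difference_of_means(harmful, harmless):
    """Estimate a unit-norm direction as mean(harmful) - mean(harmless).

    Assumption: this is one common way to estimate a refusal direction;
    the project may compute it differently.
    """
    dim = len(harmful[0])
    mean_harmful = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    mean_harmless = [sum(v[i] for v in harmless) / len(harmless) for i in range(dim)]
    diff = [a - b for a, b in zip(mean_harmful, mean_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations (made up for illustration, not real model data).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

refusal_dir = difference_of_means(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], refusal_dir)
print(refusal_dir, ablated)
```

In a real model the same projection would be applied to residual stream activations at each layer and token position; the sketch only shows the arithmetic of removing one direction from one vector.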
nlpet/apart-lab-refusal-lat-project
About
Apart Lab Refusal Attacks Project