Refusal in LLMs

This project investigates how adversarial training (AT) and latent adversarial training (LAT) affect Llama 2 7B's ability to refuse harmful requests after a refusal-direction ablation attack. We also examine how the model's internal representations change after AT and LAT fine-tuning.
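For readers unfamiliar with the attack, here is a minimal sketch of refusal-direction ablation as it is commonly described: estimate a direction as the difference between mean activations on harmful and harmless prompts, then project that direction out of an activation. This is an illustration under stated assumptions, not this project's implementation. The function names and the toy 3-dimensional vectors are hypothetical. In practice the activations would come from the model's residual stream.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between harmful and harmless activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to extract")
    return [d / norm for d in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical values, chosen only to keep the arithmetic simple).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], direction)
```

After ablation the activation has no component along the estimated direction, which is the property the attack relies on to suppress refusals.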