This repository contains initial code for the paper Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent, which appeared at the Mechanistic Interpretability Workshop at ICML 2024. The main codebase is currently in a "research code" state, and we will do our best to share it if there is enough interest.
Meanwhile, feel free to play around with these:
Attention outputs dashboard Colab
Attention weights interactive notebook
The data for the Attention weights notebook can be found here.
Please cite the paper using the BibTeX entry below:
@article{jucys2024vptmi,
title={Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent},
author={Jucys, Karolis and Adamopoulos, George and Hamidi, Mehrab and Milani, Stephanie and Samsami, Mohammad Reza and Zholus, Artem and Joseph, Sonia and Richards, Blake and Rish, Irina and {\c{S}}im{\c{s}}ek, {\"O}zg{\"u}r},
journal={arXiv preprint arXiv:2407.12161},
url={https://arxiv.org/abs/2407.12161},
year={2024}
}