Mechanistic Interpretability Projects

This repository houses research projects in mechanistic interpretability: reverse engineering neural networks.


The goal of this notebook is to explore the phenomenon of bracket closing in the GPT-Neo 125M model, whereby it correctly matches opening brackets ( [ { < with their corresponding closing brackets ) ] } >.

This is Problem 2.13 in Neel Nanda's 200 Concrete Open Problems in Mechanistic Interpretability. The first goal is to figure out how the model determines whether an opening or closing bracket is more appropriate, and the second is to figure out how it knows the correct kind: (, [, { or <.
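As a starting point, here is a minimal sketch of how one might probe this behaviour with the Hugging Face transformers library. It is not the notebook's actual code: the prompt and the candidate bracket tokens below are illustrative assumptions.

```python
# Minimal sketch: compare GPT-Neo 125M's next-token logits for different
# bracket tokens after a prompt with unclosed brackets. The prompt and the
# candidate tokens are illustrative assumptions, not the notebook's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "def f(x): return (x + [1, 2"  # unclosed ( and [; should prefer ]
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

# Compare the model's preference among closing brackets (plus one opener).
for tok in ["]", ")", "}", ">", "("]:
    tok_id = tokenizer.encode(tok)[0]
    print(f"{tok!r}: logit = {logits[tok_id].item():.2f}")
```

Inspecting these logits for a range of prompts gives a quick behavioural baseline before moving on to mechanistic analysis of how the model tracks which brackets are open.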

The repository also contains assorted exploratory notebooks for experimenting with models.