Attributor is a library for evaluating and exploring attention attribution in Transformer networks.
Clone this repo and navigate into it. Then, in your preferred virtual environment:
pip install -r requirements.txt
You have a burning question you want answered. You also want the answer to be grounded in data, so you collect a set of documents from sources you trust and think are relevant to the question. You don't have time to read them yourself, though, so you enlist a friend.
You ask your friend to read the documents and then write down a succinct answer to your question and mail the answer to you.
Eventually you get the time to read your friend's letter. With a healthy level of skepticism, you wonder which documents your friend based their answer on.
You decide to ask them but, unfortunately, your friend is a Transformer.
Attribution is the problem of assigning a cause to an output. In Transformers, attention attribution attempts to determine which input (prompt) tokens caused each output (generated) token. It does this by mechanistically examining the flow of information through the Transformer from input tokens to output tokens. For now, this repo focuses only on content-agnostic attribution: we are concerned with where information came from, not what that information represents.
Let's get specific. In a (sufficiently vanilla, decoder-only) Transformer, only two operations determine how information moves between token positions (a.k.a. residual streams): the attention blocks, which mix features across positions, and the residual connections, which carry each position's stream forward.
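For intuition, here is a schematic of one pre-norm decoder layer (a generic sketch, not Attributor's code): only the attention call combines features from different positions; the residual additions and the MLP act on each position's stream independently.

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Generic pre-norm decoder layer, annotated to show which ops mix positions."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, h: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        x = self.ln1(h)
        # Attention: the only operation that moves information across token positions.
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        h = h + attn_out               # residual: each stream carries itself forward
        h = h + self.mlp(self.ln2(h))  # MLP: applied to each position independently
        return h
```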
Consider a sequence of token positions, each carrying its own residual stream from the embedding layer up to the output.

In the current implementation we are only interested in how the information from each input position flows to each output position. Each layer simply mixes the previous layer's features according to its attention weights and residual connections (note: the post-MLP residual will get normalized out). Specifically, each layer's attention weights, together with an identity contribution from the residual connection, define a position-mixing matrix, and composing these matrices layer by layer tracks how much of each position's stream originated at each input position.

Finally, at the output layer, the composed mixing gives, for each generated token, a weighting over the input positions: its attribution.
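As a rough illustration of this kind of flow tracking, here is a minimal sketch in the spirit of attention rollout: average each layer's attention weights over heads, add an identity term for the residual connection, renormalize, and compose across layers. This is not necessarily the exact computation Attributor performs, and the prompt and model name are just examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM-135M-Instruct"  # any vanilla decoder-only model
tok = AutoTokenizer.from_pretrained(name)
# eager attention so per-layer attention weights are returned
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("The capital of France is", return_tensors="pt")
out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
seq_len = inputs.input_ids.shape[1]
flow = torch.eye(seq_len)
for layer_attn in out.attentions:
    a = layer_attn[0].mean(dim=0)               # average over heads
    mix = a + torch.eye(seq_len)                # identity term for the residual
    mix = mix / mix.sum(dim=-1, keepdim=True)   # renormalize rows
    flow = mix @ flow                           # compose with earlier layers

# flow[i, j]: estimated share of position i's stream that originated at position j.
print(flow[-1])  # attribution of the last position over the prompt tokens
```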
Above is a mechanistic interpretation of how information flows from input token positions to output token positions in Transformers. Obviously we are glossing over most of the transformations in a Transformer. Is information flow sufficient to attribute output tokens to input tokens in a human-interpretable way? To try to answer that question, we'll evaluate the above approach over document question-answering datasets.
Given a document question-answering dataset, Attributor can be used to perform data-driven evaluation of attention attribution in any sufficiently vanilla (GPT, Llama, etc.) decoder-only Huggingface Transformer.
See hotpot_qa.py for an implementation over the HotpotQA dataset.
python hotpot_qa.py --model your/favorite-hf-model
For example:
python hotpot_qa.py --model HuggingFaceTB/SmolLM-135M-Instruct --dtype float16 --device_map cuda
After running an evaluation script (e.g. hotpot_qa.py), you can explore the data via an interactive web UI with the following command:
python explorer.py
Then open a browser and navigate to localhost:3000. Open a file and hit Load to explore!
Also included is an interactive visualization of the attention flow through the network. Run the Python server:
cd interactive
uvicorn --reload server.app:app
Then run the UI:
cd webapp
npm start
Visit localhost:3000. Enter a Huggingface model ID (currently this must be a Llama-derived model), a device_map (e.g. 'cuda'), a precision (e.g. 'bfloat16'), and max tokens (set a high number; it is unused right now).