Causal Tracing

We perform causal tracing as described in the ROME paper (Meng et al., Locating and Editing Factual Associations in GPT).

Experiments

Gaussian Noise Subject Corruption

note: percent improvement := $\frac{p_{*,h_{i}^{l}}(token) - p_{*}(token)}{|p(token) - p_{*}(token)|}$.
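For concreteness, here is a minimal sketch of the two metrics, assuming `p_clean`, `p_corrupt`, and `p_patched` are the model's probabilities of the correct token under the clean run, the corrupted run, and the corrupted run with one state restored (names are ours, not the repo's):

```python
def indirect_effect(p_patched: float, p_corrupt: float) -> float:
    # IE = p_{*,h_i^l}(token) - p_*(token): probability recovered by restoring one state
    return p_patched - p_corrupt


def percent_improvement(p_patched: float, p_corrupt: float, p_clean: float) -> float:
    # the same recovery, normalized by the total clean-vs-corrupted probability gap
    return (p_patched - p_corrupt) / abs(p_clean - p_corrupt)
```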
Indirect Effect on 100 examples.
Indirect Effect (0 cutoff) on 100 examples.
Percent Improvement on 100 examples.
Percent Improvement (0 cutoff) on 100 examples.
Average Indirect Effect on 500 random examples. [compare this to the results in the ROME paper]
Average Indirect Effect on 1000 random examples.
Standard Deviation of Indirect Effect on 1000 random examples.
Average Percent Improvement on 1000 random examples.
Standard Deviation of Percent Improvement on 1000 random examples.
Indirect Effect on the 100 examples with the fewest tokens.
Indirect Effect on the 100 examples with the most tokens.
Average Indirect Effect on the 100 examples with the fewest tokens.
Average Indirect Effect on the 100 examples with the most tokens.

observations: I was curious how the indirect effect on particular examples varied between GPT-J and GPT2-XL, so I sampled the same 10 examples for both models. GPT-J seems to produce cleaner traces, higher magnitudes, and less often goes in the opposite direction (I wonder how much of this is a consequence of a better-tuned noise standard deviation). The average indirect effect aligns fairly well with the results in the ROME paper. The average percent improvement shows higher activation at the middle subject token and the first subsequent token than the plot of the average indirect effect does. Additionally, the standard deviation of the indirect effect seems to correlate with the average indirect effect, whereas the standard deviation of the percent improvement correlates less with the average percent improvement. As with GPT2-XL, the more tokens in the prompt, the less the early-site/late-site pattern from the ROME paper holds; the number of tokens also correlates with the magnitude of the indirect effect.

Gaussian Noise Subject Corruption [Testing Robustness of ROME Results]

Indirect Effect for ROME examples. [compare this to the results in the ROME paper]
Indirect Effect on 100 examples.
Indirect Effect (0 cutoff) on 100 examples.
Percent Improvement on 100 examples.
Percent Improvement (0 cutoff) on 100 examples.
Average Indirect Effect on 1000 examples. [compare this to the results in the ROME paper]
Standard Deviation of Indirect Effect on 1000 random examples.
Average Percent Improvement on 1000 random examples.
Standard Deviation of Percent Improvement on 1000 random examples.
Indirect Effect on the 100 examples with the fewest tokens.
Indirect Effect on the 100 examples with the most tokens.
Average Indirect Effect on the 100 examples with the fewest tokens.
Average Indirect Effect on the 100 examples with the most tokens.

observations: The results tend to match, although not exactly (the magnitudes don't always align perfectly). I think they are close enough that the differences can be attributed to randomness. We tested this by running the Zillow example 50 times (results here), and the difference in magnitude does appear to come from the variance in the noise. Another interesting phenomenon: the more tokens in the prompt, the less the early-site/late-site pattern from the ROME paper holds; the number of tokens also correlates with the magnitude of the indirect effect.

[note: the experiments below were run with two differences from the ROME implementation: (1) we sample fresh Gaussian noise for the subject when patching each site, and (2) we use 1 run instead of 10 per example]

Gaussian Noise Subject Corruption

We took 100 examples, and for each example we corrupt the subject by adding Gaussian noise. We then perform causal tracing, restoring each state with its non-corrupted counterpart. Here are the complete results of the indirect effect, $p_{*,h_{i}^{l}}(token) - p_{*}(token)$, on 100 examples. Here are the results of the average indirect effect across the 100 examples. Here is the standard deviation of the indirect effect across the 100 examples.
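As a rough illustration of the procedure (a sketch, not the repo's actual code), here is one trace step on GPT2-XL with HuggingFace transformers; the prompt, the subject token range `(s, e)`, `noise_std`, and the chosen (layer, position) site are all placeholder assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
tok = AutoTokenizer.from_pretrained("gpt2-xl")
model.eval()

prompt = "The Space Needle is located in the city of"
inputs = tok(prompt, return_tensors="pt")
s, e = 0, 3        # token positions covering the subject (assumed; check tokenization)
noise_std = 0.1    # std of the Gaussian noise added to the subject (assumed)

# 1) clean run: cache the hidden state at every layer
with torch.no_grad():
    clean_states = model(**inputs, output_hidden_states=True).hidden_states

# 2) corrupt the subject's input embeddings with Gaussian noise
embeds = model.get_input_embeddings()(inputs["input_ids"]).clone()
embeds[0, s:e] += noise_std * torch.randn_like(embeds[0, s:e])

# 3) re-run on the corrupted embeddings, restoring one site from the clean run
layer, pos = 20, e - 1  # e.g. the last subject token at a middle layer

def restore_clean(module, inp, out):
    hidden = out[0] if isinstance(out, tuple) else out
    hidden[0, pos] = clean_states[layer + 1][0, pos]  # +1 skips the embedding layer
    return out

handle = model.transformer.h[layer].register_forward_hook(restore_clean)
with torch.no_grad():
    patched = model(inputs_embeds=embeds)
handle.remove()
```

The indirect effect at that site is then the patched run's probability of the correct token minus the fully corrupted run's.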

observations: Some examples match the last-subject-token/late-site phenomenon pretty well. Some have high indirect effect at the second-to-last subject token. Some seem to have no consistency at all. The average indirect effect across the 100 examples seems to align with the results from the ROME paper (not exactly the same, but perhaps that's because we didn't use 1000 examples). The magnitude of the standard deviation seems to correlate with the early/late-site pattern (not sure what to make of that).

Gaussian Noise Subject Corruption (Patching Good State with Bad State)

We took 100 examples and corrupted the subject by adding Gaussian noise as in the previous experiment; then we saved the states. Instead of running the prompt with the corrupted subject through the model and patching with the non-corrupted states, we do the opposite: we run the vanilla prompt through the model and patch each state with its corresponding corrupted state. Here are the results of measuring the indirect effect, $p_{h_{i}^{l*}}(token) - p(token)$, across the 100 examples. Here are the results of the average indirect effect across the 100 examples. Here is the standard deviation of the indirect effect across the 100 examples.
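Under the same assumptions as the sketch above, the reversed direction just swaps which run supplies the baseline and which supplies the patch:

```python
# reverse of the earlier sketch: cache states from the *corrupted* run,
# then run the vanilla prompt and overwrite one site with its corrupted value
with torch.no_grad():
    corrupt_states = model(inputs_embeds=embeds, output_hidden_states=True).hidden_states

def patch_corrupt(module, inp, out):
    hidden = out[0] if isinstance(out, tuple) else out
    hidden[0, pos] = corrupt_states[layer + 1][0, pos]
    return out

handle = model.transformer.h[layer].register_forward_hook(patch_corrupt)
with torch.no_grad():
    patched = model(**inputs)  # clean prompt, one corrupted state patched in
handle.remove()
```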

observations: Like the previous experiment, some examples match the last-subject-token/late-site phenomenon pretty well, and some have high indirect effect at the second-to-last subject token. Unlike the last experiment, we don't have any examples where patching the states shows no consistency (I find this very interesting, and I cannot explain why). Another nuance of this experiment is that more of the subject tokens seem to play a role (rather than just the last or second-to-last token). Another peculiarity is that the indirect effect of the earlier tokens tends to go in the opposite direction, while the indirect effect of the later tokens goes in the expected direction. Lastly, the average indirect effect more consistently follows the results in the ROME paper, and the standard deviation much more consistently correlates with the average indirect effect.

Addendum on Above Two Experiments: Indirect Effect Wrong Direction

What I've observed in the above two experiments (especially when we patch the good state with the bad) is that the indirect effect sometimes goes in the opposite direction. Here is the percentage at which each site goes in the opposite direction for the first experiment. I'm not sure why the indirect effect would go in the opposite direction; however, there is a pattern to it: the indirect effect is less likely to go in the opposite direction at the last subject token/late site. This could be explained by the fact that these sites are critical in predicting the token, so patching these areas leads to improvement more often. Here are the same results for the second experiment. This also follows the last-subject-token/late-site pattern, but there are two additional nuances: the first subject token is much more prone to going in the opposite direction (I noted this in the observations of experiment two), and the very last site never goes in the opposite direction.

Varying Gaussian Noise Subject Corruption

We sought to find out the effect of varying the amount of Gaussian noise added to the subject. Here are the results of the indirect effect on 10 examples, where we vary the standard deviation over the following values: [0.000001, 0.001, 0.01, 0.1, 0.5, 1, 1.5, 2.5, 5, 10, 100, 1000000].
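A sketch of the sweep, reusing the variables from the first code sketch (the grid is the list above):

```python
# sweep the noise scale; everything else proceeds as in the first sketch
stds = [1e-6, 1e-3, 1e-2, 0.1, 0.5, 1, 1.5, 2.5, 5, 10, 100, 1e6]
base = model.get_input_embeddings()(inputs["input_ids"]).clone()
for std in stds:
    noisy = base.clone()
    noisy[0, s:e] += std * torch.randn_like(noisy[0, s:e])
    # ...trace with `noisy` as the corrupted input, as in the first sketch
```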

observations: What we see is what we expect: adding a little Gaussian noise has less effect on predicting the token than adding a large amount. In fact, when you add a very large amount of Gaussian noise, patching does not really make a difference.

Random Gaussian Embedding Subject Corruption

Another experiment we performed was to replace the subject with random Gaussian embeddings. Here are the results of the indirect effect on 100 examples. Here is the average of the indirect effect across the 100 examples. Here is the standard deviation at each site.
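One way this replacement might look, continuing the earlier sketch (the norm matching is our assumption, not necessarily what the repo does):

```python
# replace (rather than perturb) the subject embeddings with random Gaussian
# vectors, rescaled to the original embeddings' norms so magnitudes stay comparable
rand = torch.randn_like(base[0, s:e])
rand *= base[0, s:e].norm(dim=-1, keepdim=True) / rand.norm(dim=-1, keepdim=True)
replaced = base.clone()
replaced[0, s:e] = rand
```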

observations: The results show that the last subject token/late site phenomenon is still present, although it is more subtle.

Shuffling Embeddings of the Subject

The last non-prefix experiment we performed was to shuffle the subject embeddings. Here are the results of the indirect effect on 100 examples. Here is the average of the indirect effect across the 100 examples. Here is the standard deviation at each site.
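Continuing the same sketch, the shuffle just permutes the subject's own embedding vectors:

```python
# permute the subject's embedding vectors among its own positions
perm = torch.randperm(e - s)
shuffled = base.clone()
shuffled[0, s:e] = base[0, s:e][perm]
```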

observations: While the previous experiment still obeyed the phenomenon we've been seeing, this experiment finally breaks it. We see no consistency in either the average indirect effect or the standard deviation. It is probably the case that shuffling the subject embeddings is more confusing (adversarial, let's say) than simply replacing them with something random.

[deprioritized] Adding Non-Confusing Prefix to the Subject

Prefix with False Facts

We use the following non-confusing prefix: 'Beats Music is owned by Apple. Audible.com is owned by Amazon. Catalonia belongs to the continent of Europe.' And we use the following confusing prefix: 'Beats Music is owned by Microsoft. Audible.com is owned by Google. Catalonia belongs to the continent of America.' We then get internal states for the concatenation [non-confusing prefix; prompt] and for the concatenation [confusing prefix; prompt], and at each site we replace the state's value in the second with its value in the first. Here are the results of the indirect effect at each site for 25 examples. Here are the results of the average indirect effect across the 25 examples.
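A rough sketch of the prefix patching, assuming the two prefixes tokenize to the same length so that sites align position-by-position (reusing names from the first sketch):

```python
good_prefix = ("Beats Music is owned by Apple. Audible.com is owned by Amazon. "
               "Catalonia belongs to the continent of Europe. ")
bad_prefix = ("Beats Music is owned by Microsoft. Audible.com is owned by Google. "
              "Catalonia belongs to the continent of America. ")
good_inputs = tok(good_prefix + prompt, return_tensors="pt")
bad_inputs = tok(bad_prefix + prompt, return_tensors="pt")

# cache states from the non-confusing run
with torch.no_grad():
    good_states = model(**good_inputs, output_hidden_states=True).hidden_states

def patch_good(module, inp, out):
    hidden = out[0] if isinstance(out, tuple) else out
    hidden[0, pos] = good_states[layer + 1][0, pos]
    return out

# run the confusing concatenation, restoring one site from the non-confusing run
handle = model.transformer.h[layer].register_forward_hook(patch_good)
with torch.no_grad():
    patched = model(**bad_inputs)
handle.remove()
```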

observations: The first notable thing is high indirect effect at the facts. We also see something very interesting: the earlier a fact appears in the prefix, the less impact it has, and the later it appears, the more impact it has.

Addendum on Above Experiment: Position of Fact

One experiment we can try is to take 5 facts and use each one as an individual prefix before a prompt, observing the indirect effect of that fact. Then, to test the positional effect of prefixes, we can use all 5 facts together as one prefix: observe fact1, fact2, fact3, fact4, and fact5 individually, and then [fact1; fact2; fact3; fact4; fact5] stacked together. [I believe it is necessary to use the same 25 examples for this experiment, which I can't recall if I did, so this needs to be redone]

Prefix with False Facts (patch non-confusing with confusing)

We use the following non-confusing prefix: 'Beats Music is owned by Apple. Audible.com is owned by Amazon. Catalonia belongs to the continent of Europe.' And we use the following confusing prefix: 'Beats Music is owned by Microsoft. Audible.com is owned by Google. Catalonia belongs to the continent of America.' We then get internal states for the concatenation [non-confusing prefix; prompt] and for the concatenation [confusing prefix; prompt], and (unlike the last experiment) at each site we replace the state's value in the first with its value in the second. Here are the results of the indirect effect at each site for 25 examples. Here are the results of the average indirect effect across the 25 examples.
