Interpretation of attention score #1

Open

rhjdzuDL opened this issue Jan 30, 2023 · 1 comment

rhjdzuDL commented Jan 30, 2023

Great work! It is very interesting to handle ViT compression from the perspective of low-frequency components.

I am a little confused about the attention score (Eq. 8). The definition looks obscure and lacks the necessary explanation.
I have read the code (and know what it does), but I still cannot get the intuition. What exactly do these operations mean?

# attention scores
# attn: attention map of one block, shape [B, heads, N, N] in the usual [batch, head, query, key] layout; token 0 is the CLS token
cls_attn_weight = attn[:, :, 1:, 0].mean(-1).reshape(B, -1, 1)      # per-head weight: mean attention from image tokens to CLS -> [B, H, 1]
cls_attn = attn[:, :, 0, 1:].mul(cls_attn_weight).mean(1)           # CLS attention to each image token, head-weighted, averaged over heads -> [B, N-1]
img_attn_weight = attn[:, :, 0, 0].reshape(B, -1, 1)                # per-head weight: CLS-to-CLS attention -> [B, H, 1]
img_attn = attn[:, :, 1:, 1:].mean(2).mul(img_attn_weight).mean(1)  # mean attention each image token receives from image tokens, head-weighted, averaged over heads -> [B, N-1]
token_attn = cls_attn.add(img_attn)                                 # final per-token score -> [B, N-1]


By the way, there seem to be two mistakes in the paper compared with the code:

  • $\theta_{h,0}$ is not averaged by $N-1$
  • the two subscripts of $A^{l,h}$ seem to be reversed, considering the code
@Daner-Wang (Owner) commented:

Thank you for the comments.

We are sorry for missing the averaging term $\frac{1}{N-1}$ for $\theta_{h,0}$ in the manuscript. As for the subscripts, $A_{i,j}^{l,h}$ should be read as the attention value in the $i$-th column of the $j$-th row; we apologize for the confusion.
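Concretely (reading rows as queries and columns as keys, so that $A_{0,i}^{l,h}$ is the attention paid by image token $i$ to the CLS token; this restates the code rather than quoting the manuscript), the corrected head-weight is

$$\theta_{h,0} = \frac{1}{N-1}\sum_{i=1}^{N-1} A_{0,i}^{l,h},$$

which is what cls_attn_weight computes in the snippet above.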

The attention score in Eq. 8 combines the CLS attention with the other attention. The CLS token is the final output used for classification in ViT, so it is relatively more important than the other tokens. Thus, the CLS attention is separated from the attention matrix as one term, and the average of the other attention forms the second term of the attention score. Considering that a head with denser and larger values is more important, the CLS attention and the other attention of the different heads are weighted-averaged by head-weights. For the CLS attention, a head in which the correlation between the CLS token and the other tokens is higher should be more important, so its head-weight is computed as $\theta_{h,0}$. For the other tokens, a head with a large $A_{0,0}^{l,h}$, i.e. one whose CLS token in Q is highly correlated with that in K, has a more robust CLS token and is thus regarded as more important. Hence, $A_{0,0}^{l,h}$ is used as the head-weight for the other attention.
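To make the correspondence with the code explicit (with $H$ heads and $N$ tokens, token $0$ being the CLS token; this is a restatement of the code under the subscript convention above, not the exact notation of the manuscript), the score of image token $j$ at layer $l$ is

$$s_{j}^{l} = \frac{1}{H}\sum_{h=1}^{H} \theta_{h,0}\, A_{j,0}^{l,h} \;+\; \frac{1}{H}\sum_{h=1}^{H} A_{0,0}^{l,h}\,\frac{1}{N-1}\sum_{i=1}^{N-1} A_{j,i}^{l,h}, \qquad j = 1,\dots,N-1,$$

where the first term corresponds to cls_attn (head-weighted by $\theta_{h,0}$), the second term to img_attn (head-weighted by $A_{0,0}^{l,h}$), and token_attn is their sum.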
