Interpretation of attention score #1

Open

rhjdzuDL opened this issue Jan 30, 2023 · 1 comment

rhjdzuDL commented Jan 30, 2023

Great work! It is very interesting to handle ViT compression from the perspective of low-frequency components.

I am a little confused about the attention score (Eq. 8). The definition looks obscure and lacks the necessary explanation.
I have read the code (and know what it does), but I still cannot get the intuition. What exactly do these operations mean?

# attention scores
# attn: attention map of one block, shape [B, heads, N, N] in the usual [batch, head, query, key] layout; token 0 is the CLS token
cls_attn_weight = attn[:, :, 1:, 0].mean(-1).reshape(B, -1, 1)      # per-head weight: mean attention from image tokens to CLS -> [B, H, 1]
cls_attn = attn[:, :, 0, 1:].mul(cls_attn_weight).mean(1)           # CLS attention to each image token, head-weighted, averaged over heads -> [B, N-1]
img_attn_weight = attn[:, :, 0, 0].reshape(B, -1, 1)                # per-head weight: CLS-to-CLS attention -> [B, H, 1]
img_attn = attn[:, :, 1:, 1:].mean(2).mul(img_attn_weight).mean(1)  # mean attention each image token receives from image tokens, head-weighted, averaged over heads -> [B, N-1]
token_attn = cls_attn.add(img_attn)                                 # final per-token score -> [B, N-1]


By the way, there seem to be two mistakes in the paper compared with the code:

  • $\theta_{h,0}$ is not averaged by $N-1$
  • the two subscripts of $A^{l,h}$ seem to be reversed, considering the code
@Daner-Wang (Owner) commented:

Thank you for the comments.

We are sorry for missing the averaging term $\frac{1}{N-1}$ for $\theta_{h,0}$ in the manuscript. As for the subscripts, $A_{i,j}^{l,h}$ should be read as the attention value in the $i$-th column of the $j$-th row; we apologize for the confusion.
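Concretely (reading rows as queries and columns as keys, so that $A_{0,i}^{l,h}$ is the attention paid by image token $i$ to the CLS token; this restates the code rather than quoting the manuscript), the corrected head-weight is

$$\theta_{h,0} = \frac{1}{N-1}\sum_{i=1}^{N-1} A_{0,i}^{l,h},$$

which is what cls_attn_weight computes in the snippet above.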

The attention score in Eq. 8 combines the CLS attention with the other attention. The CLS token is the final output used for classification in ViT, so it is relatively more important than the other tokens. Thus, the CLS attention is separated from the attention matrix as one term, and the average of the other attention forms the second term of the attention score. Considering that a head with denser and larger values is more important, the CLS attention and the other attention of the different heads are weighted-averaged by head-weights. For the CLS attention, a head in which the correlation between the CLS token and the other tokens is higher should be more important, so its head-weight is computed as $\theta_{h,0}$. For the other tokens, a head with a large $A_{0,0}^{l,h}$, i.e. one whose CLS token in Q is highly correlated with that in K, has a more robust CLS token and is thus regarded as more important. Hence, $A_{0,0}^{l,h}$ is used as the head-weight for the other attention.
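To make the correspondence with the code explicit (with $H$ heads and $N$ tokens, token $0$ being the CLS token; this is a restatement of the code under the subscript convention above, not the exact notation of the manuscript), the score of image token $j$ at layer $l$ is

$$s_{j}^{l} = \frac{1}{H}\sum_{h=1}^{H} \theta_{h,0}\, A_{j,0}^{l,h} \;+\; \frac{1}{H}\sum_{h=1}^{H} A_{0,0}^{l,h}\,\frac{1}{N-1}\sum_{i=1}^{N-1} A_{j,i}^{l,h}, \qquad j = 1,\dots,N-1,$$

where the first term corresponds to cls_attn (head-weighted by $\theta_{h,0}$), the second term to img_attn (head-weighted by $A_{0,0}^{l,h}$), and token_attn is their sum.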
