Great work! It is very interesting to handle ViT compression from the perspective of low-frequency components.
I feel a little confused about the attention score (Eq. 8). The definition looks obscure and lacks necessary explanation.
I have read the code (and know what it does), but still cannot get the intuition. So, what exactly do these operations mean?
We apologize for omitting the averaging term $\frac{1}{N-1}$ for $\theta_{h,0}$ in the manuscript. $A_{i,j}^{l,h}$ can be viewed as the attention value in the $i$-th column of the $j$-th row. We are sorry for the confusion.
The attention score in Eq. 8 combines the CLS attention with the attention of the other tokens. The CLS token is the final output used for classification in ViT, so it is relatively more important than the other tokens. Thus, the CLS attention is separated from the attention matrix as one term, and the average of the other attention forms a second term in the attention score. Since a head with denser and larger values is considered more important, the CLS attention and the other attention from different heads are combined as weighted averages using head-weights. For the CLS attention, a head in which the correlation between the CLS token and the other tokens is higher should be more important, so the head-weight for the CLS attention is computed as $\theta_{h,0}$. For the other tokens, a head with a large $A_{0,0}^{l,h}$, i.e. one in which the CLS token in Q is highly correlated with that in K, has a more robust CLS token and is thus regarded as more important. Hence, $\theta_{h,0}$ = $A_{0,0}^{l,h}$ is the head-weight for the other attention.
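To make the two terms concrete, here is a minimal pure-Python sketch of the intuition described above. It is not the code in `deit.py`. The function name `token_scores`, the `attn[h][q][k]` layout (rows are queries, token 0 is CLS), and the plain sum over heads without further normalization are all assumptions made for illustration; the exact form is the one given by Eq. 8 and the repository.

```python
def token_scores(attn):
    """Illustrative token score (hypothetical helper, not the repository's API).

    attn[h][q][k] is the softmax attention of head h from query token q
    to key token k. Token 0 is assumed to be the CLS token.
    Returns one score per non-CLS token.
    """
    num_tokens = len(attn[0])
    scores = [0.0] * (num_tokens - 1)
    for head in attn:
        cls_row = head[0]
        # Head weight for the CLS term: mean attention from CLS to the
        # other tokens (this is where the 1/(N-1) averaging appears).
        w_cls = sum(cls_row[1:]) / (num_tokens - 1)
        # Head weight for the other term: CLS-to-CLS attention.
        w_other = cls_row[0]
        for k in range(1, num_tokens):
            # CLS attention: how much the CLS query attends to token k.
            cls_attn = cls_row[k]
            # Other attention: average attention to token k over non-CLS queries.
            other_attn = sum(head[q][k] for q in range(1, num_tokens)) / (num_tokens - 1)
            # Assumption: heads are combined by a plain weighted sum.
            scores[k - 1] += w_cls * cls_attn + w_other * other_attn
    return scores


if __name__ == "__main__":
    one_head = [[
        [0.5, 0.3, 0.2],
        [0.2, 0.6, 0.2],
        [0.1, 0.1, 0.8],
    ]]
    print(token_scores(one_head))
```

In this toy input the single head has a CLS-term weight of 0.25 and an other-term weight of 0.5, so a token that receives more attention from both the CLS query and the remaining queries ends up with the higher score.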
VTC-LFC/models/deit/deit.py
Lines 41 to 46 in cfea7ee
By the way, there seem to be two mistakes in the paper compared with the code: