Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question regarding RoPE implementation #10

Open
lxy1108 opened this issue Nov 8, 2024 · 7 comments
Open

Question regarding RoPE implementation #10

lxy1108 opened this issue Nov 8, 2024 · 7 comments

Comments

@lxy1108
Copy link

lxy1108 commented Nov 8, 2024

Thanks for sharing the code of this interesting work!

I noticed an inconsistency between the paper’s description of the RoPE and its implementation in the code. According to the paper, the relative position should be calculated based on the temporal differences between patches,
image

However, the code seems to be using flattened 2D indices instead when applying RoPE:

seq_id = torch.arange(n_tokens * n_vars)
seq_id = repeat(seq_id, 'n -> b h n', b=B, h=H)
queries, keys = self.qk_proj(queries, keys, query_id=seq_id, kv_id=seq_id)

Could you clarify the reasoning behind this discrepancy? Was this an intentional change, or might it affect the model’s performance?

@WenWeiTHU
Copy link
Collaborator

WenWeiTHU commented Nov 8, 2024

Update: changing "permutation-invariant between variables" to "permutation-equivariant between variables"
Thanks for your valuable issue, the current 2d RoPE implementation in this repo is flattened on 2D indices (which is mainly to be compatible with the implementation of Moirai's official repo).

After our discussion, we believe that it is reasonable to flatten out one-dimensional indices of RoPE to ensure the permutation-equivariant between variables. We are arranging relevant experiments, and we'll see how that affects the performance. Please stay tuned and thanks again for your insightful question.

@lxy1108
Copy link
Author

lxy1108 commented Nov 8, 2024

Thanks for your response! I am looking forward to seeing the relevant experiments :)

@lxy1108
Copy link
Author

lxy1108 commented Nov 8, 2024

Thanks for your valuable issue, the current 2d RoPE implementation in this repo is flattened on 2D indices (which is mainly to be compatible with the implementation of Moirai's official repo).

After our discussion, we believe that it is reasonable to flatten out one-dimensional indices of RoPE to ensure the permutation-invariance between variables. We are arranging relevant experiments, and we'll see how that affects the performance. Please stay tuned and thanks again for your insightful question.

Thank you again for the detailed response! Could you elaborate on how this flattened RoPE achieves permutation invariance between variables?

@WenWeiTHU
Copy link
Collaborator

WenWeiTHU commented Nov 12, 2024

Update: Sorry for the misunderstanding, we have changed the "permutation-invariant for variables" to "permutation-equivariant for variables" (For their difference, please refer to https://arxiv.org/abs/1703.06114). We will also revise this term in our paper.

This is a very interesting and important question!

Note that permutation equivariant between variables means shuffling the input order of variables should not affect anything other than the output order of variables.

For example, if we have $T$ patch tokens for $N$ variables each. Now, we are going to mark the temporal position of these $NT$ tokens. Instead of using 2D RoPE: $R_{NT, NT}$, we think it is more reasonable to use 1D $R_{T, T}$ and repeat it by using the Kronecker Product: $I_{N\times N} \bigotimes R_{T, T}$. And thus $N$ will not influence the element values in varying sub-matrice (working for different variable pairs).

@WenWeiTHU
Copy link
Collaborator

WenWeiTHU commented Nov 12, 2024

Further, to mark the variable position of these $NT$ tokens, what we do is only distinguish whether the token belongs to the variable or not. More concisely, if token A belongs to variable 1 and does not belong to variable 2 and variable 3, we only care that V1 is different from V2/V3, but we are not concerned that V2 is different from V3 (if we do so, change the input order of V2 and V3 will cause additional effect to the output). Therefore, for any token, confirming its endogenous and exogenous variables is enough. That is why we use the term "keep the equivariance of variables" by $u$ and $v$ in our work.

@lxy1108
Copy link
Author

lxy1108 commented Nov 20, 2024

Thank you for your response! I agree that using 1D RoPE and repeating it is a more reasonable approach compared to the flattened 2D RoPE.

@Leopold2333
Copy link

Update: Sorry for the misunderstanding, we have changed the "permutation-invariant for variables" to "permutation-equivariant for variables" (For their difference, please refer to https://arxiv.org/abs/1703.06114). We will also revise this term in our paper.

This is a very interesting and important question!

Note that permutation equivariant between variables means shuffling the input order of variables should not affect anything other than the output order of variables.

For example, if we have T patch tokens for N variables each. Now, we are going to mark the temporal position of these N T tokens. Instead of using 2D RoPE: R N T , N T , we think it is more reasonable to use 1D R T , T and repeat it by using the Kronecker Product: I N × N ⨂ R T , T . And thus N will not influence the element values in varying sub-matrice (working for different variable pairs).

I got a confusion about the size of RoPE matrix. The code implementation uses an $NT$-length 1D RoPE matrix.

First I wanna figure one thing out: For a token $x_{m,i}$, should the tokens interacting with it be $x_{n,j} (\forall n\ne m, j\le i)$ ?

If so, the angle for $x_{m,i}$ will be $(mT+i)\Theta$, while the angles for $x_{n,j}$ will be $(nT+j)\Theta$. I'm not quite sure whether it's necessary to ensure that the tokens of any variable with the same patch-index should maintain the same angle, so that even if m≠n, the tokens at indices i and j will still preserve the same relative positional difference of (i-j)Θ? This also seems to achieve permutation invariance between variables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants