Does it support causal mask for GPT2-esque models? #7
@miguelvictor, I looked at the causal self-attention from minGPT.
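For context, minGPT-style causal self-attention boils down to masking the attention scores with a lower-triangular matrix before the softmax. A minimal paraphrase (not the actual minGPT source):

```python
import torch

def causal_self_attention_scores(q, k, seq_len):
    # q, k: (batch, heads, seq_len, head_dim)
    att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
    # lower-triangular mask: position i may only attend to positions j <= i
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    att = att.masked_fill(~mask, float('-inf'))
    return torch.softmax(att, dim=-1)
```

Since `exp(-inf) == 0`, every entry above the diagonal becomes exactly zero after the softmax, which is the "causal" property being discussed.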
So I have thought about it a bit, and it turns out causal masking can be computed (more or less). If you apply a trick similar to the one described in the Performer paper (appendix B.1), you should be able to have causal masking: essentially, you replace the full sums in the linear-attention computation with prefix sums, so each position only aggregates over earlier positions. However, there's another issue: your landmark selection can't really be causal (at least in the current setup). So if you can figure out some landmark selection algorithm that doesn't depend on the input sequence, you might be able to obtain a causal mask. You could obtain such a thing by making the landmarks trainable parameters, for example. All of this is in search of a causal masking scheme that respects linear complexity.
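The prefix-sum trick can be sketched as follows. This is a hypothetical helper (not from either repo), and it assumes `q` and `k` have already been passed through a positive feature map as in Performer, so no softmax is needed:

```python
import torch

def causal_linear_attention(q, k, v):
    """Causal attention in O(n) via running prefix sums.
    q, k: (n, d_feat) positive feature maps; v: (n, d_v)."""
    n, d_v = v.shape
    out = torch.empty(n, d_v)
    s = torch.zeros(q.size(1), d_v)   # running sum of outer(k_j, v_j)
    z = torch.zeros(q.size(1))        # running sum of k_j (normalizer)
    for i in range(n):
        s = s + torch.outer(k[i], v[i])
        z = z + k[i]
        # position i only sees contributions from j <= i
        out[i] = (q[i] @ s) / (q[i] @ z)
    return out
```

Each output row equals the row of full linear attention restricted to the causal prefix, since the running sums `s` and `z` never include future positions.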
@thomasw21, the trick can be applied here for causal masking, but I am not sure it will work out. For the landmarks, we do not have to use the whole input sequence to compute them, but they do depend on the input sequence. I thought about computing the first k tokens with standard self-attention and then using those first k tokens to compute the landmarks; I am not sure whether a reasonable k will work out, and this needs experimental results to support it. Adding trainable parameters to learn landmark selection is a reasonable approach. While it only introduces an additional (linear) number of parameters, the landmarks (trainable parameters) would be fixed for the whole input dataset, which may lead to unfavorable performance; this also needs experimental verification. It will be interesting to see some results from applying the above schemes.
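The "landmarks from the first k tokens" idea can be sketched with segment means, as Nystromformer uses, but restricted to the prefix. `prefix_landmarks` is a hypothetical name, and this assumes k is divisible by the number of landmarks:

```python
import torch

def prefix_landmarks(x, k, num_landmarks):
    """Landmarks computed only from the first k tokens of x, so later
    positions never leak into landmark selection (a causality-friendly
    variant, not the repo's implementation).
    x: (n, d); requires k % num_landmarks == 0."""
    prefix = x[:k]                                        # (k, d)
    segs = prefix.reshape(num_landmarks, k // num_landmarks, -1)
    return segs.mean(dim=1)                               # (num_landmarks, d)
```

Because the output depends only on `x[:k]`, changing any token at position >= k leaves the landmarks unchanged, which is exactly the property the prefix scheme is after.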
I see two small issues here:
Yes, I'd agree here. It's unclear to me whether it would work. Yet if you look at Perceiver, for example, they do have these trainable parameters in order to linearise their transformer.
+1
@thomasw21, the above scheme has those two issues.
@miguelvictor, @thomasw21, one suitable way to apply Nystromformer to GPT-2 and seq2seq models is as follows. For seq2seq, assume the first sequence has length n_1 and the second has length n_2. The mask makes the (n_1 + n_2) x (n_1 + n_2) softmax matrix take the block form (S_1, 0; S_2, S_3), where S_1 \in R^{n_1 x n_1}, S_2 \in R^{n_2 x n_1}, and S_3 \in R^{n_2 x n_2} is lower triangular. To make Nystromformer work here, we apply it to compute S_1 using the first sequence, S_2 using both sequences, and the lower-triangular S_3 using the second sequence with a causal mask similar to the GPT application above. Note that we never need to form the full (n_1 + n_2) x (n_1 + n_2) softmax matrix and then multiply with V; we can use the same trick as in Nystromformer for the matrix multiplication to compute the output. In this way, it achieves a linear-cost self-attention computation.
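The (S_1, 0; S_2, S_3) block structure can be written down directly as a boolean mask. A sketch, with `seq2seq_block_mask` a hypothetical helper name (the real scheme would compute each block with Nystromformer rather than materializing the mask):

```python
import torch

def seq2seq_block_mask(n1, n2):
    """Boolean attention mask with the block structure (S_1, 0; S_2, S_3):
    full attention within the first sequence (S_1), full attention from
    the second sequence onto the first (S_2), and causal attention within
    the second sequence (S_3). True = may attend."""
    n = n1 + n2
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n1, :n1] = True                                   # S_1: full block
    mask[n1:, :n1] = True                                   # S_2: full block
    mask[n1:, n1:] = torch.tril(                            # S_3: causal block
        torch.ones(n2, n2, dtype=torch.bool))
    return mask
```

The top-right zero block keeps the first sequence from attending to the second, and the lower-triangular S_3 enforces causality within the second sequence, matching the description above.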
Hi Yunyang, thank you very much for this amazing work! I am still unclear about the causal mask implementation in Nystromformer. I think that even if you use causal masks when computing kernels 1, 2, and 3 in https://github.com/mlpen/Nystromformer/blob/main/LRA/code/attention_nystrom.py#L38, the causal condition is still violated when you compute the landmarks. For now, I don't have a good idea for designing a causal mask for Nystromformer. Please correct me if I am wrong. Thank you again!
For decoder-only models, how should the causal mask be applied?