attention masks in each layer #5
Comments
Hi! Regarding your mention of the attention mask: like an RNN, Mamba does not require an explicit attention mask. Causality is enforced by its causal 1D convolution and its left-to-right recurrent scan, which ensure that the current token can only see previous tokens. This design is consistent with the goal of "next-token prediction." For more details, please refer to the original Mamba paper.
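As a quick illustration of that point, here is a minimal sketch (not AiM's actual code) showing why a left-padded 1D convolution is causal by construction, with no mask needed; the recurrent state-space scan is likewise causal because the hidden state is built up strictly left to right.

```python
# Minimal sketch: a causal 1D convolution pads only on the left, so the output
# at position t depends only on inputs at positions <= t.
import torch
import torch.nn.functional as F

B, D, L, K = 2, 4, 8, 4            # batch, channels, sequence length, kernel size
x = torch.randn(B, D, L)
w = torch.randn(D, 1, K)           # depthwise kernel, one filter per channel

# Left-pad by K-1 so the convolution never sees future positions.
y = F.conv1d(F.pad(x, (K - 1, 0)), w, groups=D)    # shape (B, D, L)

# Check causality: perturbing a future position leaves earlier outputs unchanged.
x2 = x.clone()
x2[:, :, 5] += 1.0
y2 = F.conv1d(F.pad(x2, (K - 1, 0)), w, groups=D)
assert torch.allclose(y[:, :, :5], y2[:, :, :5])
```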
Thanks a lot for the help.
Hi @hp-l33, if I don't want AiM's Mamba model to see only the previous tokens when predicting the current token, but instead want it to see all tokens, how should the Mamba model be set up?
Unfortunately, the vanilla Mamba block used in AiM cannot attend to tokens bidirectionally. You might consider alternative approaches.
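For the "alternative approaches" mentioned above, one common workaround is to run a causal block over both the sequence and its reverse and combine the two outputs. The sketch below is illustrative only and is not part of AiM's API; `CausalConvBlock` is a toy stand-in for a real Mamba layer.

```python
# Hedged sketch of a bidirectional wrapper around a causal sequence block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """Toy left-to-right block (placeholder for a real Mamba layer)."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.kernel_size = kernel_size

    def forward(self, x):                        # x: (batch, length, dim)
        x = x.transpose(1, 2)                    # -> (batch, dim, length)
        x = F.pad(x, (self.kernel_size - 1, 0))  # left-pad: causal
        return self.conv(x).transpose(1, 2)

class BidirectionalWrapper(nn.Module):
    """Run the causal block forward and on the reversed sequence, then combine."""
    def __init__(self, dim: int):
        super().__init__()
        self.fwd = CausalConvBlock(dim)
        self.bwd = CausalConvBlock(dim)

    def forward(self, x):                        # x: (batch, length, dim)
        out_fwd = self.fwd(x)
        out_bwd = self.bwd(torch.flip(x, dims=[1]))
        # Flip the backward output back so positions line up, then combine.
        return out_fwd + torch.flip(out_bwd, dims=[1])

y = BidirectionalWrapper(dim=16)(torch.randn(2, 10, 16))  # every position now sees all tokens
```

Note that combining a forward and a reversed pass gives up the strict causality needed for autoregressive next-token prediction, so it is not a drop-in change for AiM's generation setup.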
Thanks~
Thank you for your work.
I noticed that you didn't seem to add a lower triangular matrix as an attention mask in the Mamba block to ensure that the current token only attends to previous tokens. This appears to be inconsistent with "next-token prediction". Could you please explain the reason for this?
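For context, the lower triangular mask referred to above is the standard causal mask used in Transformer self-attention; a minimal, illustrative sketch (not taken from AiM):

```python
# Standard causal attention mask: row t can only weight positions <= t.
import torch

L = 6
scores = torch.randn(L, L)                          # raw attention scores
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))   # block attention to future tokens
attn = scores.softmax(dim=-1)
```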