
attention masks in each layer #5

Closed
jiaosiyu1999 opened this issue Sep 25, 2024 · 5 comments

Comments

@jiaosiyu1999

Thank you for your work.
I noticed that you don't seem to add a lower-triangular matrix as an attention mask in the Mamba block to ensure that the current token only attends to previous tokens. This appears to be inconsistent with "next-token prediction".
Could you please explain the reason for this?
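
For reference, by "lower-triangular matrix" I mean the standard causal mask used in Transformer attention, e.g.:

```python
import torch

# Standard Transformer causal mask: position i may attend to position j only if j <= i.
L = 4
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```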

@hp-l33
Owner

hp-l33 commented Sep 25, 2024

Hi! Regarding your mention of the attention mask: Mamba does not require an explicit attention mask (similar to RNNs). Mamba masks out future tokens through its causal 1D convolution and its recurrent, RNN-like state-space scan, ensuring that the current token can only attend to previous tokens. This design aligns with the goal of "next-token prediction." For more details, please refer to the original Mamba paper.
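
For reference, here is a minimal sketch of how a left-padded ("causal") depthwise 1D convolution provides this masking without any explicit mask; the class and names below are illustrative, not AiM's actual code:

```python
import torch
import torch.nn as nn

# Illustrative only. A depthwise Conv1d padded by (kernel_size - 1) and trimmed on the
# right: the output at time t depends only on inputs at times <= t, so no mask is needed.
class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(
            channels, channels,
            kernel_size=kernel_size,
            padding=kernel_size - 1,   # pads both sides; the extra right side is trimmed below
            groups=channels,           # depthwise, as in typical Mamba-style conv layers
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        y = self.conv(x)
        return y[..., : x.shape[-1]]   # drop the trailing positions that would peek at the future

# Quick check: perturbing later inputs never changes earlier outputs.
conv = CausalConv1d(channels=8)
x = torch.randn(1, 8, 16)
y1 = conv(x)
x2 = x.clone()
x2[..., 10:] += 1.0                    # perturb only positions >= 10
y2 = conv(x2)
assert torch.allclose(y1[..., :10], y2[..., :10])  # outputs before position 10 are unchanged
```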

@jiaosiyu1999
Author

Thanks a lot for the help.

@maxin-cn

Hi @hp-l33, if I don't want AiM's Mamba model to see only the previous tokens when predicting the current token, but instead want it to see all tokens, how should the Mamba model be set up?

@hp-l33
Owner

hp-l33 commented Nov 20, 2024

Unfortunately, the vanilla (unidirectional) Mamba used in AiM cannot attend to tokens bidirectionally. You might consider alternative approaches.
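
For example, bidirectional Mamba variants (such as Vision Mamba) run one causal scan left-to-right and a second one over the reversed sequence, then merge the two outputs. A rough sketch of that idea, assuming the `mamba_ssm` package; `BiMambaBlock` and the merge-by-sum are illustrative choices, not anything shipped with AiM:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes the mamba-ssm package is installed

# Sketch of a bidirectional scan: one Mamba block over the sequence, another over its
# reverse, merged so that every position sees both past and future context.
class BiMambaBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)   # causal left-to-right scan
        self.bwd = Mamba(d_model=d_model)   # causal scan over the flipped sequence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        out_fwd = self.fwd(x)
        out_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])  # flip back to original order
        return out_fwd + out_bwd            # simple merge; concatenation or gating also works

# Usage (shapes only; mamba-ssm's selective-scan kernels require a CUDA build to run):
# block = BiMambaBlock(d_model=256).cuda()
# y = block(torch.randn(2, 64, 256, device="cuda"))
```

Note that a block like this no longer matches autoregressive next-token training, so it only makes sense for settings where the full token sequence is available up front.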

@maxin-cn

Thanks~
