
attention masks in each layer #5

Closed
jiaosiyu1999 opened this issue Sep 25, 2024 · 5 comments

Comments

@jiaosiyu1999

Thank you for your work.
I noticed that you don't seem to add a lower-triangular matrix as an attention mask in the Mamba block to ensure that the current token only attends to previous tokens. This appears to be inconsistent with "next-token prediction".
Could you please explain the reason for this?
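
For reference, by "lower-triangular matrix" I mean the standard causal mask used in Transformer attention, e.g.:

```python
import torch

# Standard Transformer causal mask: position i may attend to position j only if j <= i.
L = 4
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```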

@hp-l33
Owner

hp-l33 commented Sep 25, 2024

Hi! Regarding your mention of the attention mask: Mamba does not require an explicit attention mask (similar to RNNs). Mamba masks out future tokens through its causal 1D convolution and its recurrent, RNN-like state-space scan, ensuring that the current token can only attend to previous tokens. This design aligns with the goal of "next-token prediction." For more details, please refer to the original Mamba paper.
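
For reference, here is a minimal sketch of how a left-padded ("causal") depthwise 1D convolution provides this masking without any explicit mask; the class and names below are illustrative, not AiM's actual code:

```python
import torch
import torch.nn as nn

# Illustrative only. A depthwise Conv1d padded by (kernel_size - 1) and trimmed on the
# right: the output at time t depends only on inputs at times <= t, so no mask is needed.
class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(
            channels, channels,
            kernel_size=kernel_size,
            padding=kernel_size - 1,   # pads both sides; the extra right side is trimmed below
            groups=channels,           # depthwise, as in typical Mamba-style conv layers
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        y = self.conv(x)
        return y[..., : x.shape[-1]]   # drop the trailing positions that would peek at the future

# Quick check: perturbing later inputs never changes earlier outputs.
conv = CausalConv1d(channels=8)
x = torch.randn(1, 8, 16)
y1 = conv(x)
x2 = x.clone()
x2[..., 10:] += 1.0                    # perturb only positions >= 10
y2 = conv(x2)
assert torch.allclose(y1[..., :10], y2[..., :10])  # outputs before position 10 are unchanged
```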

@jiaosiyu1999
Author

Thanks a lot for the help.

@maxin-cn

Hi @hp-l33, if I don't want AiM's Mamba model to see only the previous tokens when predicting the current token, but instead want it to see all tokens, how should the Mamba model be set up?

@hp-l33
Owner

hp-l33 commented Nov 20, 2024

Unfortunately, the vanilla (unidirectional) Mamba used in AiM cannot attend to tokens bidirectionally. You might consider alternative approaches.
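
For example, bidirectional Mamba variants (such as Vision Mamba) run one causal scan left-to-right and a second one over the reversed sequence, then merge the two outputs. A rough sketch of that idea, assuming the `mamba_ssm` package; `BiMambaBlock` and the merge-by-sum are illustrative choices, not anything shipped with AiM:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes the mamba-ssm package is installed

# Sketch of a bidirectional scan: one Mamba block over the sequence, another over its
# reverse, merged so that every position sees both past and future context.
class BiMambaBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)   # causal left-to-right scan
        self.bwd = Mamba(d_model=d_model)   # causal scan over the flipped sequence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        out_fwd = self.fwd(x)
        out_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])  # flip back to original order
        return out_fwd + out_bwd            # simple merge; concatenation or gating also works

# Usage (shapes only; mamba-ssm's selective-scan kernels require a CUDA build to run):
# block = BiMambaBlock(d_model=256).cuda()
# y = block(torch.randn(2, 64, 256, device="cuda"))
```

Note that a block like this no longer matches autoregressive next-token training, so it only makes sense for settings where the full token sequence is available up front.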

@maxin-cn

Thanks~
