
Fix: Mamba2 norm_before_gate usage #32686

Merged 3 commits into huggingface:main on Aug 20, 2024

Conversation

@vasqu (Contributor) commented Aug 14, 2024

What does this PR do?

Fixes the default value of norm_before_gate (it should be False) in the default Mamba2 config. Additionally, this implements the other variant, norm_before_gate=True. Currently, the flag only affects continued training from Codestral on the fast path with the fused kernel:

out, ssm_state = mamba_split_conv1d_scan_combined(
    projected_states,
    self.conv1d.weight.squeeze(1),
    self.conv1d.bias,
    self.dt_bias,
    A,
    D=self.D,
    chunk_size=self.chunk_size,
    seq_idx=None,  # was seq_idx
    activation=self.activation,
    rmsnorm_weight=self.norm.weight,
    rmsnorm_eps=self.norm.variance_epsilon,
    outproj_weight=self.out_proj.weight,
    outproj_bias=self.out_proj.bias,
    headdim=self.head_dim,
    ngroups=self.n_groups,
    norm_before_gate=self.norm_before_gate,
    return_final_states=True,
    **dt_limit_kwargs,
)

Discovered in #32580 (comment)
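For context on what the flag changes: norm_before_gate selects whether the RMS norm is applied before or after the SiLU gate. Below is a minimal, self-contained sketch of the two orderings in plain PyTorch (illustrative names and simplified dtype handling, not the actual transformers module), assuming the kernel convention that norm_before_gate=True means norm(x) * silu(z) and False means norm(x * silu(z)):

import torch
import torch.nn.functional as F
from torch import nn

class GatedRMSNormSketch(nn.Module):
    """Illustrative gated RMSNorm; simplified stand-in for MambaRMSNormGated."""

    def __init__(self, hidden_size, eps=1e-6, norm_before_gate=False):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps
        self.norm_before_gate = norm_before_gate

    def _rmsnorm(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        return self.weight * hidden_states * torch.rsqrt(variance + self.variance_epsilon)

    def forward(self, hidden_states, gate=None):
        hidden_states = hidden_states.to(torch.float32)
        if gate is None:
            return self._rmsnorm(hidden_states)
        gate = F.silu(gate.to(torch.float32))
        if self.norm_before_gate:
            return self._rmsnorm(hidden_states) * gate  # norm(x) * silu(z)
        return self._rmsnorm(hidden_states * gate)      # norm(x * silu(z))

The False ordering (gate first, then norm) is what the existing slow path computes, which is why the config default should be False and why the same value has to reach the fused kernel on the fast path.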

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@molbap @ArthurZucker

@vasqu (Contributor, Author) commented Aug 14, 2024

Tbh, we could also just remove norm_before_gate from the config and only allow the False path. I don't mind either way.

@vasqu (Contributor, Author) commented Aug 14, 2024

Will rebase later when the other fix(es) are in main 😄

@molbap (Contributor) left a comment

Great, thanks a lot! I'm in favor of keeping norm_before_gate, as it's a valid config option in the original code and people might want to experiment with it. Pinging @ArthurZucker for final review :)

@@ -248,7 +254,9 @@ def __init__(self, config: Mamba2Config, layer_idx: int):
         A = torch.arange(1, self.num_heads + 1)
         self.A_log = nn.Parameter(torch.log(A))
         self.A_log._no_weight_decay = True
-        self.norm = MambaRMSNormGated(self.intermediate_size, eps=self.layer_norm_epsilon)
+        self.norm = MambaRMSNormGated(
+            self.intermediate_size, eps=self.layer_norm_epsilon, norm_before_gate=config.norm_before_gate
+        )
Contributor:

great, maybe let's add this as an attribute of Mamba2Mixer in the init to get all config-derived args in the same place!

Contributor (Author):

It was already in the Mixer class; I had just read it from the passed config. Changed it to use self now instead 👍
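For readers skimming the thread, the suggested pattern is roughly the following (a toy sketch with made-up names such as TinyConfig and MixerSketch, not the actual Mamba2Mixer diff): read every config-derived argument once in __init__ and let the rest of the module use self.

from dataclasses import dataclass
from torch import nn

@dataclass
class TinyConfig:
    # illustrative stand-in for Mamba2Config
    hidden_size: int = 16
    norm_before_gate: bool = False

class MixerSketch(nn.Module):
    """Toy stand-in for Mamba2Mixer, only showing where the flag is read."""

    def __init__(self, config: TinyConfig):
        super().__init__()
        # config-derived arguments are stored once, in the constructor
        self.norm_before_gate = config.norm_before_gate
        self.norm = nn.LayerNorm(config.hidden_size)  # placeholder for the gated RMSNorm

    def forward(self, hidden_states):
        # downstream code (e.g. the fused-kernel call) reads self.norm_before_gate
        # instead of reaching back into the config
        return self.norm(hidden_states)

mixer = MixerSketch(TinyConfig())
print(mixer.norm_before_gate)  # False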


def forward(self, hidden_states, gate=None):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)

if gate is not None:
if gate is not None and not self.norm_before_gate:
Contributor:

👌

@ArthurZucker (Collaborator) left a comment

Hey! Not entirely sure this is super transformers-friendly, as a researcher would just copy-paste the modeling code and add all the if-else branches!
Our usual motivation is to add a new model only when there is a good pretrained checkpoint that demonstrates the use of this param!

If there is strong demand from the community, why not, but in general such changes go against the philosophy! 🤗

@vasqu (Contributor, Author) commented Aug 19, 2024

Hmm, there are two problems with the current implementation though:

  1. The norm_before_gate flag in the config was incorrectly set to True by default, but the implementation follows the path as if it were False. This is problematic because the config suggests otherwise, and it enables incorrect Codestral training via the fast path, where the flag is passed directly to the fused kernel (see the initial description and the code snippet); a small sketch of how the two orderings diverge follows below.
  2. Now that the flag exists, it would be odd for it to support only one path while suggesting otherwise.

If I read it correctly, would you prefer removing this flag in the config and adjusting the code to only follow one path (i.e. norm_before_gate=False)?
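To make point 1 concrete, here is a small self-contained sketch (plain PyTorch, not repository code) showing that the two orderings produce different outputs, so a config claiming norm_before_gate=True while the slow path hard-codes the other ordering silently diverges from the fast path:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x, z = torch.randn(2, 8), torch.randn(2, 8)  # dummy hidden states and gate
weight, eps = torch.ones(8), 1e-6

def rmsnorm(h):
    return weight * h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)

norm_then_gate = rmsnorm(x) * F.silu(z)  # what norm_before_gate=True would compute
gate_then_norm = rmsnorm(x * F.silu(z))  # what the existing slow path computes

print(torch.allclose(norm_then_gate, gate_then_norm))  # prints False: the orderings disagree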

@ArthurZucker (Collaborator):

> If I read it correctly, would you prefer removing this flag in the config and adjusting the code to only follow one path (i.e. norm_before_gate=False)?

yes! 🤗 this would be less confusing IMO!

@vasqu (Contributor, Author) commented Aug 20, 2024

@ArthurZucker Should be good now 👀

@ArthurZucker (Collaborator) left a comment

Thanks, let's update the PR title.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker merged commit c63a3d0 into huggingface:main on Aug 20, 2024
21 checks passed
@vasqu deleted the mamba2-gated-norm-fix branch on August 20, 2024 at 19:54
@vasqu (Contributor, Author) commented Aug 20, 2024

I guess the renaming of the PR is too late now 😓

Titus-von-Koeller pushed a commit to jiqing-feng/transformers that referenced this pull request Aug 21, 2024
* mamba2 uses norm_before_gate=False

* small nit

* remove norm_before_gate flag and follow False path only
@ArthurZucker (Collaborator):

No worries 🤗
