Starcoder2 model #29120
Conversation
🔥 looks very good!
- Let's not support Mistral
- Let's try to take in the new API from "Fix static generation when compiling!" (#28937) and "[Core generation] Adds support for static KV cache" (#27931)
The rest is pretty much alright!
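For context, a minimal sketch of what the static KV cache API from #27931/#28937 looks like on the user side, assuming a released Starcoder2 checkpoint (the checkpoint id and exact settings here are illustrative, not part of this PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-7b"  # hypothetical checkpoint id, for illustration only
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Opt in to the pre-allocated static KV cache so every decoding step sees fixed
# tensor shapes, which is what makes the forward pass compilable.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```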
return self.weight * hidden_states.to(input_dtype)

# Copied from transformers.models.llama.modeling_llama.LlamaRotaryEmbedding with Llama->Starcoder2
fix-copies will not let this pass; this should be copied from Mistral, since we changed Llama for the compiled static cache.
I would also rather we support the static cache, as the API got quite a lot cleaner.
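As a sketch of what fix-copies expects here (assuming the Mistral rotary embedding matches after the rename; the body is elided for brevity):

```python
import torch.nn as nn

# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Starcoder2
class Starcoder2RotaryEmbedding(nn.Module):
    # Body elided; make fix-copies keeps it in sync with the Mistral
    # implementation referenced in the header comment above.
    ...
```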
return torch.cat((-x2, x1), dim=-1)

# Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
same here, Llama is different now; make fix-copies will help you fix this!
return hidden_states

class Starcoder2GatedMLP(nn.Module):
probably missing a Copied from mention here (Mistral)
It has small changes (bias + dropout I think)
Should we remove the copied mention from all the classes/methods where we added dropout?
# Copied from transformers.models.mistral.modeling_mistral.MistralAttention.forward
# Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2 with Mistral->Starcoder2
# Copied from transformers.models.mistral.modeling_mistral.MistralModel.forward with MISTRAL->STARCODER2,Mistral->Starcoder2
yes, otherwise check-copies will fail 😉
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
Suggested change (delete, this is not used in Mistral anyway):
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

class Starcoder2Attention(nn.Module):
it would make sense IMO to follow the Llama implementation for the static cache (with the additional cache positions), but this can go in another PR, no worries 🤗
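A rough, hypothetical sketch of the cache-position idea (not the actual Llama code): the caller passes the absolute positions of the new tokens, and the pre-allocated cache is written in place at those slots, so the cache tensors keep a constant shape.

```python
import torch

def update_static_cache(
    key_cache: torch.Tensor,      # (batch, num_heads, max_cache_len, head_dim), pre-allocated
    value_cache: torch.Tensor,    # same shape as key_cache
    key_states: torch.Tensor,     # (batch, num_heads, q_len, head_dim) for the new tokens
    value_states: torch.Tensor,
    cache_position: torch.Tensor, # (q_len,) absolute positions of the new tokens
):
    # Write the new keys/values into their absolute slots; fixed cache shapes are
    # what make the forward pass friendly to torch.compile.
    key_cache.index_copy_(2, cache_position, key_states)
    value_cache.index_copy_(2, cache_position, value_states)
    return key_cache, value_cache

# Tiny usage example with dummy shapes (prefill of 3 tokens into an 8-slot cache).
bsz, heads, max_len, head_dim, q_len = 1, 2, 8, 4, 3
k_cache = torch.zeros(bsz, heads, max_len, head_dim)
v_cache = torch.zeros(bsz, heads, max_len, head_dim)
k_new = torch.randn(bsz, heads, q_len, head_dim)
v_new = torch.randn(bsz, heads, q_len, head_dim)
update_static_cache(k_cache, v_cache, k_new, v_new, torch.arange(q_len))
```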
self.self_attn = STARCODER2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

self.mlp = STARCODER2_MLP_CLASSES[config.mlp_type](config)

self.input_layernorm = STARCODER2_NORMALIZATION_CLASSES[config.norm_type](
    config.hidden_size, eps=config.norm_epsilon
)
self.post_attention_layernorm = STARCODER2_NORMALIZATION_CLASSES[config.norm_type](
    config.hidden_size, eps=config.norm_epsilon
)
this is not what we usually do in transformers. The attention is a specific case 😅
- are all of these used in the default Starcoder2?
- if not, then let's not support Mistral. Mistral is a different architecture
The reason attention is allowed is that it uses the same parameters -> same "Attention" with a different forward, whereas here it is really a different architecture, which goes against the transformers philosophy
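To illustrate the point (a hypothetical sketch, not a proposed diff, and assuming the default Starcoder2 uses LayerNorm with biased MLPs): the usual transformers pattern would be for the layer to build only what the default architecture uses, driven by plain config values, rather than dispatching through MLP/normalization registries.

```python
import torch.nn as nn

class Starcoder2DecoderLayerSketch(nn.Module):
    # Hypothetical sketch: no STARCODER2_MLP_CLASSES / STARCODER2_NORMALIZATION_CLASSES
    # registries, just the one normalization and MLP the default model needs.
    def __init__(self, hidden_size: int, intermediate_size: int, norm_epsilon: float, residual_dropout: float):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size, eps=norm_epsilon)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size, eps=norm_epsilon)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size, bias=True),
            nn.GELU(approximate="tanh"),
            nn.Linear(intermediate_size, hidden_size, bias=True),
            nn.Dropout(residual_dropout),
        )
```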
if self._attn_implementation == "flash_attention_2":
    # 2d mask is passed through the layers
    attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == "sdpa" and not output_attentions:
    # output_attentions=True can not be supported when using SDPA, and we fall back on
    # the manual implementation that requires a 4D causal mask in all cases.
    attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
        attention_mask,
        (batch_size, seq_length),
        inputs_embeds,
        past_key_values_length,
    )
else:
    # 4d mask is passed through the layers
    attention_mask = _prepare_4d_causal_attention_mask(
        attention_mask,
        (batch_size, seq_length),
        inputs_embeds,
        past_key_values_length,
        sliding_window=self.config.sliding_window,
    )
see the new Llama code for this, which was simplified. I'd rather we take it directly for the attention 😉
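For readers not following the Llama refactor, the gist is that the branches above collapse into one path that always builds a single additive 4D mask (shared by the SDPA and eager paths, while flash-attention keeps the 2D mask). A rough, self-contained sketch of such a mask builder, not the actual transformers helper:

```python
import torch

def build_causal_mask(batch_size, seq_length, past_length, dtype, device, padding_mask=None):
    # Hypothetical sketch of a single 4D additive mask path.
    target_length = past_length + seq_length
    min_value = torch.finfo(dtype).min
    # Start from a fully-blocked (seq_length, target_length) score bias...
    mask = torch.full((seq_length, target_length), min_value, dtype=dtype, device=device)
    # ...then zero out every position a query token may attend to
    # (all cached tokens plus itself and earlier tokens in the current chunk).
    allowed = torch.arange(target_length, device=device) <= (
        torch.arange(seq_length, device=device) + past_length
    ).unsqueeze(-1)
    mask = mask.masked_fill(allowed, 0.0)
    # Expand to (batch, 1, seq_length, target_length) and fold in padding, if any.
    mask = mask[None, None, :, :].expand(batch_size, 1, seq_length, target_length).clone()
    if padding_mask is not None:  # (batch, target_length), 1 = keep, 0 = pad
        mask = mask.masked_fill(padding_mask[:, None, None, :] == 0, min_value)
    return mask

# Example: 2 new tokens attending over 3 cached tokens plus themselves.
print(build_causal_mask(1, 2, 3, torch.float32, torch.device("cpu")))
```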
@unittest.skip("Starcoder2 buffers include complex numbers, which breaks this test")
def test_save_load_fast_init_from_base(self):
    pass
I might have missed it, but I have not seen where these complex-number buffers are?
I re-created a PR here since Joel is on vacation: #29215
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing as #29215 was merged and Starcoder2 is officially supported
The Starcoder2 model, adapted from Mistral. All changes are done through options, so Mistral itself is still supported (see the config sketch below). Main changes:
- Embedding and residual dropout
It does not support absolute embeddings, so it can't support Santacoder or Starcoder.
Todo:
- [Core generation] Adds support for static KV cache #27931
- [CLeanup] Revert SDPA attention changes that got in the static kv cache PR #29027 (and future changes from Feb. 19)
@younesbelkada
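A rough sketch of what "changes through options" looks like at the config level. The attribute names below reflect the switches described above (dropout, biased layers, LayerNorm epsilon, sliding window) but are illustrative and may not match the merged Starcoder2Config exactly; the tiny sizes are only so the sketch instantiates quickly.

```python
from transformers import Starcoder2Config, Starcoder2ForCausalLM

config = Starcoder2Config(
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=2,
    num_attention_heads=4,
    embedding_dropout=0.1,   # new: dropout on the input embeddings
    residual_dropout=0.1,    # new: dropout on the residual branches
    use_bias=True,           # Starcoder2 keeps biases, unlike Mistral
    norm_epsilon=1e-5,       # LayerNorm epsilon
    sliding_window=4096,     # kept from the Mistral-style attention
)
model = Starcoder2ForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))
```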