Add Starcoder 2 #502
Conversation
@awni I haven't had time to make progress yet, but this solution looks like it's the right direction
@awni not yet, HF is down 😞. When trying to generate locally, despite the files being there and the import working, I'm getting an error; but when I do it in a Python shell I do not get that error, only the HF-down error. Is this behaviour intended?
@lazarust happy to collaborate. This is still a draft based on the original Mistral implementation; I think we need some more changes from huggingface/transformers#29120
No it's not. Make sure you do not have another MLX LM installed (…)
Quick question @awni: is the MLX equivalent of gelu_pytorch_tanh this?

    def __call__(self, x) -> mx.array:
        return self.w2(nn.gelu(self.w1(x)) * self.w3(x))
@Muhtasham the tanh approximation in PyTorch is not strictly the same as …
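For reference, the tanh GELU approximation that gelu_pytorch_tanh refers to can be written directly with mlx.core ops. This is only an illustrative sketch of the formula, not necessarily what the PR ended up using:

    import math

    import mlx.core as mx

    def gelu_tanh_approx(x: mx.array) -> mx.array:
        # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        return 0.5 * x * (1.0 + mx.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))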
How is this going? Were you able to generate any sensible code yet?
Having some issues. I went through the model config and nothing seems extra, any hints @awni? Command:

    python -m mlx_lm.convert \
        --hf-path bigcode/starcoder2-3b \
        -q \
        --upload-repo mlx-community/starcoder2-3b-4bit
I suggest either referring to the transformers implementation if you are familiar with the codebase, or loading the model weights and checking the weight names, which will give you a hint of the model structure.
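As a rough sketch of that weight-name check (assuming mlx is installed and the checkpoint has been downloaded locally; large checkpoints are usually sharded, so the file name below is only illustrative):

    import mlx.core as mx

    # Load a safetensors shard and print every parameter name and shape to
    # infer the layer naming and overall model structure.
    weights = mx.load("model.safetensors")
    for name, value in weights.items():
        print(name, value.shape)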
And please note that StarCoder2 seems to have enabled the bias weight for all the linear layers, so you may need to enable it in nn.Linear.
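For illustration, enabling the bias would look roughly like this in MLX (the layer names and dimensions below are placeholders, not the exact StarCoder2 config):

    import mlx.nn as nn

    hidden_size, intermediate_size = 3072, 12288
    # StarCoder2 keeps biases on its linear projections, so bias=True.
    c_fc = nn.Linear(hidden_size, intermediate_size, bias=True)
    c_proj = nn.Linear(intermediate_size, hidden_size, bias=True)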
Thanks for the help here @mzbac!! @Muhtasham after those fixes is it working?
Yeah, basically it's just a different way to structure the prompt and allow the model to autocomplete the middle portion. A more practical example is similar to using GitHub Copilot, where the model will fill in the content at the cursor position.
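As a hedged illustration, a fill-in-the-middle (FIM) prompt is usually assembled from special tokens like the ones below (check the tokenizer config for the exact token strings StarCoder2 uses):

    # Hypothetical FIM prompt: the model is asked to generate the code that
    # belongs between the prefix and the suffix.
    prefix = "def print_hello():\n    "
    suffix = "\n\nprint_hello()"
    prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    print(prompt)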
Ok, I'm a little confused as to what to do with this. The model doesn't generate sensible outputs, so is the plan to land it so people can use LoRA / fine-tunes? Or do we need to fix the prompt? Or is there some other thing that I'm missing?
I will do some testing tonight. In terms of fine-tuning, my understanding is that a FIM model only differs in the prompt format of its training data; otherwise it is normal fine-tuning. However, I have not done any FIM fine-tuning, so I may be wrong.
I did some quick tests and it looks like there are issues in the mlx implementation. I couldn't figure out exactly where, but when you use the transformers example, you can see that the model generates correct FIM code. However, this doesn't work in the mlx implementation.
You tried it with the correct prompt right? Do we add the FIM prompt by default or is that something you have to do manually at the moment?
I used …
Based on @Blaizzy's PR (#518), the instruction prompt failed because it was using traditional RoPE. Changing to traditional=False should fix it. Edit: …
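The change being described is roughly the following one-liner in the attention module (dims and base here are placeholder values, not the exact StarCoder2 config):

    import mlx.nn as nn

    # Non-traditional RoPE, as suggested above.
    rope = nn.RoPE(dims=128, traditional=False, base=10000)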
Ok, just to make sure I understand: the only difference between this PR and #518 is that the RoPE traditional is False?
If that's the case, let's simply update that here? It's a one-line change. I think it makes more sense than starting on a separate PR?
Can confirm, seems to work now! Thanks!
Thanks for the addition and contributions everyone!
@awni, there are some minor model args that need to be cleaned up. For example, the use of …
Didn't notice the last update on this PR early Saturday. I'm happy I could help make it work for everyone 🚀!
@mzbac I can do that :)
Could you elaborate on this?
Yeah, in the current implementation here https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/models/starcoder2.py#L71-L76, it would be updated to something like: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/models/llama.py#L83-L85.
I see, thanks for clarifying! Will do. I thought so too because I used the Llama repeat. If I may ask, why was it done differently?
Yeah, based on the previous PR, simply repeating is faster than concatenating. Just FYI: #443
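For context, a rough sketch of the repeat-based approach being referenced (this is just the idea, not the exact code in llama.py):

    import mlx.core as mx

    def repeat_kv(x: mx.array, repeats: int) -> mx.array:
        # x: (batch, n_kv_heads, seq_len, head_dim). Each KV head is repeated
        # so the head count matches the number of query heads.
        if repeats == 1:
            return x
        return mx.repeat(x, repeats, axis=1)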
This feature came at the perfect time for me! 24 hours ago there was no support and now there is. Love open source! 🩶
So I'm trying to fine-tune StarCoder2 using QLoRA.

$ pwd
>>> /.../mlx-examples/llms

When I issue the fine-tuning command, I get an AttributeError:

$ python -m mlx_lm.lora \
--model mlx-community/starcoder2-3b-4bit \
--train \
--data $(realpath ../lora/data) \
--iters 10
>>> None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Loading pretrained model
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 61422.86it/s]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/sameenislam/source/sameenislam/mlx-examples/llms/mlx_lm/lora.py", line 246, in <module>
run(args)
File "/Users/sameenislam/source/sameenislam/mlx-examples/llms/mlx_lm/lora.py", line 172, in run
linear_to_lora_layers(model, args.lora_layers)
File "/Users/sameenislam/source/sameenislam/mlx-examples/llms/mlx_lm/tuner/utils.py", line 27, in linear_to_lora_layers
if model.model_type in [
^^^^^^^^^^^^^^^^
File "/Users/sameenislam/anaconda3/lib/python3.11/site-packages/mlx/nn/layers/base.py", line 137, in __getattr__
super(Module, self).__getattr__(key, val)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'super' object has no attribute '__getattr__'. Did you mean: '__setattr__'?

Has anyone encountered this? Also including this to show that inference is working:

$ python -m mlx_lm.generate --model mlx-community/starcoder2-3b-4bit --prompt "hello"
>>> None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 145347.17it/s]
==========
Prompt: hello
_1(void) {
printf("hello, world\n");
}
void helloworld_print_double_1(double a) {
printf("%f\n", a);
}
double helloworld_square_1(double a) {
return a * a;
}
double helloworld_square_2(double a) {
return a * a;
}
double helloworld_square_
==========
Prompt: 16.348 tokens-per-sec
Generation: 40.031 tokens-per-sec
There is a missing model_type in the starcoder2 model args. You can try adding it to your local code as shown in this PR.
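A hedged sketch of that fix: the model args need a model_type field so that linear_to_lora_layers can dispatch on it (the other fields below are illustrative, not the full StarCoder2 config):

    from dataclasses import dataclass

    @dataclass
    class ModelArgs:
        model_type: str = "starcoder2"
        hidden_size: int = 3072
        num_hidden_layers: int = 30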
Awesome, I've made the change locally from PR #522 and it's working like a charm!

$ python -m mlx_lm.lora \
--model $(realpath mlx_model) \
--train \
--data $(realpath ../lora/data) \
--iters 10
>>> None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Loading pretrained model
Total parameters 602.983M
Trainable parameters 1.212M
Loading datasets
Training
Starting training..., iters: 10
Iter 1: Val loss 2.363, Val took 30.839s
Iter 10: Train loss 2.274, Learning Rate 1.000e-05, It/sec 0.490, Tokens/sec 196.036, Trained Tokens 3999
Saved final adapter weights to adapters.npz.

P.S. the model was converted locally with:

$ python -m mlx_lm.convert \
--hf-path bigcode/starcoder2-3b \
-q
* Add Starcoder2 model and update utils.py
* Refactor model arguments and modules in starcoder2.py
* Refactor FeedForward class to MLP in starcoder2.py
* Fix typo
* pre-commit
* Refactor starcoder2.py: Update model arguments and modules
* Fix LM head and MLP layers
* Rename input layer norm
* Update bias in linear layers
* Refactor token embeddings in Starcoder2Model
* Rename to standard HF attention layer name
* Add LayerNorm
* Add transposed token embeddings (like in Gemma)
* Refactor MLP and TransformerBlock classes
* Add tie_word_embeddings option to ModelArgs and update Model implementation
* Add conditional check for tying word embeddings in Starcoder2Model
* Fix bias in lm_head linear layer
* Remove unused LayerNorm in stablelm
* Update transformers dependency to use GitHub repository
* fix lm head bug, revert transformer req
* Update RoPE initialization in Attention class

---------

Co-authored-by: Awni Hannun <[email protected]>
Adding new code models that dropped
merged to Transformers