QLoRA on Open LLaMA 13B fails #24245
Comments
Hi @nivibilla, please make sure to search the issues first, as it's possible this has previously been reported and resolved. Could you try installing accelerate, peft and transformers from source and rerunning your script?
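(For reference, installing all three libraries from their GitHub sources in a notebook typically looks like the commands below; these exact lines are an assumption, not copied from the thread.)

```python
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
```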
Sorry, my bad. I am already installing from source, so I'm not sure what went wrong. In any case, I will test again and let you know.
I did as you asked, @amyeroberts, and installed from source. But I still get the same error.
It was fixed when I used this particular branch.
Will this branch be merged?
Note that I am using 4-bit quantisation in training, which may be the cause of the issue, as mentioned in #23935.
Another issue I have encountered with the branch I tested is that it doesn't save an adapter_config.json for the checkpoints.
Update: fixed the adapter_config saving issue with a custom callback (see the sketch further down the thread).
However, the original issue still remains when using the normal installation instead of the particular commit mentioned.
That's great to hear! Peculiar that it didn't work from source though 🤔
This commit has already been merged, I believe, and is part of the latest release. Could you confirm the version of transformers that was installed when the problem was happening initially?
Hmmm... I have no idea about this. cc @pacman100, who knows a lot more about PEFT and Trainer :)
I checked transformers.__version__ and got:
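(As an aside, the installed versions of the relevant libraries can be printed like this; the snippet is illustrative and not part of the original comment.)

```python
import accelerate
import peft
import transformers

# Print the installed versions of the three libraries involved.
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)
```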
I had this same issue today; it always stopped around 1 epoch with the same error. I was trying to fine-tune llama-13b as well, on my own dataset, which I know is correctly formatted.
Using git source pip install too. Trying |
cc @younesbelkada, as you've been working on the related issue.
@richardr1126 are your checkpoints saving properly? I had to write a custom callback as the adapter_config wasn't being written.
Yeah, I used your PeftSavingCallback below and added it to the callbacks param in the Trainer. It created the adapter_config and adapter_model and saved them into the checkpoint folders.
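For reference, a minimal sketch of what such a saving callback might look like. The exact code used in the thread was not preserved here, so the class body below is an assumption based on the behaviour described (writing the adapter files into each checkpoint directory via `save_pretrained`):

```python
import os

from transformers import TrainerCallback


class PeftSavingCallback(TrainerCallback):
    """Sketch: save the PEFT adapter (adapter_config.json + adapter weights) at every checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        # The Trainer writes checkpoints into output_dir/checkpoint-<global_step>.
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # For a PeftModel, save_pretrained writes adapter_config.json and the adapter weights.
        kwargs["model"].save_pretrained(checkpoint_dir)
```

It is then passed to the Trainer via the callbacks argument, e.g. `Trainer(..., callbacks=[PeftSavingCallback()])`, as described in the comment above.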
Hello @nivibilla, PR #24415 should fix this. Can you confirm the same?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think this works. Haven't tested though. Will close for now |
System Info
Installed by
!pip install -q -U git+https://github.com/huggingface/transformers.git
On Databricks
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Interestingly, it failed at exactly 1 epoch.
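The reproduction script itself was not captured in this copy of the issue. As a rough illustration only, a 4-bit QLoRA fine-tuning setup of the kind being discussed might look like the sketch below; the model id, dataset, and every hyperparameter are assumptions rather than the reporter's actual values.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "openlm-research/open_llama_13b"  # assumed checkpoint, for illustration only

# QLoRA-style 4-bit quantization (NF4 with bf16 compute).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Prepare the quantized model for k-bit training and attach LoRA adapters.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # illustrative target modules
        task_type="CAUSAL_LM",
    ),
)

# Placeholder dataset; the reporter used their own (correctly formatted) dataset.
dataset = load_dataset("imdb", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qlora-open-llama-13b",
        per_device_train_batch_size=1,
        num_train_epochs=3,
        save_strategy="epoch",  # saving happens at epoch boundaries, where the failure was reported
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    # callbacks=[PeftSavingCallback()],  # the checkpoint-saving callback sketched earlier could go here
)
trainer.train()
```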
Expected behavior
Run normally?