[FSDP+QLoRA] ValueError: Expected a cuda device, but got: cpu #1674
I successfully trained the LLaMA-3-70B model using the script from the official PEFT example, run_peft_qlora_fsdp.sh. However, I'm still encountering this problem when I set `use_dora=True`.
Thanks for reporting. It looks like the model is still on CPU at initialization time. Since initializing DoRA requires us to dequantize the bnb weights, which is not supported on CPU, we see this error. This should hopefully not be too hard to fix on our side. Meanwhile, perhaps you can adjust your script so that the base model is sent to GPU before calling `get_peft_model`.

Edit: Honestly, I'm not sure how the weights can be on CPU here; maybe some form of offloading? In that case, the problem probably runs deeper. Are you aware of any offloading going on here?
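For reference, a minimal sketch of that workaround, assuming a bitsandbytes 4-bit setup. The model id and LoRA hyperparameters below are placeholders, not taken from the report:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # placeholder model id
    quantization_config=bnb_config,
    device_map={"": 0},             # load the quantized weights on GPU 0, not CPU
)

peft_config = LoraConfig(
    r=16,                           # placeholder hyperparameters
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
    use_dora=True,                  # DoRA init dequantizes the bnb weights
)

# With the weights already on CUDA, DoRA's dequantization step succeeds.
model = get_peft_model(model, peft_config)
```

Note that this sidesteps the error for single-device setups; it does not address the FSDP flow, where the model is deliberately kept on CPU before sharding.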
I have this same issue. LoRA/DoRA, DDP LoRA/DoRA, QLoRA/QDoRA, DDP QLoRA/QDoRA, FSDP LoRA/DoRA, and FSDP QLoRA all work for me, but FSDP QDoRA does not.
Resolves #1674. For some users it is necessary to initialize the model on CPU, even when using bitsandbytes, which eventually requires a GPU. Since DoRA requires dequantizing the bnb weights at initialization, we need to temporarily move the corresponding model weights to GPU. After dequantization, the weights are moved back to CPU.
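A rough sketch of that pattern, not PEFT's exact implementation; the helper name is made up for illustration:

```python
import torch
import bitsandbytes as bnb

def dequantize_4bit_weight(weight):
    """Hypothetical helper: dequantize a bnb Params4bit weight that may live on CPU.

    bnb dequantization only runs on CUDA, so a CPU weight is moved to GPU,
    dequantized there, and the result is moved back to the original device.
    """
    device = weight.device
    on_cpu = device.type == "cpu"
    if on_cpu:
        weight = weight.to(torch.device("cuda"))  # bnb requires CUDA here
    dequantized = bnb.functional.dequantize_4bit(weight.data, weight.quant_state)
    if on_cpu:
        dequantized = dequantized.to(device)      # restore the original device
    return dequantized
```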
This fixed the issue I was having, but when using DoRA/QDoRA with FSDP it errors out: [rank0]: Traceback (most recent call last):
I just want to let you know that I'm still investigating; this issue is not forgotten :) It's just not that easy to understand what goes on under the hood with FSDP.
Update: DoRA and QDoRA training with FSDP should be fixed by #1806. If you install PEFT from the latest main, it should thus work. Please also check the PR description for how this was tested. If you give it a try, LMK whether it works.
System Info
pip list
8xA6000 48G, CUDA Version: 12.2
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder

Reproduction
1. Code from https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-141b-A35b
2. Set `use_dora=True` in `LoraConfig` (a minimal sketch is shown after this list)
3. Running with my modified command from the following
4. Raise `ValueError`
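For step 2, a minimal sketch of the config change; only `use_dora` differs from the recipe, and the other values here are illustrative, not the recipe's exact settings:

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    use_dora=True,  # triggers "ValueError: Expected a cuda device, but got: cpu"
)
```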
Expected behavior