
[FSDP+QLoRA] ValueError: Expected a cuda device, but got: cpu #1674

Closed
iseesaw opened this issue Apr 24, 2024 · 7 comments · Fixed by #1724

Comments


iseesaw commented Apr 24, 2024

System Info

pip list

accelerate                0.29.3
bitsandbytes              0.43.1
datasets                  2.14.6
huggingface-hub           0.20.3
llama-recipes             0.0.1
peft                      0.10.0
safetensors               0.4.2      
tokenizers                0.19.1
torch                     2.1.2
transformers              4.40.0
cupy-cuda12x              12.1.0
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105

8xA6000 48G, CUDA Version: 12.2

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Code from https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-141b-A35b
Set use_dora=True in LoraConfig.
Run with the following command (modified from the recipe):

ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file recipes/accelerate_configs/fsdp.yaml scripts/run_orpo.py recipes/zephyr-141b-A35b/orpo/config_qlora.yaml
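For reference, the PEFT-side setup this reproduction boils down to is roughly the following sketch. The model id, rank, and target modules are illustrative assumptions, not the exact values from the zephyr-141b-A35b recipe:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA-style 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model id; any 4-bit-loadable causal LM shows the same behavior.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# use_dora=True is the switch that triggers the error in the FSDP + QLoRA setup.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)

# Fails when the base model weights are still on CPU, as they are at this
# point in the FSDP + QLoRA workflow.
model = get_peft_model(model, peft_config)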

Running the command raises the following ValueError:

Traceback (most recent call last):
  File "/root/kyzhang/llms/UltraMedical/llm_dpo/run_sft.py", line 209, in <module>
    main()
  File "/root/kyzhang/llms/UltraMedical/llm_dpo/run_sft.py", line 141, in main
    trainer = SFTTrainer(
  File "/root/miniconda3/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 228, in __init__
    model = get_peft_model(model, peft_config)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/mapping.py", line 136, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 1094, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 129, in __init__
    self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 136, in __init__
    super().__init__(model, config, adapter_name)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 148, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 325, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, current_key=key)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 220, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 295, in _create_new_module
    new_module = dispatcher(target, adapter_name, lora_config=lora_config, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/bnb.py", line 506, in dispatch_bnb_4bit
    new_module = Linear4bit(target, adapter_name, **fourbit_kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/bnb.py", line 293, in __init__
    self.update_layer(
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 126, in update_layer
    self.dora_init(adapter_name)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 186, in dora_init
    weight = dequantize_bnb_weight(weight, state=quant_state)  # no-op if not bnb
  File "/root/miniconda3/lib/python3.10/site-packages/peft/utils/integrations.py", line 58, in dequantize_bnb_weight
    return bnb.functional.dequantize_4bit(weight.data, weight.quant_state)
  File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1353, in dequantize_4bit
    device = pre_call(A.device)
  File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/functional.py", line 459, in pre_call
    torch.cuda.set_device(device)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 402, in set_device
    device = _get_device_index(device)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/cuda/_utils.py", line 35, in _get_device_index
    raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu

Expected behavior


iseesaw commented Apr 24, 2024

I successfully trained the LLaMA-3-70B model using the script from the official PEFT example: run_peft_qlora_fsdp.sh.

However, I'm still encountering this problem when I set use_dora=True in the code.

iseesaw changed the title from "ValueError: Expected a cuda device, but got: cpu (FSDP + QDoRA in AlignmentBook)" to "ValueError: Expected a cuda device, but got: cpu (FSDP + QDoRA)" on Apr 24, 2024
iseesaw changed the title from "ValueError: Expected a cuda device, but got: cpu (FSDP + QDoRA)" to "[FSDP+QLoRA] ValueError: Expected a cuda device, but got: cpu" on Apr 24, 2024

BenjaminBossan commented Apr 25, 2024

Thanks for reporting. It looks like at initialization time, the model is still on CPU. As initializing DoRA requires us to dequantize the bnb weights, which is not supported on CPU, we see this error. This should hopefully not be that hard to fix on our side. Meanwhile, perhaps you can adjust your scripts so that the base model is sent to GPU before calling get_peft_model and check if that works.

Edit: Honestly not sure how the weights can be on CPU here, maybe some form of offloading? In that case, the problem probably runs deeper. Do you know whether any offloading is going on here?
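As a minimal sketch of the workaround suggested above (sending the base model to GPU before get_peft_model), something along these lines could be tried; the helper name is hypothetical, and this assumes the installed transformers/bitsandbytes versions allow moving a 4-bit model between devices:

import torch
from peft import get_peft_model

def get_peft_model_on_gpu(base_model, peft_config):
    # Ensure the bnb-quantized weights sit on the GPU before DoRA init
    # dequantizes them (dequantize_4bit only runs on CUDA).
    device = torch.device("cuda", torch.cuda.current_device())
    base_model = base_model.to(device)
    return get_peft_model(base_model, peft_config)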


mallorbc commented May 9, 2024

I have the same issue. I can do LoRA/DoRA, DDP LoRA/DoRA, QLoRA/QDoRA, DDP QLoRA/QDoRA, FSDP LoRA/DoRA, and FSDP QLoRA, but FSDP QDoRA does not seem to work.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this issue May 10, 2024
Resolves huggingface#1674

For some users, it is necessary to initialize the model on CPU, even
when using BitsAndBytes, which eventually requires a GPU. Since DoRA
requires dequantizing the BNB weights at initialization, we need to
temporarily move the corresponding model weights to the GPU. After
dequantization, the weights are moved back to the CPU.
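
A hedged sketch of the idea described in this commit message; this is an illustration only, not the actual PEFT change, and it assumes a bitsandbytes release (>= 0.43) whose Params4bit and quant state can be moved between devices:

import bitsandbytes as bnb
import torch

def dequantize_4bit_via_gpu(weight):
    # dequantize_4bit only runs on CUDA, so temporarily move a CPU-resident
    # 4-bit weight (and its quant state) to the GPU, dequantize it there, and
    # move the dense result back to where the weight originally lived.
    original_device = weight.device
    if original_device.type == "cpu":
        weight = weight.to("cuda")
    dequantized = bnb.functional.dequantize_4bit(weight.data, weight.quant_state)
    return dequantized.to(original_device)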
BenjaminBossan commented:

@iseesaw @mallorbc I created a PR (#1724); if you could give it a try, that would be great.

BenjaminBossan added a commit that referenced this issue May 14, 2024
Resolves #1674 (same commit message as above)
mallorbc commented:

This fixed the issue I was having, but when using DoRA/QDoRA with FSDP it errors out:

[rank0]: Traceback (most recent call last):
[rank0]:   File "trl_finetune.py", line 401, in <module>
[rank0]:     trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank0]:     output = super().train(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
[rank0]:     self.model = self.accelerator.prepare(self.model)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1459, in prepare_model
[rank0]:     model = FSDP(model, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
[rank0]:     _auto_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank0]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank0]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank0]:     return wrapper_cls(module, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank0]:     _init_param_handle_from_module(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank0]:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank0]:     handle = FlatParamHandle(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank0]:     self._init_flat_param_and_metadata(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank0]:     ) = self._validate_tensors_to_flatten(params)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
[rank0]:     raise ValueError(
[rank0]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
[rank1]: Traceback (most recent call last):
[rank1]: ... (identical to the rank 0 traceback above)
[rank1]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
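
For context on this dtype error (not necessarily the root cause addressed later): FSDP can only flatten parameters that share a single dtype, which is why the FSDP + QLoRA examples in PEFT set the 4-bit quant storage dtype to match the training dtype. A sketch of that configuration, with illustrative values and assuming bf16 training:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Store the packed 4-bit weights in bf16 so they match the dtype of the
    # non-quantized parameters that FSDP flattens alongside them.
    bnb_4bit_quant_storage=torch.bfloat16,
)

# Loading in bf16 keeps the non-quantized parameters in the same dtype.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)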

mallorbc mentioned this issue May 17, 2024
BenjaminBossan commented:

I just want to let you know that I'm still investigating; this issue is not forgotten :) It's just not that easy to understand what goes on under the hood with FSDP.

BenjaminBossan commented:

Update: DoRA and QDoRA training with FSDP should be fixed by #1806. If you install PEFT from the latest main, it should work. Please also check the PR description for how this was tested. If you give it a try, let me know whether it works.
