Training crashes with Accelerate 0.33.0 #22

rationalism · 2024-08-24T16:49:05Z

After upgrading to the new Accelerate release, 0.33.0, bitsandbytes (BNB) QLoRA training crashes with this stack trace:

loading checkpoint file model-00001-of-00030.safetensors
load params into module <class 'llama_pipe.LlamaDecoderLayerPipe'>
Traceback (most recent call last):
  File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 418, in <module>
    pipeline_model, lora_model, lora_config = load_pipeline_model_with_lora(config, model_type)
  File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 279, in load_pipeline_model_with_lora
    pipeline_model = engine.CustomPipelineModule(
  File "/home/alyssa/lm_fun/qlora-pipe/engine.py", line 274, in __init__
    super().__init__(layers, **kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
    self._build()
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
    module = layer.build()
  File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 75, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/home/alyssa/lm_fun/qlora-pipe/llama_pipe.py", line 113, in __init__
    loader_util.load_state_dict_into_module(self)
  File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 316, in load_state_dict_into_module
    transformers.modeling_utils._load_state_dict_into_meta_model(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 436, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(device)
TypeError: Params4bit.__new__() got an unexpected keyword argument 'original_name'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 418, in <module>
[rank0]:     pipeline_model, lora_model, lora_config = load_pipeline_model_with_lora(config, model_type)
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 279, in load_pipeline_model_with_lora
[rank0]:     pipeline_model = engine.CustomPipelineModule(
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/engine.py", line 274, in __init__
[rank0]:     super().__init__(layers, **kwargs)
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
[rank0]:     self._build()
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
[rank0]:     module = layer.build()
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 75, in build
[rank0]:     return self.typename(*self.module_args, **self.module_kwargs)
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/llama_pipe.py", line 113, in __init__
[rank0]:     loader_util.load_state_dict_into_module(self)
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 316, in load_state_dict_into_module
[rank0]:     transformers.modeling_utils._load_state_dict_into_meta_model(
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
[rank0]:     set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 436, in set_module_tensor_to_device
[rank0]:     new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(device)
[rank0]: TypeError: Params4bit.__new__() got an unexpected keyword argument 'original_name'
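
The last frame of the trace forwards a `**kwargs` dictionary into the parameter class constructor, and the constructor rejects the `original_name` key. A minimal sketch of that failure mode follows, using only the standard library. `ToyParams4bit`, `rebuild_strict`, and `rebuild_filtered` are hypothetical names made up for illustration; this is not the code of bitsandbytes or Accelerate, and the filtering shown is only one possible way to tolerate unknown keys, not necessarily the fix either project adopted.

```python
import inspect


class ToyParams4bit:
    """Hypothetical stand-in for a parameter class with a fixed constructor signature."""

    def __new__(cls, data=None, requires_grad=False, quant_state=None):
        obj = super().__new__(cls)
        obj.data = data
        obj.requires_grad = requires_grad
        obj.quant_state = quant_state
        return obj


def rebuild_strict(param_cls, new_value, old_value, extra):
    # Same call shape as the last frame of the traceback: every key in
    # `extra` is forwarded, whether or not the constructor accepts it.
    return param_cls(new_value, requires_grad=old_value.requires_grad, **extra)


def rebuild_filtered(param_cls, new_value, old_value, extra):
    # Illustrative workaround: forward only keys the constructor declares.
    accepted = inspect.signature(param_cls.__new__).parameters
    kept = {key: value for key, value in extra.items() if key in accepted}
    return param_cls(new_value, requires_grad=old_value.requires_grad, **kept)


old = ToyParams4bit(data=[0.0], requires_grad=False)
# Assumption for the demo: one key the constructor knows, one it does not.
extra = {"quant_state": "nf4", "original_name": "weight"}

strict_error = None
try:
    rebuild_strict(ToyParams4bit, [1.0], old, extra)
except TypeError as err:
    strict_error = str(err)

rebuilt = rebuild_filtered(ToyParams4bit, [1.0], old, extra)
print(strict_error)
print(rebuilt.quant_state)
```

The strict call fails with a `TypeError` that names `original_name`, while the filtered call succeeds and silently drops the unknown key. Dropping keys may hide real incompatibilities, so pinning a known-good version is the safer short-term choice.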

I suspect it's caused by this PR:

huggingface/accelerate#2934

This PR might also be relevant:

huggingface/accelerate#2986

Reverting to Accelerate 0.32.0 resolves the crash. Thank you!
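
For anyone on an older checkout, pinning with `pip install accelerate==0.32.0` matches the revert described above. A startup check along the following lines could flag a newer install before training begins. This is a rough sketch: the helper names are hypothetical, the version parsing is deliberately simplistic, and treating everything above 0.32.0 as untested is an assumption based only on this report.

```python
from importlib import metadata

# Last version reported as working in this issue.
LAST_KNOWN_GOOD = (0, 32, 0)


def parse_version(text):
    # Keep only the leading numeric components, e.g. "0.33.0.dev0" -> (0, 33, 0).
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def newer_than_known_good(version_text):
    # True when the version sorts above the last known-good release.
    return parse_version(version_text) > LAST_KNOWN_GOOD


def check_accelerate():
    # Returns a short status message instead of raising, so it is safe
    # to call even when accelerate is absent.
    try:
        installed = metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return "accelerate is not installed"
    if newer_than_known_good(installed):
        return f"accelerate {installed} is newer than 0.32.0; untested with this checkout"
    return f"accelerate {installed} is at or below 0.32.0"


print(check_accelerate())
```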


tdrussell · 2024-10-13T19:23:40Z

This should be fixed as of the latest commits.

tdrussell closed this as completed Oct 13, 2024
