[BUG] deepspeed-chat training error on V100 * 8: raises assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() during step 3 training #4194
Comments
Hi, I'm facing the same error. Any solutions?
Same here.
Update: reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16), followed by: Traceback (most recent call last):
I think this may be related to these two parameters:
I set both parameters to 1, and another error occurs; see #4226
I didn't set this parameter, but I still hit this bug.
But the default value is 8 or 4; I think you need to manually set them to 1.
I used --inference_tp_size 1 and --tp_gather_partition_size 1, but it doesn't work; it crashes with the same error.
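For reference, a minimal sketch of how those two flags would be passed to the step 3 launcher. The GPU list, script location, and model paths below are placeholders rather than values from this thread; only the TP flags and the ZeRO-stage flags come from the commands discussed here.

```bash
# Hypothetical step 3 launch with the hybrid-engine tensor-parallel settings forced to 1.
# Paths and the GPU list are placeholders; adjust them to your own setup.
deepspeed --include localhost:0,1,2,3 main.py \
  --actor_model_name_or_path /path/to/step1_model \
  --critic_model_name_or_path /path/to/step2_model \
  --actor_zero_stage 3 --critic_zero_stage 3 \
  --inference_tp_size 1 \
  --tp_gather_partition_size 1 \
  --deepspeed \
  --output_dir ./step3_output
```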
@iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error. We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first? Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)
I have replicated the error on an 8xA100 (80G) setup with
That's correct. ZeRO stage 2 works, but it is not enough if we want to run 70B models.
Hi awan-10,
Does this framework not support Chinese LLMs like Baichuan? I'm confused.
Hi everyone. I tested run_llama2_7b.sh for step 1 without any problems; I need more time to test step 2 and step 3.
Same issue here. Trying to replicate https://github.com/ray-project/ray/tree/workspace_templates_2.6.1/doc/source/templates/04_finetuning_llms_with_deepspeed
Setup: cluster of 16 Nvidia L4 GPUs running Docker.
DeepSpeed config:
Error stack trace:
Stage 2 works fine; it only fails on stage 3. I tried a few things and none worked:
Thanks for the support; we are excited about this technology :)
I have discovered that the issue started with transformers 4.32.0; transformers 4.31.0 works fine.
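If it helps, a minimal sketch of pinning the last known-good version in a pip-managed environment:

```bash
# Downgrade to the transformers release reported to work in this thread.
pip install "transformers==4.31.0"
```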
@awan-10 Hi, I finished all three steps; step 1 and step 2 worked well, but step 3 failed. The step 3 script: deepspeed --include localhost:7,6,5,4,3,2 --master_port 12346 main.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output --critic_model_name_or_path /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/model_output --num_padding_at_beginning 1 --per_device_generation_batch_size 1 --per_device_training_batch_size 1 --generation_batches 1 --ppo_epochs 1 --max_answer_seq_len 256 --max_prompt_seq_len 256 --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 --actor_weight_decay 0.1 --critic_weight_decay 0.1 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --actor_gradient_checkpointing --critic_gradient_checkpointing --offload_reference_model --disable_actor_dropout --num_warmup_steps 100 --deepspeed --seed 1234 --actor_zero_stage 3 --critic_zero_stage 3 --actor_lora_dim 64 --critic_lora_dim 64 --critic_lora_module_name "layers." --actor_lora_module_name "layers." --output_dir /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/model_output --offload --offload_reference_model
The step 3 training log ends like this: [end] Initialized Actor Model [end] (duration: 118.33s)
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward result = forward_call(*input, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
outputs = run_function(*args) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward result = forward_call(*input, **kwargs)hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function result = hook(self, input) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn result = hook(self, input) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs)ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module self.__all_gather_params(params_to_fetch, forward)self.__all_gather_params(params_to_fetch, forward)ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
self._all_gather_params(nonquantized_params, forward, quantize=self.zero_quantized_weights) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in _all_gather_params
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item RuntimeErrorRuntimeErrorraise RuntimeError(f"expected there to be only one unique element in {items}")RuntimeErrorRuntimeError: : RuntimeError expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7fbc541fbd80> : this is my pip list: this is my ds_report: JIT compiled ops requires ninja op name ................ installed .. compatible DeepSpeed general environment info: |
I tried transformers 4.31.0 and it solved my problem. Many thanks to @denizyuret
Hi all, thank you for trying out various things. We are aware of this new problem with ZeRO Stage 3 and HF transformers > 4.31.0. Closing this issue as the downgrade resolves the problem. Please reopen if you hit a new error. Also, we strongly encourage you to open a new issue if you are still seeing errors despite downgrading to transformers 4.31.0.
I am seeing this issue even with
Also seeing this issue when using either deepspeed 0.10.1 or 0.10.3 with
Same. I also hit the issue even with
Also seeing this with
I resolved a similar issue by changing stage 3 to stage 2.
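In DeepSpeed-Chat's step 3 script the ZeRO stage is selected through the launcher arguments already used in this thread, so the workaround looks roughly like the sketch below (MAIN_ARGS is a hypothetical placeholder for the rest of your original arguments):

```bash
# Workaround sketch: launch step 3 with ZeRO stage 2 instead of stage 3.
# MAIN_ARGS is a placeholder for the rest of your original step 3 arguments.
deepspeed main.py $MAIN_ARGS \
  --actor_zero_stage 2 \
  --critic_zero_stage 2
```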
If you're using PEFT, try updating to peft==0.9.0; it fixed the error for me.
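A minimal sketch of that upgrade in a pip-managed environment:

```bash
# Upgrade PEFT to the version reported above to fix the error.
pip install "peft==0.9.0"
```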
Describe the bug
Hi everybody, I'm training a llama model in step 3 using deepspeed-chat. With version 0.10.1, it raised the following error (see the logs below), so I switched to the HeyangQin/fix_issue_3156 branch (#3156) and copied the code into master to fix it. After that, I found a new bug when training RL.
The full training command:
deepspeed --include localhost:7,6,5,4 /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py --data_output_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_13b_data_output --actor_model_name_or_path /xxxxx/new_model_20230808/pytorch_model.bin --tokenizer_type LLaMATokenizer --llm_pretrained /xxxxx/new_model_20230808/pretrain --tokenizer_name_or_path /xxxxx/new_model_20230808/tokenizer --critic_model_name_or_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/output_zh_13b_mplug_multi_modal_0815/pytorch_model.bin --data_path coco_zh/coco_zh_rm --actor_zero_stage 3 --critic_zero_stage 3 --num_padding_at_beginning 0 --per_device_train_batch_size 1 --per_device_mini_train_batch_size 1 --ppo_epochs 5 --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 --gradient_accumulation_steps 1 --deepspeed --actor_lora_dim 1 --actor_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj --critic_lora_dim 1 --critic_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj --offload_reference_model --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --max_answer_seq_len 512 --max_prompt_seq_len 200 --actor_weight_decay 0.1 --critic_weight_decay 0.1 --actor_gradient_checkpointing --only_optimize_lora --output_dir /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_mplug_13b_model_output_20230815 --offload --print_answers
The training log:
File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1677, in forward
text_embeds = self.get_input_embeddings()(text_tokens_) # Temporally Embedding
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
result = hook(self, input)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 1489, 'status': 'INFLIGHT', 'numel': 412180480, 'ds_numel': 412180480, 'shape': (80504, 5120), 'ds_shape': (80504, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {543}, 'ds_tensor.shape': torch.Size([103045120])}
ds_report output
[2023-08-23 03:17:51,889] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.10.2+8fb111c0-py3.10.egg/deepspeed']
deepspeed info ................... 0.10.2+8fb111c0, 8fb111c, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 251.53 GB
System info:
OS: Ubuntu 20.04
GPUs: 8 * V100 32G
PyTorch: 1.7.1
DeepSpeed: 0.10.1