
[BUG]deepspeed-chat training error on v100 * 8, raise assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() after training of step3 #4194

Closed
iamsile opened this issue Aug 23, 2023 · 23 comments
Labels: bug (Something isn't working), deepspeed-chat (Related to DeepSpeed-Chat)

Comments

@iamsile

iamsile commented Aug 23, 2023

Describe the bug

Hi everybody, I'm training a LLaMA model in step 3 using DeepSpeed-Chat. In version 0.10.1, it raised the following error (see the logs below), so I switched to the branch HeyangQin/fix_issue_3156 (#3156) and copied its code into master to fix it. After that, I found a new bug when training RL.

The full training command:

deepspeed --include localhost:7,6,5,4 \
  /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py \
  --data_output_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_13b_data_output \
  --actor_model_name_or_path /xxxxx/new_model_20230808/pytorch_model.bin \
  --tokenizer_type LLaMATokenizer \
  --llm_pretrained /xxxxx/new_model_20230808/pretrain \
  --tokenizer_name_or_path /xxxxx/new_model_20230808/tokenizer \
  --critic_model_name_or_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/output_zh_13b_mplug_multi_modal_0815/pytorch_model.bin \
  --data_path coco_zh/coco_zh_rm \
  --actor_zero_stage 3 \
  --critic_zero_stage 3 \
  --num_padding_at_beginning 0 \
  --per_device_train_batch_size 1 \
  --per_device_mini_train_batch_size 1 \
  --ppo_epochs 5 \
  --actor_learning_rate 9.65e-6 \
  --critic_learning_rate 5e-6 \
  --gradient_accumulation_steps 1 \
  --deepspeed \
  --actor_lora_dim 1 \
  --actor_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj \
  --critic_lora_dim 1 \
  --critic_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj \
  --offload_reference_model \
  --actor_learning_rate 5e-4 \
  --critic_learning_rate 5e-6 \
  --max_answer_seq_len 512 \
  --max_prompt_seq_len 200 \
  --actor_weight_decay 0.1 \
  --critic_weight_decay 0.1 \
  --actor_gradient_checkpointing \
  --only_optimize_lora \
  --output_dir /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_mplug_13b_model_output_20230815 \
  --offload \
  --print_answers
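
Note that the command above passes --actor_learning_rate twice (9.65e-6, then 5e-4) and --critic_learning_rate twice. The sketch below shows what happens in that situation, assuming the script parses its flags with Python's argparse and the default "store" action (the parser here is a hypothetical stand-in, not the real main.py): the last occurrence silently wins, so the actor would train with 5e-4.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the step 3 argument parser; only the two
    # duplicated flags from the command are modelled here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--actor_learning_rate", type=float, default=9.65e-6)
    parser.add_argument("--critic_learning_rate", type=float, default=5e-6)
    return parser


# Same order as in the reported command: each flag appears twice.
argv = [
    "--actor_learning_rate", "9.65e-6",
    "--critic_learning_rate", "5e-6",
    "--actor_learning_rate", "5e-4",
    "--critic_learning_rate", "5e-6",
]

args = build_parser().parse_args(argv)

# With the default "store" action, a later occurrence overwrites an earlier one.
print("actor lr :", args.actor_learning_rate)
print("critic lr:", args.critic_learning_rate)
```

This is unlikely to be related to the assertion, but it is worth checking that 5e-4 is the intended actor learning rate.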

The training log:

  File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1677, in forward
    text_embeds = self.get_input_embeddings()(text_tokens_) # Temporally Embedding
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 1489, 'status': 'INFLIGHT', 'numel': 412180480, 'ds_numel': 412180480, 'shape': (80504, 5120), 'ds_shape': (80504, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {543}, 'ds_tensor.shape': torch.Size([103045120])}
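
The numbers in this summary can be sanity-checked with a little arithmetic. The sketch below assumes ZeRO stage 3 gives each rank an equal, padded slice of every parameter (ceiling division by the number of ranks); the helper name is made up for illustration. With the 4 GPUs from `--include localhost:7,6,5,4`, the (80504, 5120) embedding should yield exactly the local `ds_tensor.shape` printed in the error.

```python
def shard_numel(full_numel: int, world_size: int) -> int:
    """Elements held per rank, assuming an even split with padding (ceiling division)."""
    return -(-full_numel // world_size)


# Values copied from the AssertionError summary for param id 1489.
ds_shape = (80504, 5120)
reported_ds_numel = 412180480
reported_local_shard = 103045120

full_numel = ds_shape[0] * ds_shape[1]
local_shard = shard_numel(full_numel, world_size=4)  # 4 GPUs in the launch command

print("full numel :", full_numel, "matches report:", full_numel == reported_ds_numel)
print("local shard:", local_shard, "matches report:", local_shard == reported_local_shard)
```

So the partitioning metadata itself looks consistent; the parameter is fully gathered in shape (`numel` equals `ds_numel`) but its status is still INFLIGHT when the forward hook asserts it to be AVAILABLE.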


ds_report output
[2023-08-23 03:17:51,889] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.10.2+8fb111c0-py3.10.egg/deepspeed']
deepspeed info ................... 0.10.2+8fb111c0, 8fb111c, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 251.53 GB


System info (please complete the following information):
OS: Ubuntu 20.04
GPU: 8 x V100 32G
pytorch:1.7.1
deepspeed: 0.10.1

@iamsile iamsile added bug Something isn't working training labels Aug 23, 2023
@mrwyattii mrwyattii added deepspeed-chat Related to DeepSpeed-Chat and removed training labels Aug 23, 2023
@GasolSun36

Hi, I'm facing the same error. Any solutions?

@xwjiang2010

Same..

@iamsile
Author

iamsile commented Aug 29, 2023

Update report:
In my latest test, I found that copying HeyangQin/fix_issue_3156 into master does not work. In RL training, it only works at step 0; after that it always crashes. This is the full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16)
reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16)
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0
Average reward score: 0.2470703125

Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302
[2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303
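
The summary for param id 918 reports `ds_numel` 5120 with a local `ds_tensor` of 1707 elements. As a rough cross-check, the sketch below asks which rank counts would produce that shard size, assuming an even, padded split (ceiling division). The function name and the cap of 8 ranks (the 8 V100s mentioned in the issue) are illustrative assumptions.

```python
def candidate_world_sizes(ds_numel: int, local_shard: int, max_ranks: int = 8) -> list:
    """Rank counts for which ceil(ds_numel / ranks) equals the reported local shard."""
    return [
        ranks
        for ranks in range(1, max_ranks + 1)
        if -(-ds_numel // ranks) == local_shard
    ]


# Values copied from the AssertionError summary for param id 918.
matches = candidate_world_sizes(ds_numel=5120, local_shard=1707)
print("world sizes consistent with the summary:", matches)
```

Under that assumption only a 3-rank run fits, which agrees with the three ranks in the reward-score lines and the three killed subprocesses. Here `numel` is 0 while the status is INFLIGHT, meaning the layernorm weight had not been gathered locally when the hook fired.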

@GasolSun36

I think this may be related to these two parameters:

--inference_tp_size 1 \
--tp_gather_partition_size 1

I set both parameters to 1, and another error occurs, see #4226.

@iamsile
Author

iamsile commented Aug 29, 2023

> I think this may be related to these two parameters:
>
> --inference_tp_size 1 \
> --tp_gather_partition_size 1
>
> I set both parameters to 1, and another error occurs, see #4226.

I didn't set these parameters, but I hit this bug as well.

@GasolSun36

But the default value is 8 or 4, I think you need to manually set them to 1.

@iamsile
Author

iamsile commented Aug 29, 2023

> But the default value is 8 or 4, I think you need to manually set them to 1.

I used --inference_tp_size 1 and --tp_gather_partition_size 1, but it doesn't work. It crashes with the same error.

@awan-10 awan-10 self-assigned this Aug 29, 2023
@awan-10
Contributor

awan-10 commented Aug 29, 2023

@iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error.

We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first?

https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2

Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

@denizyuret

I have replicated the error on an 8xA100 (80G) setup with DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b.sh. The resulting training.log is here. Setting ZERO_STAGE=2 works without problems.

@vinod-sarvam

That's correct. ZeRO=2 works, but it is not enough if we want to run 70B models.

@GasolSun36

> @iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error.
>
> We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first?
>
> https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2
>
> Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

Hi awan-10,
I tested the new sh file, which works for LLAMA2-7B (actor model) and OPT-350M (critic model) without any problems.
However, when I test Baichuan-7B (actor model) with OPT-350M, it causes:

assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 227, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {389}, 'ds_tensor.shape': torch.Size([0])}

Does this framework not support Chinese LLMs such as Baichuan? I'm confused.

@iamsile
Author

iamsile commented Aug 30, 2023

> I have replicated the error on a 8xA100(80G) setup with DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b.sh. The resulting training.log is here. Setting ZERO_STAGE=2 works without problems.

Hi, everyone. I tested run_llama2_7b.sh on step1 without any problems, and I need more time to test step2 and step3.

@aksbaih

aksbaih commented Sep 1, 2023

Same issue here. Trying to replicate https://github.com/ray-project/ray/tree/workspace_templates_2.6.1/doc/source/templates/04_finetuning_llms_with_deepspeed
for llama 7B.

Setup: Cluster of 16 Nvidia L4 running docker anyscale/ray:2.6.1-py39-cu118

Deepspeed config:

{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": false,
        "round_robin_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Error stack trace:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=1771, ip=10.128.0.28, actor_id=728100beae767cf33e536eb805000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fbb764c3c70>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/ray/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py", line 320, in training_function
    outputs = model(**batch)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
    logits = self.lm_head(hidden_states)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}

Stage 2 works fine, only fails on stage 3.
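
For anyone who needs a temporary fallback while stage 3 is broken, the sketch below derives a stage 2 variant from a stage 3 config like the one above. It is only an illustration: the helper name is made up, the sample dict is a trimmed copy of the posted config, and it assumes that `offload_param` and the `stage3_*` keys are meaningful only at stage 3, so they are dropped.

```python
import copy


def to_stage2(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config switched from ZeRO stage 3 to stage 2."""
    cfg = copy.deepcopy(ds_config)  # leave the caller's config untouched
    zero = cfg.setdefault("zero_optimization", {})
    zero["stage"] = 2
    # Parameter offload applies to stage 3 only, so remove it for stage 2.
    zero.pop("offload_param", None)
    # Drop the stage3_* tuning knobs, which have no effect at stage 2.
    for key in [k for k in zero if k.startswith("stage3_")]:
        del zero[key]
    return cfg


# Trimmed copy of the config posted above.
stage3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_max_live_parameters": 1e9,
    },
    "train_micro_batch_size_per_gpu": "auto",
}

stage2_config = to_stage2(stage3_config)
print(stage2_config["zero_optimization"])
```

As noted earlier in the thread, stage 2 does not partition the parameters themselves, so this is not a substitute for stage 3 on very large models.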

I tried a few things and none worked:

  • Minimized memory usage by using a batch size of 1.
  • Unpinned CPU memory.
  • Logged CPU memory usage just before the error, a few layers in (it was only 20%).
  • Used 16x Nvidia A10s on AWS.
  • Reduced the size of all buffers in the DeepSpeed ZeRO config by several orders of magnitude.

None of these worked. Same error.

Thanks for the support, we are excited for this technology :)

@denizyuret

I have discovered that the issue started with transformers-4.32.0; transformers-4.31.0 works fine.
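
A quick way to check whether an environment is on the affected side of that boundary is sketched below. The helper names are made up, the version parsing is deliberately simple (first three numeric components only), and whether releases after 4.32.0 are still affected is not established in this thread, so treat the result as a hint only.

```python
import itertools
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Parse the first three numeric components, e.g. '4.32.0.dev0' -> (4, 32, 0)."""
    numbers = []
    for part in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, part))
        numbers.append(int(digits or 0))
    return tuple(numbers)


def at_or_after_reported_regression(version_text: str) -> bool:
    """True if the version is 4.32.0 or newer, the first release reported as failing."""
    return parse_version(version_text) >= (4, 32, 0)


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed in this environment")
elif at_or_after_reported_regression(installed):
    print(f"transformers {installed}: at or after the reported regression (4.32.0)")
else:
    print(f"transformers {installed}: older than 4.32.0, reported as working")
```

If the check reports a version at or after 4.32.0, pinning with `pip install transformers==4.31.0` reproduces the configuration reported as working above.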

@iamsile
Author

iamsile commented Sep 4, 2023

> @iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error.
>
> We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first?
>
> https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2
>
> Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

@awan-10 Hi, I have finished all three steps. Step 1 and step 2 worked well, but step 3 still failed.

The step 3 script:

deepspeed --include localhost:7,6,5,4,3,2 --master_port 12346 main.py \
  --data_path Dahoas/rm-static \
  --data_split 2,4,4 \
  --actor_model_name_or_path /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output \
  --critic_model_name_or_path /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/model_output \
  --num_padding_at_beginning 1 \
  --per_device_generation_batch_size 1 \
  --per_device_training_batch_size 1 \
  --generation_batches 1 \
  --ppo_epochs 1 \
  --max_answer_seq_len 256 \
  --max_prompt_seq_len 256 \
  --actor_learning_rate 9.65e-6 \
  --critic_learning_rate 5e-6 \
  --actor_weight_decay 0.1 \
  --critic_weight_decay 0.1 \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine \
  --gradient_accumulation_steps 1 \
  --actor_gradient_checkpointing \
  --critic_gradient_checkpointing \
  --offload_reference_model \
  --disable_actor_dropout \
  --num_warmup_steps 100 \
  --deepspeed \
  --seed 1234 \
  --actor_zero_stage 3 \
  --critic_zero_stage 3 \
  --actor_lora_dim 64 \
  --critic_lora_dim 64 \
  --critic_lora_module_name "layers." \
  --actor_lora_module_name "layers." \
  --output_dir /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/model_output \
  --offload \
  --offload_reference_model

The step 3 training log looks like this:

[end] Initialized Actor Model [end] (duration: 118.33s)*
*************************[start] Initializing Ref Model [start] **************************
[2023-09-04 03:51:48,694] [INFO] [partition_parameters.py:342:exit] finished initializing model - num_params = 582, num_elems = 13.48B
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2023-09-04 03:51:57,942] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown
[2023-09-04 03:51:57,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-09-04 03:51:57,954] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2023-09-04 03:51:58,181] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-09-04 03:51:58,182] [INFO] [utils.py:804:see_memory_usage] MA 1.06 GB Max_MA 1.85 GB CA 2.64 GB Max_CA 3 GB
[2023-09-04 03:51:58,182] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.37 GB, percent = 62.1%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-09-04 03:51:58,414] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-09-04 03:51:58,414] [INFO] [utils.py:804:see_memory_usage] MA 1.06 GB Max_MA 1.06 GB CA 2.64 GB Max_CA 3 GB
[2023-09-04 03:51:58,414] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.37 GB, percent = 62.1%
[2023-09-04 03:51:58,415] [INFO] [config.py:963:print] DeepSpeedEngine configuration:
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] amp_enabled .................. False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] amp_params ................... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] bfloat16_enabled ............. False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f677c3cc700>
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] communication_data_type ...... None
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] curriculum_params_legacy ..... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] data_efficiency_enabled ...... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] dataloader_drop_last ......... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] disable_allgather ............ False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] dump_state ................... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_enabled ........... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_verbose ........... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] elasticity_enabled ........... False
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-09-04 03:51:58,416] [INFO] [config.py:967:print] fp16_auto_cast ............... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] fp16_enabled ................. True
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] global_rank .................. 0
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] grad_accum_dtype ............. None
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] gradient_clipping ............ 1.0
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] initial_dynamic_scale ........ 65536
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] load_universal_checkpoint .... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] loss_scale ................... 0
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] memory_breakdown ............. False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] mics_hierarchial_params_gather False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] mics_shard_size .............. -1
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] optimizer_name ............... None
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] optimizer_params ............. None
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] pld_enabled .................. False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] pld_params ................... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] prescale_gradients ........... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] scheduler_name ............... None
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] scheduler_params ............. None
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] sparse_attention ............. None
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] steps_per_print .............. 10
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] train_batch_size ............. 6
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 1
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] use_node_local_storage ....... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] wall_clock_breakdown ......... False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] world_size ................... 6
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_allow_untested_optimizer False
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_enabled ................. True
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True
[2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_optimization_stage ...... 3
[2023-09-04 03:51:58,417] [INFO] [config.py:953:print_user_config] json = {
"train_batch_size": 6,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 10,
"zero_optimization": {
"stage": 3,
"stage3_param_persistence_threshold": 1.000000e+04,
"offload_param": {
"device": "cpu"
},
"memory_efficient_linear": false
},
"fp16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
[end] Initialized Ref Model [end] (duration: 23.50s)
************************[start] Initializing Critic Model [start] ************************
[2023-09-04 03:51:59,800] [INFO] [partition_parameters.py:342:exit] finished initializing model - num_params = 872, num_elems = 20.08B

Creating model from_config took 1.394550085067749 seconds
torch.load took 14.609045505523682 seconds
Loading model state dict took 5.597141265869141 seconds
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8365261554718018 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.3674826622009277 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.3644497394561768 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.3637731075286865 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[2023-09-04 03:53:17,190] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.3590688705444336 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[2023-09-04 03:53:17,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-09-04 03:53:17,241] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-09-04 03:53:17,241] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.372196912765503 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[2023-09-04 03:53:17,281] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-09-04 03:53:17,281] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-09-04 03:53:17,282] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2023-09-04 03:53:17,282] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-09-04 03:53:17,584] [INFO] [utils.py:803:see_memory_usage] Stage 3 initialize beginning
[2023-09-04 03:53:17,585] [INFO] [utils.py:804:see_memory_usage] MA 3.48 GB Max_MA 3.73 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:17,585] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.73 GB, percent = 62.2%
[2023-09-04 03:53:17,588] [INFO] [stage3.py:126:init] Reduce bucket size 500,000,000
[2023-09-04 03:53:17,588] [INFO] [stage3.py:127:init] Prefetch bucket size 30000000
[2023-09-04 03:53:17,826] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-09-04 03:53:17,826] [INFO] [utils.py:804:see_memory_usage] MA 3.48 GB Max_MA 3.48 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:17,826] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.73 GB, percent = 62.2%
Parameter Offload: Total persistent parameters: 270336 in 66 params
[2023-09-04 03:53:18,200] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-09-04 03:53:18,201] [INFO] [utils.py:804:see_memory_usage] MA 3.22 GB Max_MA 3.48 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:18,201] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.73 GB, percent = 62.2%
[2023-09-04 03:53:18,440] [INFO] [utils.py:803:see_memory_usage] Before creating fp16 partitions
[2023-09-04 03:53:18,441] [INFO] [utils.py:804:see_memory_usage] MA 3.22 GB Max_MA 3.22 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:18,441] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.74 GB, percent = 62.2%
[2023-09-04 03:53:19,357] [INFO] [utils.py:803:see_memory_usage] After creating fp16 partitions: 2
[2023-09-04 03:53:19,358] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.22 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:19,358] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 314.41 GB, percent = 62.5%
[2023-09-04 03:53:19,585] [INFO] [utils.py:803:see_memory_usage] Before creating fp32 partitions
[2023-09-04 03:53:19,586] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:19,586] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 314.41 GB, percent = 62.5%
[2023-09-04 03:53:19,839] [INFO] [utils.py:803:see_memory_usage] After creating fp32 partitions
[2023-09-04 03:53:19,839] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:19,839] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 314.59 GB, percent = 62.5%
[2023-09-04 03:53:20,334] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states
[2023-09-04 03:53:20,335] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:20,335] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 317.23 GB, percent = 63.1%
[2023-09-04 03:53:21,077] [INFO] [utils.py:803:see_memory_usage] After initializing optimizer states
[2023-09-04 03:53:21,078] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB
[2023-09-04 03:53:21,078] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 317.91 GB, percent = 63.2%
[2023-09-04 03:53:21,078] [INFO] [stage3.py:445:_setup_for_real_optimizer] optimizer state initialized
[2023-09-04 03:53:21,687] [INFO] [utils.py:803:see_memory_usage] After initializing ZeRO optimizer
[2023-09-04 03:53:21,687] [INFO] [utils.py:804:see_memory_usage] MA 4.07 GB Max_MA 4.55 GB CA 15.21 GB Max_CA 15 GB
[2023-09-04 03:53:21,687] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 318.47 GB, percent = 63.3%
[2023-09-04 03:53:21,687] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2023-09-04 03:53:21,688] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-09-04 03:53:21,688] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f6771170c10>
[2023-09-04 03:53:21,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-09-04 03:53:21,689] [INFO] [config.py:963:print] DeepSpeedEngine configuration:
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] amp_enabled .................. False
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] amp_params ................... False
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] bfloat16_enabled ............. True
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False
[2023-09-04 03:53:21,689] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f677129e290>
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] communication_data_type ...... None
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] curriculum_params_legacy ..... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] data_efficiency_enabled ...... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] dataloader_drop_last ......... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] disable_allgather ............ False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] dump_state ................... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_enabled ........... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_verbose ........... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] elasticity_enabled ........... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] fp16_auto_cast ............... None
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] fp16_enabled ................. False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] global_rank .................. 0
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] grad_accum_dtype ............. None
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] gradient_clipping ............ 1.0
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] initial_dynamic_scale ........ 1
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] load_universal_checkpoint .... False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] loss_scale ................... 1.0
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] memory_breakdown ............. False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] mics_hierarchial_params_gather False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] mics_shard_size .............. -1
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step3_tensorboard/ds_tensorboard_logs/', job_name='step3_critic_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-09-04 03:53:21,690] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] optimizer_name ............... None
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] optimizer_params ............. None
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] pld_enabled .................. False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] pld_params ................... False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] prescale_gradients ........... False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] scheduler_name ............... None
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] scheduler_params ............. None
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] sparse_attention ............. None
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] steps_per_print .............. 10
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] train_batch_size ............. 6
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 1
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] use_node_local_storage ....... False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] wall_clock_breakdown ......... False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] world_size ................... 6
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_allow_untested_optimizer False
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_enabled ................. True
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True
[2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_optimization_stage ...... 3
[2023-09-04 03:53:21,691] [INFO] [config.py:953:print_user_config] json = {
"train_batch_size": 6,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 10,
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu"
},
"offload_optimizer": {
"device": "cpu"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 3.000000e+07,
"memory_efficient_linear": false
},
"bf16": {
"enabled": true,
"loss_scale_window": 100
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false,
"hybrid_engine": {
"enabled": false,
"max_out_tokens": 512,
"inference_tp_size": 1,
"release_inference_cache": false,
"pin_parameters": true,
"tp_gather_partition_size": 8
},
"tensorboard": {
"enabled": false,
"output_path": "step3_tensorboard/ds_tensorboard_logs/",
"job_name": "step3_critic_tensorboard"
}
}
[end] Initialized Critic Model [end] (duration: 83.27s)
***********************[start] Initializing Reward Model [start] ************************
[2023-09-04 03:53:22,920] [INFO] [partition_parameters.py:342:__exit__] finished initializing model - num_params = 1162, num_elems = 26.69B
Creating model from_config took 1.2404227256774902 seconds
torch.load took 8.468313455581665 seconds
Loading model state dict took 7.79913854598999 seconds
[2023-09-04 03:53:41,075] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown
[2023-09-04 03:53:41,091] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-09-04 03:53:41,093] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2023-09-04 03:53:41,403] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-09-04 03:53:41,403] [INFO] [utils.py:804:see_memory_usage] MA 6.19 GB Max_MA 6.74 GB CA 17.56 GB Max_CA 18 GB
[2023-09-04 03:53:41,403] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 318.4 GB, percent = 63.3%
Parameter Offload: Total persistent parameters: 270336 in 66 params
[2023-09-04 03:53:41,661] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-09-04 03:53:41,662] [INFO] [utils.py:804:see_memory_usage] MA 6.19 GB Max_MA 6.19 GB CA 17.56 GB Max_CA 18 GB
[2023-09-04 03:53:41,662] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 318.41 GB, percent = 63.3%
[2023-09-04 03:53:41,663] [INFO] [config.py:963:print] DeepSpeedEngine configuration:
[2023-09-04 03:53:41,663] [INFO] [config.py:967:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-09-04 03:53:41,663] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-09-04 03:53:41,663] [INFO] [config.py:967:print] amp_enabled .................. False
[2023-09-04 03:53:41,663] [INFO] [config.py:967:print] amp_params ................... False
[2023-09-04 03:53:41,663] [INFO] [config.py:967:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] bfloat16_enabled ............. False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f6770b9b940>
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] communication_data_type ...... None
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] curriculum_params_legacy ..... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] data_efficiency_enabled ...... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] dataloader_drop_last ......... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] disable_allgather ............ False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] dump_state ................... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_enabled ........... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_verbose ........... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] elasticity_enabled ........... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] fp16_auto_cast ............... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] fp16_enabled ................. True
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] global_rank .................. 0
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] grad_accum_dtype ............. None
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] gradient_clipping ............ 1.0
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] initial_dynamic_scale ........ 65536
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] load_universal_checkpoint .... False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] loss_scale ................... 0
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] memory_breakdown ............. False
[2023-09-04 03:53:41,664] [INFO] [config.py:967:print] mics_hierarchial_params_gather False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] mics_shard_size .............. -1
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] optimizer_name ............... None
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] optimizer_params ............. None
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] pld_enabled .................. False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] pld_params ................... False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] prescale_gradients ........... False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] scheduler_name ............... None
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] scheduler_params ............. None
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] sparse_attention ............. None
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] steps_per_print .............. 10
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] train_batch_size ............. 6
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 1
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] use_node_local_storage ....... False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] wall_clock_breakdown ......... False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] world_size ................... 6
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_allow_untested_optimizer False
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_enabled ................. True
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True
[2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_optimization_stage ...... 3
[2023-09-04 03:53:41,665] [INFO] [config.py:953:print_user_config] json = {
"train_batch_size": 6,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 10,
"zero_optimization": {
"stage": 3,
"stage3_param_persistence_threshold": 1.000000e+04,
"offload_param": {
"device": "cpu"
},
"memory_efficient_linear": false
},
"fp16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
[end] Initialized Reward Model [end] (duration: 19.97s)

***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 5084
/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
Traceback (most recent call last):
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
Traceback (most recent call last):
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
Traceback (most recent call last):
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
main()
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
main()main()

  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main

File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
main()Traceback (most recent call last):

File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
Traceback (most recent call last):
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
out = trainer.generate_experience(batch_prompt['prompt'],
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience
out = trainer.generate_experience(batch_prompt['prompt'],out = trainer.generate_experience(batch_prompt['prompt'],

  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience

File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience
out = trainer.generate_experience(batch_prompt['prompt'],
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience
main()
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
main()
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
out = trainer.generate_experience(batch_prompt['prompt'],
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience
out = trainer.generate_experience(batch_prompt['prompt'],
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience
values = self.critic_model.forward_value( values = self.critic_model.forward_value( values = self.critic_model.forward_value(values = self.critic_model.forward_value(
values = self.critic_model.forward_value(
values = self.critic_model.forward_value(

File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value

File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value
File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value

File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value
  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value
    transformer_outputs = self.rwtranrsformer(
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self._all_gather_params(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in _all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f697870bd80>
[2023-09-04 03:59:24,656] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53804
[2023-09-04 03:59:24,670] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53805
[2023-09-04 03:59:25,169] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53806
[2023-09-04 03:59:25,176] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53807
[2023-09-04 03:59:25,176] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53808
[2023-09-04 03:59:25,182] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53809
[2023-09-04 03:59:25,186] [ERROR] [launch.py:321:sigkill_handler] ['/xxxxxx/env/mplug_owl/bin/python', '-u', 'main.py', '--local_rank=5', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output', '--critic_model_name_or_path', '/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/model_output', '--num_padding_at_beginning', '1', '--per_device_generation_batch_size', '1', '--per_device_training_batch_size', '1', '--generation_batches', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '9.65e-6', '--critic_learning_rate', '5e-6', '--actor_weight_decay', '0.1', '--critic_weight_decay', '0.1', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--offload_reference_model', '--disable_actor_dropout', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '3', '--critic_zero_stage', '3', '--actor_lora_dim', '64', '--critic_lora_dim', '64', '--critic_lora_module_name', 'layers.', '--actor_lora_module_name', 'layers.', '--output_dir', '/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/model_output', '--offload', '--offload_reference_model'] exits with return code = 1
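
The last frames of the traceback suggest that the parameters gathered for one module did not all share a single dtype. The following is a minimal sketch of that kind of check in plain Python, not DeepSpeed's actual code: `get_only_unique_item_sketch`, `group_names_by_dtype`, and the parameter names and dtypes are hypothetical and chosen only for illustration. Whether mixed LoRA and base-weight dtypes are the cause in this run is an assumption. The message in the log shows a generator object instead of the offending values, probably because the helper was handed a generator; the sketch prints the distinct values instead.

```python
from collections import defaultdict


def get_only_unique_item_sketch(items):
    """Return the single distinct value in items, or raise if there is not exactly one."""
    unique = set(items)
    if len(unique) != 1:
        raise RuntimeError(
            f"expected there to be only one unique element in {sorted(unique)}"
        )
    (only,) = unique
    return only


def group_names_by_dtype(named_dtypes):
    """Map each dtype to the names of the parameters that use it."""
    groups = defaultdict(list)
    for name, dtype in named_dtypes:
        groups[dtype].append(name)
    return dict(groups)


# Hypothetical parameter names and dtypes, for illustration only.
params = [
    ("q_proj.weight", "float16"),
    ("q_proj.lora_A.weight", "float32"),
    ("q_proj.lora_B.weight", "float32"),
]

groups = group_names_by_dtype(params)
try:
    get_only_unique_item_sketch(dtype for _, dtype in params)
    mixed = False
except RuntimeError as err:
    mixed = True
    print(err)
    print(groups)  # shows which parameters fall under which dtype
```

Grouping names by dtype, as `group_names_by_dtype` does here, is one way to find which parameters differ from the rest before handing the model to ZeRO stage 3.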

This is my pip list:
absl-py 1.4.0
accelerate 0.20.3
aiodns 3.0.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 4.2.2
anyio 3.6.2
apex 0.1
appdirs 1.4.4
APScheduler 3.10.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
blinker 1.6.2
brotlipy 0.7.0
cachetools 4.2.4
cchardet 2.1.7
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 2.0.4
cityhash 0.2.4.post11
click 8.1.3
colorama 0.4.6
contourpy 1.0.7
cpm-kernels 1.0.11
cryptography 39.0.1
cycler 0.11.0
datasets 2.12.0
dbpool 1.2.1
decorator 5.1.1
deepspeed 0.10.2
dill 0.3.6
docker-pycreds 0.4.0
dotmap 1.3.30
dsc-auth 0.1.16
einops 0.6.1
entrypoints 0.4
executing 1.2.0
fastapi 0.95.1
ffmpy 0.3.0
filelock 3.12.0
fire 0.5.0
Flask 2.3.2
fonttools 4.39.3
frozenlist 1.3.3
fsspec 2023.5.0
gitdb 4.0.10
GitPython 3.1.31
google-api-core 2.11.0
google-auth 2.17.3
google-auth-oauthlib 0.4.6
google-cloud-speech 2.19.0
googleapis-common-protos 1.59.0
gradio 3.28.3
gradio_client 0.2.0
grpcio 1.54.2
grpcio-reflection 1.48.2
grpcio-status 1.54.2
h11 0.14.0
h5py 3.8.0
hiredis 2.2.3
hjson 3.1.0
httpcore 0.17.0
httpx 0.24.0
huggingface-hub 0.14.1
icecream 2.1.3
icetk 0.0.7
idna 3.4
infra-component 1.3.7
infra-framework 1.16.9
infra-kconf 1.0.6
infra-kess 1.0.6
infra-keycenter 1.0.1
infra-storage 1.2.3
ipython 8.13.2
itsdangerous 2.1.2
jedi 0.18.2
Jinja2 3.1.2
joblib 1.2.0
jsonschema 4.17.3
kazoo 2.9.0
kiwisolver 1.4.4
linkify-it-py 2.0.2
lxml 4.9.2
lz4 3.1.10
Markdown 3.4.3
markdown-it-py 2.2.0
markdown2 2.4.8
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
multidict 6.0.4
multiprocess 0.70.14
munch 2.5.0
mysql-connector-python 8.0.31
ninja 1.11.1
numpy 1.24.3
oauthlib 3.2.2
openai 0.27.8
opencv-python 4.7.0.72
orjson 3.8.12
packaging 23.1
pandas 2.0.1
parso 0.8.3
pathtools 0.1.2
peft 0.3.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 23.0.1
pkgconfig 1.5.5
prettytable 2.5.0
prompt-toolkit 3.0.38
proto-plus 1.22.2
protobuf 3.19.6
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 12.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycares 4.3.0
pycparser 2.21
pycryptodome 3.18.0
pydantic 1.10.7
pydub 0.25.1
Pygments 2.15.1
pygtrans 1.5.2
pyOpenSSL 23.0.0
pyparsing 3.0.9
pyrsistent 0.19.3
pysmhasher 0.2.5
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.6
python-snappy 0.6.1
pytz 2021.3
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0
redis 4.5.5
regex 2023.5.5
requests 2.29.0
requests-oauthlib 1.3.1
responses 0.18.0
rsa 4.9
ruamel.yaml 0.17.24
ruamel.yaml.clib 0.2.7
safetensors 0.3.1
sconf 0.2.5
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.24.0
setproctitle 1.3.2
setuptools 66.0.0
setuptools-scm 7.1.0
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
sqlparse 0.4.4
stack-data 0.6.2
starlette 0.26.1
SwissArmyTransformer 0.3.6
tabulate 0.9.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6
termcolor 2.3.0
timm 0.9.2
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.0
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.31.0
typing_extensions 4.5.0
tzdata 2023.3
tzlocal 4.3
uc-micro-py 1.0.2
unpaddedbase64 2.1.0
urllib3 1.26.15
uvicorn 0.22.0
wandb 0.15.3
wcwidth 0.2.6
websockets 11.0.3
Werkzeug 2.3.3
wheel 0.38.4
xmltodict 0.12.0
xxhash 3.2.0
yarl 1.9.2

This is my ds_report:
[2023-09-04 06:02:21,219] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 251.53 GB

zhouchuanCN commented Sep 4, 2023

I have discovered that the issue started at transformers-4.32.0 and transformers-4.31.0 works fine.

I tried transformers-4.31.0 and it solved my problem. Many thanks to @denizyuret.

Contributor

awan-10 commented Sep 8, 2023

Hi all, thank you for trying out various things. We are aware of this new problem with ZeRO Stage 3 and HF transformers > 4.31.0. Closing this issue, as the downgrade resolves the problem. Please reopen if you see a new error. We also strongly encourage you to open a new issue if you still see errors after downgrading to transformers 4.31.0.
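
To check whether an environment is on a release newer than the one reported as working in this thread, a small version comparison can help. This is only a sketch: `parse_release` and `newer_than_known_good` are hypothetical helper names, the `4.31.0` threshold comes from the comments above, and later comments report the error on 4.31.0 as well, so a passing check does not rule the problem out.

```python
import re
from importlib import metadata


def parse_release(version, width=3):
    """Turn the leading numeric part of '4.32.0.dev0' into (4, 32, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (width - len(parts))  # pad '4.31' to (4, 31, 0)
    return tuple(parts)


def newer_than_known_good(installed, known_good="4.31.0"):
    """True if the installed release is newer than the known-good one."""
    return parse_release(installed) > parse_release(known_good)


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
else:
    if newer_than_known_good(installed):
        print(f"transformers {installed} is newer than 4.31.0")
    else:
        print(f"transformers {installed} is not newer than 4.31.0")
```

If the check reports a newer release, pinning with `pip install transformers==4.31.0` is the downgrade described above.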

awan-10 closed this as completed on Sep 8, 2023
@andrew-zm-ml

I am seeing this issue even with transformers==4.31.0.

moshe-mishan commented Sep 27, 2023

I am also seeing this issue with either deepspeed 0.10.1 or 0.10.3 and transformers==4.31.0.
I run my model using pytorch-lightning==2.0.9.

torch==2.0.1+cu118
cuda==11.8

@Wang-Xiaodong1899

Same here. I also hit this issue even with transformers==4.31.0, using deepspeed==0.10.3 and torch==2.0.1. Any advice?

@alex-athanassakos

Also seeing this with transformers==4.31.0

AAnminy commented Jan 14, 2025

I resolved a similar issue by changing ZeRO stage 3 to stage 2.
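
For anyone trying the same workaround, the stage is set by the `zero_optimization.stage` key of the DeepSpeed config. In the DeepSpeed-Chat step 3 command shown earlier in this thread, the flags `--actor_zero_stage` and `--critic_zero_stage` control it. The sketch below only edits a config dictionary: `with_zero_stage` is a hypothetical helper and the base config is a made-up minimal example, not the configuration used in this issue.

```python
import copy


def with_zero_stage(ds_config, stage):
    """Return a copy of a DeepSpeed config dict with the ZeRO stage replaced."""
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unsupported ZeRO stage: {stage}")
    updated = copy.deepcopy(ds_config)  # leave the caller's config untouched
    updated.setdefault("zero_optimization", {})["stage"] = stage
    return updated


# Made-up minimal config, for illustration only.
base_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
}

stage2_config = with_zero_stage(base_config, 2)
print(stage2_config["zero_optimization"])
```

Stage 2 partitions optimizer state and gradients but not the parameters themselves, so each GPU holds a full copy of the weights. A model that only fits under stage 3 may run out of memory after this change.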

@Yonglin5170

If you're using PEFT, try updating to peft==0.9.0; it fixed the error for me.
