
[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156

Closed
marsggbo opened this issue Apr 6, 2023 · 32 comments
Labels: bug (Something isn't working), training

@marsggbo commented Apr 6, 2023

Describe the bug

What could be the possible reason for the error below?

  File "/home/xihe/xinhe/distNAS/DeepspeedNAS/train.py", line 200, in train_zero
    engine.backward(loss)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/engine.py", line 1980, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2088, in backward
    self._get_param_coordinator(training=True).reset_step()
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
    raise RuntimeError(
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/cm/extra/Utils/CUDA/11.1.0.0_455.23.05'
DeepSpeed general environment info:
torch install path ............... ['/datasets/xihe/miniconda3/envs/colossal/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+3667758, 3667758, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
marsggbo added the bug and training labels on Apr 6, 2023

@tjruwase (Contributor) commented Apr 6, 2023

@marsggbo, thanks for reporting this issue. Can you please provide more details to enable reproducing this problem?

@marsggbo (Author) commented Apr 7, 2023

I use a model with a dynamic forward pass; below is an example:

class ToyNASModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 512, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv2 = torch.nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv3 = torch.nn.Conv2d(512, 1024, kernel_size=5, stride=1, padding=2, bias=False)
        self.gavg = torch.nn.AdaptiveAvgPool2d((1, 1))
        self.fc = torch.nn.Linear(1024, 1000, bias=False)
        self.count = 0

    def forward(self, x):
        out = self.conv1(x)
        if self.count % 2 == 0:
            out = self.conv2(out)
        else:
            out = self.conv3(out)
        self.count += 1
        out = self.gavg(out).view(out.size(0), -1)
        out = self.fc(out)
        return out
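
Note that conv2 and conv3 alternate between steps, so the set of modules executed changes on every forward. For comparison, a sketch of a trace-stable variant (my own illustration, not an official workaround): it runs both branches every step and selects the result, trading extra compute for a stable module order:

import torch
import torch.nn as nn

class StaticToyNASModel(nn.Module):
    """Variant of ToyNASModel whose forward touches every parameter each
    step, so a ZeRO-3 style prefetch trace sees the same module order."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 512, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv2 = nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv3 = nn.Conv2d(512, 1024, kernel_size=5, stride=1, padding=2, bias=False)
        self.gavg = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(1024, 1000, bias=False)
        self.count = 0

    def forward(self, x):
        out = self.conv1(x)
        # Run both candidate branches every step and blend with a 0/1 weight,
        # so the skipped branch still participates in the traced execution.
        w = 1.0 if self.count % 2 == 0 else 0.0
        out = w * self.conv2(out) + (1.0 - w) * self.conv3(out)
        self.count += 1
        out = self.gavg(out).view(out.size(0), -1)
        out = self.fc(out)
        return out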

The ds_config is as follows:

{
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-08,
            "weight_decay": 3e-07
        }
    },
    "deepspeed": {
        "num_gpus": 1
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    },
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 50000000,
        "reduce_bucket_size": 50000000,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "fp16_master_weights_and_grads": false,
        "loss_scale": 0,
        "loss_scale_window": 500,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 15
    }
}
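
For completeness, a minimal driver sketch (it assumes the JSON above is saved as ds_config.json; the batch shape and loss are illustrative):

import deepspeed
import torch

model = ToyNASModel()
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # the JSON config above
)

for step in range(4):
    x = torch.randn(32, 3, 32, 32, device=engine.device, dtype=torch.half)
    loss = engine(x).float().mean()
    engine.backward(loss)  # this is where "still have inflight params" is raised
    engine.step()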

@tobideusser commented

I have the same issue (using the PyTorch Lightning implementation and deepspeed v0.9.0); any pointers on where this originates from would be splendid.

@tjruwase (Contributor) commented Apr 19, 2023

@marsggbo, thanks for sharing a toy model. However, we need more than this to reproduce the issue. Can you please share the script, code, data, and command line to reproduce it?

@tobideusser, can you share repro details?

@tobideusser commented

I'm sorry for taking so long to reply; I was trying to figure out where that error came from.

To start with, this is basically my validation_step:

def validation_step(self, batch: Dict, batch_idx: int) -> Dict:
    predictions = self.model.generate(
        input_ids=batch["prompt_ids"],
        num_beams=1,
        do_sample=False,
        max_new_tokens=54,
    )
    batch["predictions"] = predictions
    batch = self._detach_tensors_in_dict(batch)
    return batch
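
(_detach_tensors_in_dict is just a small helper; roughly the following, assuming import torch and from typing import Dict at module level:)

def _detach_tensors_in_dict(self, batch: Dict) -> Dict:
    # Detach every tensor value so the returned dict keeps no autograd graph alive.
    return {k: (v.detach() if isinstance(v, torch.Tensor) else v)
            for k, v in batch.items()}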

This is what I found out: my toy example actually runs through, so unfortunately it is not helpful so far.

Could you shed some light on what an "Inflight Parameter" actually is? Is it possible to somehow detach them? I tried simply detaching every tensor in the result after model.generate in the validation step, but this does not change the behaviour.

@justHungryMan commented

I have the same problem using ZeRO stage 3 in PyTorch Lightning with the generate function. Is there a solution?

@HeyangQin (Contributor) commented

Hello @justHungryMan @tobideusser. This issue has been fixed by a collaborative effort with the Lightning team. Please update both deepspeed and lightning to apply the fix. Thank you.
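
(For example, something like pip install -U deepspeed pytorch-lightning, depending on whether you use the pytorch-lightning or lightning package.)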

@marsggbo, if the error is still there even with the latest deepspeed, please feel free to reopen this issue with a reproduction script.

HeyangQin added a commit that referenced this issue May 25, 2023
@ZJXNEFU commented Jun 15, 2023

@HeyangQin I also hit this problem when running DeepSpeed-Chat.

The deepspeed version is 0.9.5.

@HeyangQin (Contributor) commented

Hello @ZJXNEFU. Could you provide a reproduction script for us to investigate this issue? Thank you.

@ZJXNEFU commented Jun 16, 2023

# deepspeed-chat step3 single_node run_rl.sh

#!/bin/bash

ACTOR_MODEL_PATH=/model/path   # a model fine-tuned on the bigcode/starcoder
CRITIC_MODEL_PATH=/reward_model_path  # a reward model trained on bigcode/tiny_starcoder_py  by step 2 scripts
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./rl_output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

GPU_OPTION=" --include localhost:0,1,2,3,4,5,6,7 "

deepspeed --master_port 12346 ${GPU_OPTION} main.py \
   --data_path data/ \
   --data_split 2,4,4 \
   --data_output_path cache/ \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 0 \
   --per_device_train_batch_size 2 \
   --per_device_mini_train_batch_size 2 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 1024 \
   --max_prompt_seq_len 1024 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 1 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT
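
(Since the model paths are hardcoded above, the first two positional arguments of the original DeepSpeed-Chat script are unused here; the ZeRO stages and output directory come from $3, $4 and $5 and default to 3, 3 and ./rl_output, so a bare bash run_rl.sh is equivalent to an explicit bash run_rl.sh _ _ 3 3 ./rl_output.)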

The following is the DeepSpeed config:

    zero_opt_dict = {
        "stage": 3,
        "offload_param": {
            "device": "none",
        },
        "offload_optimizer": {
            "device": "none",
        },
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 1e7,
        "stage3_prefetch_bucket_size": 1e7,
        "memory_efficient_linear": False,
        "reduce_bucket_size": 1e7,
        "allgather_bucket_size": 1e7,
        "stage3_max_reuse_distance": 1e7
    }

    all_config = {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
        "steps_per_print": 10,
        "zero_optimization": zero_opt_dict,
        "fp16": {
            "enabled": True
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "wall_clock_breakdown": False
    }
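
For reference, a minimal sketch of how a dict like all_config is handed to DeepSpeed (my illustration, not DeepSpeed-Chat's exact wiring; the Linear model is a placeholder). The config= argument accepts either a dict or a path to a JSON file:

import deepspeed
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actor/critic model
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=all_config,  # the dict built above
)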


ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/dschat/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu116
deepspeed install path ........... ['/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.5+b692d236, b692d236, master
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

@HeyangQin

@Fhujinwu commented

+1. Please tell me how to address this issue: "RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing"

@dearchill commented

I still see a similar inflight params issue with 0.10.0+f8551b43 when running DeepSpeed-Chat.

@HeyangQin (Contributor) commented

One of our recent fixes, #3819, should have fixed this issue. It is not included in the PyPI release yet, so you need to install deepspeed from source to apply it. Please let us know if you still see this inflight issue even with the fix.
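
(For example: pip install git+https://github.com/microsoft/DeepSpeed.git, or clone the repository and run pip install . from its root.)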

@Fhujinwu commented

I still have a similar inflight params issue with deepspeed 0.10.0+fd1d2c64 when running DeepSpeed-Chat. deepspeedai/DeepSpeedExamples#616

@dearchill commented

Yes, I installed 0.10.0+f8551b43 from source, but the issue remained. For my case, the branch "HeyangQin/fix_issue_3156" solved it. Thanks a lot anyway. You can also try that branch. @Fhujinwu

@Fhujinwu commented Jul 1, 2023

Thank you, the branch "HeyangQin/fix_issue_3156" solved the issue.

@vittorio-perera commented

I came across the same error (RuntimeError: still have inflight params). Checking out and installing the branch HeyangQin/fix_issue_3156 resolved it, but using the current main did not. Unfortunately the branch is at version 0.9. Any chance it will be merged into main so that the problem is fixed in version 0.10? @HeyangQin

Happy to provide any additional info if it can help 😃

@iamsile commented Aug 19, 2023

I came across the same error too (RuntimeError: still have inflight params). When I use deepspeed to train RL on 8x V100, this bug still exists. I also switched to the branch HeyangQin/fix_issue_3156, but it doesn't work. Happy to provide any additional info if it can help.

My deepspeed version is 0.10.1. @HeyangQin

This is the log output:

File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 259, in train_rlhf
self.critic_model.backward(critic_loss)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2031, in backward
self._get_param_coordinator(training=True)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 199, in reset_step
aise RuntimeError(f"still have inflight params "

RuntimeError: RuntimeError: still have inflight params [{'id': 4548, 'status': 'INFLIGHT', 'numel': 412139520, 'ds_numel': 412139520, 'shape': (80496, 5120), 'ds_shape': (80496, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([103034880])}, {'id': 4035, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1638400, 'shape': (0,), 'ds_shape': (1280, 1280), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([409600])}, {'id': 4027, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 328960, 'shape': (0,), 'ds_shape': (257, 1280), 'requires_grad': False...........................................
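
As a debugging aid, a small sketch that lists which parameters are stuck INFLIGHT (it relies only on the ds_status and ds_summary attributes that ZeRO-3 attaches to parameters, as seen in the error above; module is a placeholder for the DeepSpeed-wrapped model):

from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

def report_inflight(module):
    # Print a ds_summary for every ZeRO-3 managed parameter left INFLIGHT;
    # these are the params the coordinator complains about in reset_step().
    for name, p in module.named_parameters():
        if getattr(p, "ds_status", None) == ZeroParamStatus.INFLIGHT:
            print(name, p.ds_summary())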

This is my ds_report:

Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.9.3+f2d600ba-py3.10.egg/deepspeed']
deepspeed info ................... 0.9.3+f2d600ba, f2d600b, HeyangQin/fix_issue_3156
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

@HeyangQin (Contributor) commented

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you.

@vittorio-perera commented

@HeyangQin I'll try to put together a script as soon as possible. Thanks for getting back on this!

@iamsile commented Aug 25, 2023

@HeyangQin This is a full record: #4175. I used HeyangQin/fix_issue_3156 to fix it, but I found this branch hasn't been merged into master. Looking forward to your reply.

@HeyangQin In my latest test, I found that applying HeyangQin/fix_issue_3156 on top of master doesn't work. In RL training, it only works at step 0; after that it always crashes. This is the full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16)
reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16)
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0
Average reward score: 0.2470703125

Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302
[2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303

@jiahuanluo commented

Same here. It works at step 0 and then crashes with:

raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} already in registry


I have fixed it and closed the issue.

@jiahuanluo

@iamsile Hi, could you please tell me how to fix this? Many thanks.

@XenonLamb

@iamsile Has your case been resolved with the latest DeepSpeed version? I observed similar issues recently, typically with a BERT model and some linear layers under ZeRO-3. The training process starts emitting "Invalidate trace cache @ step xx: expected module xx, but got module xxx" warnings, and then, after about 10 steps, it aborts with: RuntimeError: still have inflight params [{'id': 1143, 'status': 'AVAILABLE', 'numel': 1982464, 'ds_numel': 1982464, 'shape': (1408, 1408), 'ds_shape': (1408, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([247808])}, {'id': 1145, 'status': 'AVAILABLE', 'numel': 5767168, 'ds_numel': 5767168, 'shape': (4096, 1408), 'ds_shape': (4096, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([720896])}]
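Those "Invalidate trace cache" warnings suggest the set or order of executed submodules changes between steps, which is exactly what ZeRO-3's prefetcher records and relies on. A minimal sketch (a hypothetical toy model, not the code from this thread) of the kind of data-dependent branching that produces this pattern:

import torch
import torch.nn as nn

class DynamicBranchModel(nn.Module):
    """Toy model whose executed submodules depend on the input values."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.a = nn.Linear(dim, dim)
        self.b = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ZeRO-3 traces the module order of an earlier step and prefetches
        # params for the modules it expects next. If a later step takes the
        # other branch, the prefetched params for the untaken branch can be
        # left INFLIGHT, matching the error above.
        if x.mean() > 0:
            return self.a(x)
        return self.b(x)

If the branching cannot be made static, disabling prefetching (see the config sketch later in this thread) is the mitigation usually suggested.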

@dzagardo

dzagardo commented Apr 29, 2024

I've also got this issue @XenonLamb; mine happens about 230 steps in, at the end of a batch, when I'm returning a dummy loss value. Not sure what the deal is 🤨 (full dump below; a possible workaround is sketched after it)

[rank6]: RuntimeError: still have inflight params [{'id': 6, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1572864, 'shape': (0,), 'ds_shape': (1024, 512, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([196608])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 18, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 28, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 36, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 34, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 37, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 4, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 393216, 'shape': (0,), 'ds_shape': (512, 256, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([49152])}, {'id': 12, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 8, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (1024, 1024, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 19, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 20, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': 
None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 30, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 40, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 38, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 46, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 55, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 77, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 89, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (512, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 58, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 66, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 73, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 82, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 262144, 'shape': (0,), 'ds_shape': (512, 512), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 56, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 70, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 76, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 88, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (1024, 512), 
'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 64, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 85, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (512, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 86, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 262144, 'shape': (0,), 'ds_shape': (512, 512), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 48, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 72, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 54, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 59, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 74, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 41, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 52, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 84, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (1024, 512), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 92, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}]
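For the dummy-loss case: if a rank returns a loss that is detached from the fetched parameters, or skips backward entirely, ZeRO-3's coordinator can be left holding params it prefetched for a backward pass that never touches them. A sketch of a workaround sometimes suggested for this, where `engine` is the DeepSpeed engine and `compute_loss`/`should_skip` are placeholders for the user's own logic:

def training_step(engine, batch, compute_loss, should_skip):
    """Fragment: drive backward/step even on 'skipped' batches."""
    loss = compute_loss(batch)   # placeholder for the real loss computation
    if should_skip(batch):       # placeholder predicate for "skip this update"
        loss = loss * 0.0        # zero gradients, but the same autograd graph
    engine.backward(loss)        # DeepSpeed engine API
    engine.step()

Scaling the loss to zero instead of substituting a detached dummy value keeps the graph traversal identical on every step, so backward still walks through every fetched parameter and the coordinator can release them.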

@daehuikim

I have the same issue when I use ZeRO stage 3 with the latest DeepSpeed version. The issue occurs after the first evaluation step. How can I fix it?

@jiahuanluo

jiahuanluo commented May 14, 2024

Try the latest driver and CUDA versions.

@cyrildiagne

Currently facing the same issue with ZeRO stage 3 at the first validation step in PyTorch Lightning.

@karthik-nexusflow

Facing the same issue:

  File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 204, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 723, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (512, 8192), 'ds_shape': (512, 8192)
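The `training=False` in `_end_of_forward_hook` above points at an inference forward. One pattern that, as far as I can tell, keeps the pre/post-forward hooks paired is to call the engine through `__call__` under `torch.no_grad()`, rather than invoking `module.forward(...)` directly, which bypasses PyTorch's hook machinery on the top-level module. A minimal sketch with placeholder names (`engine`, `batch`):

import torch

def run_validation_forward(engine, batch):
    """Fragment: inference forward that keeps ZeRO-3's hooks paired."""
    engine.module.eval()
    with torch.no_grad():
        # Call the engine (not engine.module.forward) so nn.Module runs
        # the registered pre- and post-forward hooks as a matched pair;
        # reset_step() should then find no inflight params afterwards.
        outputs = engine(batch)
    engine.module.train()
    return outputs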

@syx11237744
Copy link

Same problem +1.

@Osilly

Osilly commented Aug 14, 2024

Hello, I have the same problem when training a dynamic forward network. How can I solve it? Thanks.
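For dynamic architectures like this, the workaround often suggested is to turn off ZeRO-3's prefetching and parameter-reuse tracking, so nothing is fetched ahead of a control-flow decision. A minimal config sketch, assuming the rest of the ZeRO-3 setup is unchanged (`train_micro_batch_size_per_gpu` is a placeholder value):

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 0,  # never prefetch params ahead of use
        "stage3_max_reuse_distance": 0,    # release params right after each use
    },
}
# model and optimizer are assumed to be built elsewhere in the script:
# engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

This should trade some throughput for robustness: every parameter is fetched on demand, so a mismatch between the traced and actual module order no longer leaves params inflight.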
