[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156

marsggbo · 2023-04-06T15:34:35Z

Describe the bug

what's the possible reason for error below

  File "/home/xihe/xinhe/distNAS/DeepspeedNAS/train.py", line 200, in train_zero
    engine.backward(loss)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/engine.py", line 1980, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2088, in backward
    self._get_param_coordinator(training=True).reset_step()
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
    raise RuntimeError(
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/cm/extra/Utils/CUDA/11.1.0.0_455.23.05'
DeepSpeed general environment info:
torch install path ............... ['/datasets/xihe/miniconda3/envs/colossal/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+3667758, 3667758, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

The text was updated successfully, but these errors were encountered:

tjruwase · 2023-04-06T22:50:06Z

@marsggbo, thanks for reporting this issue. Can you please provide more details to enable reproducing this problem?

marsggbo · 2023-04-07T01:52:07Z

I use a model with a dynamic forward, below is an example

class ToyNASModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 512, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv2 = torch.nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv3 = torch.nn.Conv2d(512, 1024, kernel_size=5, stride=1, padding=2, bias=False)
        self.gavg = torch.nn.AdaptiveAvgPool2d((1, 1))
        self.fc = torch.nn.Linear(1024, 1000, bias=False)
        self.count = 0

    def forward(self, x):
        out = self.conv1(x)
        if self.count % 2 == 0:
            out = self.conv2(out)
        else:
            out = self.conv3(out)
        self.count += 1
        out = self.gavg(out).view(out.size(0), -1)
        out = self.fc(out)
        return out

the ds_config is as below:

{
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-08,
            "weight_decay": 3e-07
        }
    },
    "deepspeed": {
        "num_gpus": 1
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    },
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 50000000,
        "reduce_bucket_size": 50000000,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "fp16_master_weights_and_grads": false,
        "loss_scale": 0,
        "loss_scale_window": 500,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 15
    }
}

tobideusser · 2023-04-19T13:34:45Z

I have the same issue (using the PyTorch lightning implementation and deepspeed v0.9.0), any pointers on where this originates from would be splendid.

tjruwase · 2023-04-19T13:43:47Z

@marsggbo, thanks for sharing a toy model. However, we need more than this to reproduce the issue. Can you please share script, and code, data, and command line to reproduce the issue?

@tobideusser, can you share repro details?

tobideusser · 2023-04-24T08:34:45Z

I'm sorry for taking so long to reply, I was trying to figure out where that error came from.

To start with, this is basically my validation_step:

def validation_step(self, batch: Dict, batch_idx: int) -> Dict:
        predictions = self.model.generate(
            input_ids=batch["prompt_ids"],
            num_beams=1,
            do_sample=False,
            max_new_tokens=54,
        )
        batch["predictions"] = predictions
        batch = self._detach_tensors_in_dict(batch)
        return batch

This is what I found out:

The error appears after the first validation_step of PyTorch Lightning, but before the second one.
In there, I use the generation function of a language model imported from HF (https://huggingface.co/docs/transformers/main_classes/text_generation)
After investigating where this error is raised, I found that during the reset_step it is simply checked if there are still any in flight parameters (see: https://github.com/microsoft/DeepSpeed/blob/39b429d56ef12b3dc82fc177e2f0f801db744a3d/deepspeed/runtime/zero/partitioned_param_coordinator.py#L183), so someone probably has encountered this and wrote an exception for it, I guess?
If I disable the sanity checking of Pytorch Lightning (https://lightning.ai/docs/pytorch/latest/common/trainer.html?highlight=num_sanity_val_steps#num-sanity-val-steps) the first validation step actually runs through but throws an Invalidate trace cache @ step 271 and module 2: cache has only 271 modules print statement (which is printed here: https://github.com/microsoft/DeepSpeed/blob/39b429d56ef12b3dc82fc177e2f0f801db744a3d/deepspeed/runtime/zero/partitioned_param_coordinator.py#L152)
Interestingly, if I create a dummy dataset (and reproducible code) this error does not appear.

Therefore, my toy example actually runs through and is not helpful so far.

Could you shed some light on what an "Inflight Parameter" actually is? Is it possible to somehow detach them? I tried simply detaching every tensor in the result after model.generate in the validation step, but this does not change the behaviour.

justHungryMan · 2023-05-04T12:14:34Z

I have the same problem using zero3 in pytorch lightning and using the generate function. Is there a solution?

HeyangQin · 2023-05-19T18:11:03Z

Hello @justHungryMan @tobideusser. This issue has been fixed by a collaborative effort with the lightning team. Please update the deepspeed and lightning to apply the fix. Thank you.

@marsggbo if the error is still there even with the latest deepspeed. Please feel free to reopen this issue with a reproduce script.

ZJXNEFU · 2023-06-15T10:03:35Z

@HeyangQin I also met this problem when running deepspeed chat

the deepspeed version is 0.9.5

HeyangQin · 2023-06-15T20:06:03Z

Hello @ZJXNEFU. Could you provide a reproduce script for us to investigate this issue? Thank you

ZJXNEFU · 2023-06-16T01:40:36Z

Hello @ZJXNEFU. Could you provide a reproduce script for us to investigate this issue? Thank you

# deepspeed-chat step3 single_node run_rl.sh

#!/bin/bash

ACTOR_MODEL_PATH=/model/path   # a model fine-tuned on the bigcode/starcoder
CRITIC_MODEL_PATH=/reward_model_path  # a reward model trained on bigcode/tiny_starcoder_py  by step 2 scripts
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./rl_output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

GPU_OPTION=" --include localhost:0,1,2,3,4,5,6,7 "

deepspeed --master_port 12346 ${GPU_OPTION} main.py \
   --data_path data/ \
   --data_split 2,4,4 \
   --data_output_path cache/ \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 0 \
   --per_device_train_batch_size 2 \
   --per_device_mini_train_batch_size 2 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 1024 \
   --max_prompt_seq_len 1024 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 1 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT

the follow script is deepspeed config

    zero_opt_dict = {
        "stage": 3,
        "offload_param": {
            "device": "none",
        },
        "offload_optimizer": {
            "device": "none",
        },
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 1e7,
        "stage3_prefetch_bucket_size": 1e7,
        "memory_efficient_linear": False,
        "reduce_bucket_size": 1e7,
        "allgather_bucket_size": 1e7,
        "stage3_max_reuse_distance": 1e7
    }

   all_config = {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
        "steps_per_print": 10,
        "zero_optimization": zero_opt_dict,
        "fp16": {
            "enabled": True
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "wall_clock_breakdown": False
    }

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/dschat/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu116
deepspeed install path ........... ['/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.5+b692d236, b692d236, master
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

@HeyangQin

Fhujinwu · 2023-06-27T12:15:01Z

+1，please tell me how to address this issue. "RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param..ds_summary of Parameter containing"

dearchill · 2023-06-29T08:50:17Z

Still have similar inflight params issue with 0.10.0+f8551b43 when running deepspeedchat.

HeyangQin · 2023-06-29T18:30:39Z

One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.

Fhujinwu · 2023-06-30T09:45:42Z

One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.

Still have similar inflight params issue with deepspeed_0.10.0+fd1d2c64 when running deepspeedchat. deepspeedai/DeepSpeedExamples#616

dearchill · 2023-06-30T09:53:53Z

One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.

Yes, I installed 0.10.0+f8551b43 from source, but the issue still remained. For my case, the branch "HeyangQin/fix_issue_3156" solved it. Thanks a lot anyway. You can also try the branch "HeyangQin/fix_issue_3156". @Fhujinwu

Fhujinwu · 2023-07-01T01:01:23Z

One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.

Yes, I installed 0.10.0+f8551b43 from source, but the issue still remained. For my case, the branch "HeyangQin/fix_issue_3156" solved it. Thanks a lot anyway. You can also try the branch "HeyangQin/fix_issue_3156". @Fhujinwu

thank you，the branch "HeyangQin/fix_issue_3156" solved the issue.

vittorio-perera · 2023-08-18T00:18:49Z

I came across the same error (RuntimeError: still have inflight params) checking out and installing the branch HeyangQin/fix_issue_3156 resolved it but using the current main did not. Unfortunately the branch is at version 0.9. Any chance it will be merged into main so that the problem is fixed on version 0.10? @HeyangQin

Happy to provide any additional info if it can help 😃

iamsile · 2023-08-19T10:35:33Z

I came across the same error too (RuntimeError: still have inflight params) . When I use deepspeed to train RL on v100 * 8, this bug still exists. I also switch branch to HeyangQin/fix_issue_3156, but it doesn't work.
Happy to provide any additional info if it can help.

my deepspeed vision 0.10.1. @HeyangQin

This is Log output

File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 259, in train_rlhf
self.critic_model.backward(critic_loss)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2031, in backward
self._get_param_coordinator(training=True)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 199, in reset_step
aise RuntimeError(f"still have inflight params "

RuntimeError: RuntimeError: still have inflight params [{'id': 4548, 'status': 'INFLIGHT', 'numel': 412139520, 'ds_numel': 412139520, 'shape': (80496, 5120), 'ds_shape': (80496, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([103034880])}, {'id': 4035, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1638400, 'shape': (0,), 'ds_shape': (1280, 1280), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([409600])}, {'id': 4027, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 328960, 'shape': (0,), 'ds_shape': (257, 1280), 'requires_grad': False...........................................

This is my ds_report:

Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.9.3+f2d600ba-py3.10.egg/deepspeed']
deepspeed info ................... 0.9.3+f2d600ba, f2d600b, HeyangQin/fix_issue_3156
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

HeyangQin · 2023-08-24T05:56:31Z

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you

vittorio-perera · 2023-08-24T12:08:21Z

@HeyangQin I'll try to put together a script as soon as possible. thanks for getting back on this!

iamsile · 2023-08-25T09:07:43Z

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you

@HeyangQin This is a full record:#4175. ~~I used HeyangQin/fix_issue_3156 to fix it~~. ~~But I found this branch didn't merge into master.~~ Looking forward to your reply.

@HeyangQin In my lastest test. I found copy HeyangQin/fix_issue_3156 into master hasn't work. In RL training, it only works at step0, after that it must be crash. this is a full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16)
reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16)
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0
Average reward score: 0.2470703125

Traceback (most recent call last):
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in
main()
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
out = trainer.generate_experience(batch_prompt['images'],
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
seq = self._generate_sequence(images, prompts, mask, step)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
seq = self.actor_model.module.forward(images,
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
return self.get_base_model()(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
outputs = self.language_model.generate(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
return self.greedy_search(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
outputs = self(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
outputs = self.model(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
layer_outputs = decoder_layer(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
result = hook(self, input)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
Traceback (most recent call last):
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in
main()
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
out = trainer.generate_experience(batch_prompt['images'],
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
seq = self._generate_sequence(images, prompts, mask, step)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
seq = self.actor_model.module.forward(images,
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
return self.get_base_model()(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
outputs = self.language_model.generate(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
return self.greedy_search(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
outputs = self(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
outputs = self.model(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
layer_outputs = decoder_layer(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
result = hook(self, input)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
Traceback (most recent call last):
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in
main()
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
out = trainer.generate_experience(batch_prompt['images'],
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
seq = self._generate_sequence(images, prompts, mask, step)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
seq = self.actor_model.module.forward(images,
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
return self.get_base_model()(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
outputs = self.language_model.generate(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
return self.greedy_search(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
outputs = self(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
outputs = self.model(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
layer_outputs = decoder_layer(
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
result = forward_call(*input, **kwargs)
File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
result = hook(self, input)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302
[2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303

jiahuanluo · 2023-09-05T04:46:12Z

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you

@HeyangQin This is a full record:#4175. ~~I used HeyangQin/fix_issue_3156 to fix it~~. ~~But I found this branch didn't merge into master.~~ Looking forward to your reply.

@HeyangQin In my lastest test. I found copy HeyangQin/fix_issue_3156 into master hasn't work. In RL training, it only works at step0, after that it must be crash. this is a full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16) reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16) reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16) use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... /xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0 Average reward score: 0.2470703125

Traceback (most recent call last): File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in main() File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main out = trainer.generate_experience(batch_prompt['images'], File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience seq = self._generate_sequence(images, prompts, mask, step) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence seq = self.actor_model.module.forward(images, File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward return self.get_base_model()(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward outputs = self.language_model.generate( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate return self.greedy_search( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search outputs = self( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward outputs = self.model( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward layer_outputs = decoder_layer( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward hidden_states = self.input_layernorm(hidden_states) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook self.pre_sub_module_forward_function(module) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=True) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])} Traceback (most recent call last): File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in main() File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main out = trainer.generate_experience(batch_prompt['images'], File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience seq = self._generate_sequence(images, prompts, mask, step) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence seq = self.actor_model.module.forward(images, File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward return self.get_base_model()(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward outputs = self.language_model.generate( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate return self.greedy_search( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search outputs = self( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward outputs = self.model( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward layer_outputs = decoder_layer( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward hidden_states = self.input_layernorm(hidden_states) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook self.pre_sub_module_forward_function(module) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=True) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])} Traceback (most recent call last): File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in main() File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main out = trainer.generate_experience(batch_prompt['images'], File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience seq = self._generate_sequence(images, prompts, mask, step) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence seq = self.actor_model.module.forward(images, File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward return self.get_base_model()(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward outputs = self.language_model.generate( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate return self.greedy_search( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search outputs = self( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward outputs = self.model( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward layer_outputs = decoder_layer( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward hidden_states = self.input_layernorm(hidden_states) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook self.pre_sub_module_forward_function(module) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=True) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])} [2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301 [2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302 [2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303

Same. Work at step 0 and then crash with raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} already in registry

iamsile · 2023-09-11T08:59:55Z

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you

@HeyangQin This is a full record:#4175. ~~I used HeyangQin/fix_issue_3156 to fix it~~. ~~But I found this branch didn't merge into master.~~ Looking forward to your reply.

@HeyangQin In my lastest test. I found copy HeyangQin/fix_issue_3156 into master hasn't work. In RL training, it only works at step0, after that it must be crash. this is a full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16) reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16) reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16) use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... /xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0 Average reward score: 0.2470703125

Traceback (most recent call last): File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in main() File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main out = trainer.generate_experience(batch_prompt['images'], File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience seq = self._generate_sequence(images, prompts, mask, step) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence seq = self.actor_model.module.forward(images, File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward return self.get_base_model()(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward outputs = self.language_model.generate( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate return self.greedy_search( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search outputs = self( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward outputs = self.model( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward layer_outputs = decoder_layer( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward hidden_states = self.input_layernorm(hidden_states) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook self.pre_sub_module_forward_function(module) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=True) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])} Traceback (most recent call last): File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in main() File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main out = trainer.generate_experience(batch_prompt['images'], File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience seq = self._generate_sequence(images, prompts, mask, step) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence seq = self.actor_model.module.forward(images, File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward return self.get_base_model()(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward outputs = self.language_model.generate( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate return self.greedy_search( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search outputs = self( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward outputs = self.model( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward layer_outputs = decoder_layer( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward hidden_states = self.input_layernorm(hidden_states) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook self.pre_sub_module_forward_function(module) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=True) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])} Traceback (most recent call last): File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in main() File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main out = trainer.generate_experience(batch_prompt['images'], File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience seq = self._generate_sequence(images, prompts, mask, step) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence seq = self.actor_model.module.forward(images, File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward return self.get_base_model()(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward outputs = self.language_model.generate( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate return self.greedy_search( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search outputs = self( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward outputs = self.model( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward layer_outputs = decoder_layer( File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward hidden_states = self.input_layernorm(hidden_states) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook self.pre_sub_module_forward_function(module) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=True) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])} [2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301 [2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302 [2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303

I have fixed it and close it.

jiahuanluo · 2023-09-12T06:42:00Z

@iamsile Hi, Could you please tell me how to fix this? Many Thanks.

XenonLamb · 2024-03-21T23:38:47Z

@iamsile Has your case been resolved with the latest deepspeed version? I observed similar issues recently. typically with a bert model and some linear layers, under zero-3. the training process starts with some "Invalidate trace cache @ step xx: expected module xx, but got module xxx", and then after about 10 steps. it aborts with RuntimeError: still have inflight params [{'id': 1143, 'status': 'AVAILABLE', 'numel': 1982464, 'ds_numel': 1982464, 'shape': (1408, 1408), 'ds_shape': (1408, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([247808])}, {'id': 1145, 'status': 'AVAILABLE', 'numel': 5767168, 'ds_numel': 5767168, 'shape': (4096, 1408), 'ds_shape': (4096, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([720896])}]

dzagardo · 2024-04-29T00:36:26Z

i've also got this issue @XenonLamb , mine happens about 230 steps in at the end of a batch when i'm returning a dummy loss value. not sure what the deal is 🤨

[rank6]: RuntimeError: still have inflight params [{'id': 6, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1572864, 'shape': (0,), 'ds_shape': (1024, 512, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([196608])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 18, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 28, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 36, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 34, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 37, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 4, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 393216, 'shape': (0,), 'ds_shape': (512, 256, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([49152])}, {'id': 12, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 8, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (1024, 1024, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 19, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 20, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 30, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 40, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 38, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 46, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 55, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 77, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 89, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (512, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 58, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 66, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 73, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 82, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 262144, 'shape': (0,), 'ds_shape': (512, 512), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 56, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 70, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 76, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 88, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (1024, 512), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 64, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 85, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (512, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 86, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 262144, 'shape': (0,), 'ds_shape': (512, 512), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 48, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 72, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 54, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 59, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 74, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 41, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 52, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 84, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (1024, 512), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 92, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}]

daehuikim · 2024-05-14T01:13:29Z

I have same issue when I use zero stage 3 with latest deepspeed version. This issue occurs after first evaluation step. How could I fix it?

jiahuanluo · 2024-05-14T07:23:19Z

try the latest version driver and cuda.

cyrildiagne · 2024-05-27T15:58:55Z

Currently facing the same ussing with zero stage 3 and first validation step in pytorch lightning

karthik-nexusflow · 2024-06-03T07:56:25Z

facing the same issue File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 204, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 723, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (512, 8192), 'ds_shape': (512, 8192)

syx11237744 · 2024-07-23T09:30:38Z

facing the same issue File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook self.get_param_coordinator(training=False).reset_step() File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 204, in reset_step raise RuntimeError(f"still have inflight params " RuntimeError: still have inflight params [{'id': 723, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (512, 8192), 'ds_shape': (512, 8192)

same problem+1

Osilly · 2024-08-14T11:33:21Z

Hello? I have the same problem in training a dynamic forward network. How to solve this problem? Thanks.

marsggbo added bug Something isn't working training labels Apr 6, 2023

tjruwase assigned HeyangQin Apr 7, 2023

HeyangQin mentioned this issue Apr 25, 2023

make parameters status shared by all PartitionedParameterCoordinator instances #3380

Closed

HeyangQin mentioned this issue May 5, 2023

share inflight registry between PartitionedParameterCoordinators #3462

Merged

HeyangQin closed this as completed May 19, 2023

HeyangQin added a commit that referenced this issue May 25, 2023

fix #3156

f2d600b

yangzhipeng1108 mentioned this issue Jun 16, 2023

RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter yangzhipeng1108/DeepSpeed-Chat-ChatGLM#7

Closed

vivek-media mentioned this issue Jun 20, 2023

[BUG] RuntimeError: inflight params error when using DeepSpeed for Reinforcement Learning huggingface/trl#430

Closed

ZJXNEFU mentioned this issue Jun 21, 2023

When I load model and use deepspeed with ZERO stage 3, I have a problem with high cpu memory Usage #3722

Open

vivek-media mentioned this issue Jun 21, 2023

[BUG] RuntimeError: inflight params error when using DeepSpeed for Reinforcement Learning huggingface/trl#453

Closed

HeyangQin mentioned this issue Jun 26, 2023

Fix racing condition in GatheredParameters #3819

Merged

HeyangQin mentioned this issue Jul 5, 2023

Separate ZeRO3 InflightParamRegistry for train and eval #3884

Merged

formath mentioned this issue Jul 12, 2023

AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' in step 3. deepspeedai/DeepSpeedExamples#379

Open

dillonalaird mentioned this issue Jan 28, 2024

[NeuralChat] Finetuning for LLaVA failing on image + text pairs intel/intel-extension-for-transformers#1201

Closed

PommesPeter mentioned this issue Feb 22, 2024

[Feature] Support Gemma InternLM/xtuner#429

Merged

qgallouedec mentioned this issue Aug 22, 2024

[BUG] RuntimeError: still have inflight params in KTO huggingface/trl#1732

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156

marsggbo commented Apr 6, 2023

tjruwase commented Apr 6, 2023

marsggbo commented Apr 7, 2023

tobideusser commented Apr 19, 2023

tjruwase commented Apr 19, 2023 •

edited

Loading

tobideusser commented Apr 24, 2023

justHungryMan commented May 4, 2023

HeyangQin commented May 19, 2023

ZJXNEFU commented Jun 15, 2023 •

edited

Loading

HeyangQin commented Jun 15, 2023

ZJXNEFU commented Jun 16, 2023

Fhujinwu commented Jun 27, 2023

dearchill commented Jun 29, 2023

HeyangQin commented Jun 29, 2023

Fhujinwu commented Jun 30, 2023

dearchill commented Jun 30, 2023

Fhujinwu commented Jul 1, 2023

vittorio-perera commented Aug 18, 2023

iamsile commented Aug 19, 2023 •

edited

Loading

HeyangQin commented Aug 24, 2023

vittorio-perera commented Aug 24, 2023

iamsile commented Aug 25, 2023 •

edited

Loading

jiahuanluo commented Sep 5, 2023

iamsile commented Sep 11, 2023

jiahuanluo commented Sep 12, 2023

XenonLamb commented Mar 21, 2024

dzagardo commented Apr 29, 2024 •

edited

Loading

daehuikim commented May 14, 2024

jiahuanluo commented May 14, 2024 •

edited

Loading

cyrildiagne commented May 27, 2024

karthik-nexusflow commented Jun 3, 2024

syx11237744 commented Jul 23, 2024

Osilly commented Aug 14, 2024

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156

Comments

marsggbo commented Apr 6, 2023

tjruwase commented Apr 6, 2023

marsggbo commented Apr 7, 2023

tobideusser commented Apr 19, 2023

tjruwase commented Apr 19, 2023 • edited Loading

tobideusser commented Apr 24, 2023

justHungryMan commented May 4, 2023

HeyangQin commented May 19, 2023

ZJXNEFU commented Jun 15, 2023 • edited Loading

HeyangQin commented Jun 15, 2023

ZJXNEFU commented Jun 16, 2023

Fhujinwu commented Jun 27, 2023

dearchill commented Jun 29, 2023

HeyangQin commented Jun 29, 2023

Fhujinwu commented Jun 30, 2023

dearchill commented Jun 30, 2023

Fhujinwu commented Jul 1, 2023

vittorio-perera commented Aug 18, 2023

iamsile commented Aug 19, 2023 • edited Loading

Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible

HeyangQin commented Aug 24, 2023

vittorio-perera commented Aug 24, 2023

iamsile commented Aug 25, 2023 • edited Loading

jiahuanluo commented Sep 5, 2023

iamsile commented Sep 11, 2023

jiahuanluo commented Sep 12, 2023

XenonLamb commented Mar 21, 2024

dzagardo commented Apr 29, 2024 • edited Loading

daehuikim commented May 14, 2024

jiahuanluo commented May 14, 2024 • edited Loading

cyrildiagne commented May 27, 2024

karthik-nexusflow commented Jun 3, 2024

syx11237744 commented Jul 23, 2024

Osilly commented Aug 14, 2024

tjruwase commented Apr 19, 2023 •

edited

Loading

ZJXNEFU commented Jun 15, 2023 •

edited

Loading

iamsile commented Aug 19, 2023 •

edited

Loading

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

iamsile commented Aug 25, 2023 •

edited

Loading

dzagardo commented Apr 29, 2024 •

edited

Loading

jiahuanluo commented May 14, 2024 •

edited

Loading