[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing: #3156
@marsggbo, thanks for reporting this issue. Can you please provide more details to enable reproducing this problem? |
I use a model with a dynamic forward pass; below is an example:

```python
import torch
import torch.nn as nn

class ToyNASModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 512, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv2 = torch.nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv3 = torch.nn.Conv2d(512, 1024, kernel_size=5, stride=1, padding=2, bias=False)
        self.gavg = torch.nn.AdaptiveAvgPool2d((1, 1))
        self.fc = torch.nn.Linear(1024, 1000, bias=False)
        self.count = 0

    def forward(self, x):
        out = self.conv1(x)
        # Alternate between conv2 and conv3 on every call (dynamic forward path)
        if self.count % 2 == 0:
            out = self.conv2(out)
        else:
            out = self.conv3(out)
        self.count += 1
        out = self.gavg(out).view(out.size(0), -1)
        out = self.fc(out)
        return out
```

The ds_config is as below:
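A minimal sketch of the kind of ZeRO stage 3 setup involved, wired up via `deepspeed.initialize`; the values below are assumptions for illustration, not the original config:

```python
# Illustrative ZeRO stage 3 configuration (assumed values, not the reporter's actual config).
# Stage 3 partitions parameters and prefetches them, which is where "inflight" tracking applies.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
    },
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = ToyNASModel()  # the toy model defined above
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)
```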
|
I have the same issue (using the PyTorch Lightning implementation and DeepSpeed v0.9.0); any pointers on where this originates from would be splendid. |
@marsggbo, thanks for sharing a toy model. However, we need more than this to reproduce the issue. Can you please share the script, code, data, and command line needed to reproduce it? @tobideusser, can you share repro details? |
I'm sorry for taking so long to reply, I was trying to figure out where that error came from. To start with, this is basically my validation_step:
This is what I found out:
Therefore, my toy example actually runs through and is not helpful so far. Could you shed some light on what an "Inflight Parameter" actually is? Is it possible to somehow detach them? I tried simply detaching every tensor in the result after model.generate in the validation step, but this does not change the behaviour. |
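For context: under ZeRO stage 3, partitioned parameters are all-gathered (prefetched) ahead of the modules expected to use them, and a parameter is "inflight" from the time the fetch is launched until it is consumed and released; a forward pass that diverges from the recorded trace (e.g. dynamic control flow or an early exit from generation) can leave prefetched parameters stuck in that state. A rough debugging sketch for listing them, assuming the internal `ds_status`/`ds_summary` attributes seen in the traceback and the `ZeroParamStatus` enum (attribute names may differ across versions):

```python
# Illustrative debugging helper (an assumption, not an official DeepSpeed API):
# list parameters that ZeRO stage 3 still considers "inflight", i.e. fetched but
# not yet released, using the per-parameter attributes the error message refers to.
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus


def report_inflight_params(model):
    for name, param in model.named_parameters():
        status = getattr(param, "ds_status", None)
        if status == ZeroParamStatus.INFLIGHT:
            # ds_summary() is the bound method shown in the RuntimeError
            print(name, param.ds_summary())
```

Calling such a helper at the end of a validation step (e.g. right after `model.generate`) can show which modules' weights were prefetched but never used on the dynamic path.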
I have the same problem using ZeRO-3 in PyTorch Lightning with the generate function. Is there a solution? |
Hello @justHungryMan @tobideusser. This issue has been fixed by a collaborative effort with the Lightning team. Please update DeepSpeed and Lightning to apply the fix. Thank you. @marsggbo, if the error is still there even with the latest DeepSpeed, please feel free to reopen this issue with a reproduction script. |
@HeyangQin I also hit this problem when running DeepSpeed Chat; the DeepSpeed version is 0.9.5. |
Hello @ZJXNEFU. Could you provide a reproduction script for us to investigate this issue? Thank you. |
The following is the DeepSpeed config:
ds_report
|
+1, please tell me how to address this issue: "RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing". |
Still have a similar inflight params issue with 0.10.0+f8551b43 when running DeepSpeed Chat. |
One of our recent fixes, #3819, should have fixed this issue. It is not included in the PyPI release yet, so you need to install DeepSpeed from source to apply the fix. Please let us know if you still see this inflight issue even with the fix. |
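To check whether a source build actually carries a given commit, a quick sketch is shown below; `git_version_info` is what `ds_report` reads to print the commit hash, but treat the exact attribute names as assumptions:

```python
# Sanity-check which DeepSpeed build (version + git commit) is actually installed.
# Attribute names are assumptions and may differ across DeepSpeed versions.
import deepspeed
from deepspeed import git_version_info

print("version:", deepspeed.__version__)          # e.g. 0.10.0+f8551b43
print("git hash:", getattr(git_version_info, "git_hash", "unknown"))
print("git branch:", getattr(git_version_info, "git_branch", "unknown"))
```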
Still have a similar inflight params issue with DeepSpeed 0.10.0+fd1d2c64 when running DeepSpeed Chat: deepspeedai/DeepSpeedExamples#616 |
Yes, I installed 0.10.0+f8551b43 from source, but the issue still remained. For my case, the branch "HeyangQin/fix_issue_3156" solved it. Thanks a lot anyway. You can also try the branch "HeyangQin/fix_issue_3156". @Fhujinwu |
Thank you, the branch "HeyangQin/fix_issue_3156" solved the issue. |
I came across the same error (RuntimeError: still have inflight params). Happy to provide any additional info if it can help 😃 |
I came across the same error too (RuntimeError: still have inflight params). When I use DeepSpeed to train RL on 8x V100, this bug still exists. I also switched to the branch HeyangQin/fix_issue_3156, but it doesn't work. My DeepSpeed version is 0.10.1. @HeyangQin This is the log output:

```
File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 259, in train_rlhf
RuntimeError: still have inflight params [{'id': 4548, 'status': 'INFLIGHT', 'numel': 412139520, 'ds_numel': 412139520, 'shape': (80496, 5120), 'ds_shape': (80496, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([103034880])}, {'id': 4035, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1638400, 'shape': (0,), 'ds_shape': (1280, 1280), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([409600])}, {'id': 4027, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 328960, 'shape': (0,), 'ds_shape': (257, 1280), 'requires_grad': False...........................................
```

This is my ds_report:

```
Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
```
|
Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you |
@HeyangQin I'll try to put together a script as soon as possible. thanks for getting back on this! |
@HeyangQin This is a full record: #4175. In my latest test I found that copying HeyangQin/fix_issue_3156 into master doesn't work. In RL training it only works at step 0; after that it crashes. This is a full report:

```
reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
Traceback (most recent call last):
```
|
Same here. It works at step 0 and then crashes with raise RuntimeError(f"{param.ds_summary()} already in registry"). |
I have fixed it and am closing it. |
@iamsile Hi, could you please tell me how to fix this? Many thanks. |
@iamsile Has your case been resolved with the latest DeepSpeed version? I observed similar issues recently, typically with a BERT model and some linear layers under ZeRO-3. The training process starts with some "Invalidate trace cache @ step xx: expected module xx, but got module xxx" warnings, and then after about 10 steps it aborts with |
I've also got this issue @XenonLamb; mine happens about 230 steps in, at the end of a batch, when I'm returning a dummy loss value. Not sure what the deal is 🤨
|
I have the same issue when I use ZeRO stage 3 with the latest DeepSpeed version. The issue occurs after the first evaluation step. How can I fix it? |
Try the latest driver and CUDA versions. |
Currently facing the same issue with ZeRO stage 3 at the first validation step in PyTorch Lightning. |
Facing the same issue: File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook |
Same problem, +1. |
Hello, I have the same problem when training a network with a dynamic forward pass. How can this be solved? Thanks. |
Describe the bug
What is the possible reason for the error below?
ds_report output