terminate called after throwing an instance of 'c10::Error' #17

Closed
kaihe opened this issue Mar 7, 2022 · 6 comments

kaihe commented Mar 7, 2022

I am playing with ldm.models.diffusion.ddpm.LatentDiffusion with 4 GPUs and DDP distribution. After around 30 epochs, training stopped with the following error:

```
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f082820c8b2 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f082845ef20 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f08281f7b7d in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: + 0x5f65b2 (0x7f08725575b2 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x13c2bc (0x55b1c22232bc in /root/miniconda3/envs/ldm/bin/python)
frame #5: + 0x1efd35 (0x55b1c22d6d35 in /root/miniconda3/envs/ldm/bin/python)
frame #6: _PyObject_GC_Malloc + 0x88 (0x55b1c2223998 in /root/miniconda3/envs/ldm/bin/python)
frame #7: PyType_GenericAlloc + 0x3b (0x55b1c2293a8b in /root/miniconda3/envs/ldm/bin/python)
frame #8: + 0xc385 (0x7f08a1bbf385 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #9: + 0x13d585 (0x55b1c2224585 in /root/miniconda3/envs/ldm/bin/python)
frame #10: + 0xf97f (0x7f08a1bc297f in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #11: + 0xfb7e (0x7f08a1bc2b7e in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #12: + 0x1e857 (0x7f08a1bd1857 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #13: + 0x5f92c (0x55b1c214692c in /root/miniconda3/envs/ldm/bin/python)
frame #14: + 0x16fb40 (0x55b1c2256b40 in /root/miniconda3/envs/ldm/bin/python)
frame #15: + 0xe4d6 (0x7f08a17a84d6 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/mt19937.cpython-38-x86_64-linux-gnu.so)
frame #16: + 0x13d60c (0x55b1c222460c in /root/miniconda3/envs/ldm/bin/python)
frame #17: + 0x14231 (0x7f08a1bf4231 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/mtrand.cpython-38-x86_64-linux-gnu.so)
frame #18: + 0x21d0e (0x7f08a1c01d0e in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/mtrand.cpython-38-x86_64-linux-gnu.so)
frame #19: _PyObject_MakeTpCall + 0x1a4 (0x55b1c22247d4 in /root/miniconda3/envs/ldm/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x4596 (0x55b1c22abf56 in /root/miniconda3/envs/ldm/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2d2 (0x55b1c2271a92 in /root/miniconda3/envs/ldm/bin/python)
frame #22: _PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #23: + 0x18be79 (0x55b1c2272e79 in /root/miniconda3/envs/ldm/bin/python)
frame #24: PyVectorcall_Call + 0x71 (0x55b1c2224041 in /root/miniconda3/envs/ldm/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x7df (0x55b1c2271f9f in /root/miniconda3/envs/ldm/bin/python)
frame #27: _PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #28: + 0x18be79 (0x55b1c2272e79 in /root/miniconda3/envs/ldm/bin/python)
frame #29: PyVectorcall_Call + 0x71 (0x55b1c2224041 in /root/miniconda3/envs/ldm/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #31: _PyEval_EvalCodeWithName + 0x7df (0x55b1c2271f9f in /root/miniconda3/envs/ldm/bin/python)
frame #32: _PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #33: _PyObject_FastCallDict + 0x24b (0x55b1c22734cb in /root/miniconda3/envs/ldm/bin/python)
frame #34: _PyObject_Call_Prepend + 0x63 (0x55b1c2273733 in /root/miniconda3/envs/ldm/bin/python)
frame #35: + 0x18c83a (0x55b1c227383a in /root/miniconda3/envs/ldm/bin/python)
frame #36: PyObject_Call + 0x70 (0x55b1c2224200 in /root/miniconda3/envs/ldm/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #38: _PyEval_EvalCodeWithName + 0x2d2 (0x55b1c2271a92 in /root/miniconda3/envs/ldm/bin/python)
frame #39: _PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #40: _PyObject_FastCallDict + 0x24b (0x55b1c22734cb in /root/miniconda3/envs/ldm/bin/python)
frame #41: _PyObject_Call_Prepend + 0x63 (0x55b1c2273733 in /root/miniconda3/envs/ldm/bin/python)
frame #42: + 0x18c83a (0x55b1c227383a in /root/miniconda3/envs/ldm/bin/python)
frame #43: _PyObject_MakeTpCall + 0x22f (0x55b1c222485f in /root/miniconda3/envs/ldm/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x11d0 (0x55b1c22a8b90 in /root/miniconda3/envs/ldm/bin/python)
frame #45: _PyFunction_Vectorcall + 0x10b (0x55b1c227286b in /root/miniconda3/envs/ldm/bin/python)
frame #46: + 0xba0de (0x55b1c21a10de in /root/miniconda3/envs/ldm/bin/python)
frame #47: + 0x17eb32 (0x55b1c2265b32 in /root/miniconda3/envs/ldm/bin/python)
frame #48: PyObject_GetItem + 0x49 (0x55b1c22568c9 in /root/miniconda3/envs/ldm/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0xbdd (0x55b1c22a859d in /root/miniconda3/envs/ldm/bin/python)
frame #50: _PyEval_EvalCodeWithName + 0x659 (0x55b1c2271e19 in /root/miniconda3/envs/ldm/bin/python)
frame #51: _PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #52: + 0xfeb84 (0x55b1c21e5b84 in /root/miniconda3/envs/ldm/bin/python)
frame #53: _PyEval_EvalCodeWithName + 0x7df (0x55b1c2271f9f in /root/miniconda3/envs/ldm/bin/python)
frame #54: _PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #55: + 0x10075e (0x55b1c21e775e in /root/miniconda3/envs/ldm/bin/python)
frame #56: _PyFunction_Vectorcall + 0x10b (0x55b1c227286b in /root/miniconda3/envs/ldm/bin/python)
frame #57: PyVectorcall_Call + 0x71 (0x55b1c2224041 in /root/miniconda3/envs/ldm/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #59: _PyFunction_Vectorcall + 0x10b (0x55b1c227286b in /root/miniconda3/envs/ldm/bin/python)
frame #60: + 0x10075e (0x55b1c21e775e in /root/miniconda3/envs/ldm/bin/python)
frame #61: _PyEval_EvalCodeWithName + 0x2d2 (0x55b1c2271a92 in /root/miniconda3/envs/ldm/bin/python)
frame #62: + 0x18bd20 (0x55b1c2272d20 in /root/miniconda3/envs/ldm/bin/python)
frame #63: + 0x10011a (0x55b1c21e711a in /root/miniconda3/envs/ldm/bin/python)

Epoch 37: 69%|▋| 227/328 [18:34<08:13, 4.89s/it, loss=0.794, v_num=2, train/loss_simple_step=0.792, train/loss_vlb_step=0.0081, tra
Traceback (most recent call last):
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
super().run(batch, batch_idx, dataloader_idx)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
model_ref.optimizer_step(
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/optim/adamw.py", line 65, in step
loss = closure()
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 537, in training_step_and_backward
result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 307, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 383, in training_step
return self.model(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 343, in training_step
loss, loss_dict = self.shared_step(batch)
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 887, in shared_step
x, c = self.get_input(batch, self.first_stage_key)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 661, in get_input
z = self.get_first_stage_encoding(encoder_posterior).detach()
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 544, in get_first_stage_encoding
z = encoder_posterior.sample()
File "/root/Desktop/ldm/ldm/modules/distributions/distributions.py", line 36, in sample
x = self.mean + self.std * torch.randn(self.mean.shape).to(device=self.parameters.device)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21388) is killed by signal: Aborted.
```

I am fairly sure it is related to this issue, but I was unable to fix it by setting rank_zero_only=True.

Any help is appreciated.
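For context on the `rank_zero_only=True` attempt: the general idea is to restrict side-effecting work (such as image logging) to the rank-0 process under DDP, so the other ranks never touch it. Below is a minimal sketch of that pattern using pytorch-lightning's `rank_zero_only` decorator; the `DemoImageLogger` callback and its `log_local` method are hypothetical names for illustration, not this repo's actual code.

```python
# Minimal sketch (hypothetical callback, for illustration only): run
# side-effecting work such as image logging only on the rank-0 process.
from pytorch_lightning import Callback
from pytorch_lightning.utilities import rank_zero_only


class DemoImageLogger(Callback):
    """Hypothetical callback showing the rank-zero-only logging pattern."""

    @rank_zero_only
    def log_local(self, images, step):
        # Executes only on rank 0; other ranks return immediately.
        print(f"step {step}: would save {len(images)} images here")

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # Log only occasionally to keep the example cheap.
        if trainer.global_step % 100 == 0:
            self.log_local(images=[], step=trainer.global_step)
```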

@poorfriend

poorfriend commented Mar 10, 2022

@kaihe did you fix this problem? I am hitting the same issue.

@kaihe

kaihe commented Mar 11, 2022

> @kaihe did you fix this problem? I am hitting the same issue.

I upgraded to pytorch-lightning==1.5.10 (`pip install pytorch-lightning==1.5.10`).
You may then see some pytorch-lightning related error messages, but they are easy to fix by adjusting some input formats.

@thunanguyen

I think this problem is related to the data loader. Have you tried reducing num_workers?
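A "CUDA error: initialization error" raised from a DataLoader worker is often a sign that CUDA state is being touched inside a forked worker process, so lowering `num_workers` (or setting it to 0 to take workers out of the picture entirely) is a reasonable first test. A minimal sketch of that check, using a stand-in dataset rather than this repo's actual data module:

```python
# Minimal sketch (illustrative names): rule out DataLoader workers by
# loading batches in the main process, or with fewer workers.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 256, 256))  # stand-in dataset

# num_workers=0 loads batches in the main process; if the crash disappears,
# the worker processes (and anything CUDA-related they touch) are suspect.
loader = DataLoader(dataset, batch_size=4, num_workers=0, pin_memory=True)

for (batch,) in loader:
    pass  # training step would go here
```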

@justinpinkney

I upgraded lightning to 1.4.6 (from 1.4.2) and this seems to have fixed the issue for me.

@illtellyoulater

@kaihe could you please remove all the "#N" references from your post? They get parsed as links to other issues, so right now everything looks like one giant cross-referenced issue. :)
Thank you!

kaihe closed this as completed Apr 25, 2022
@kaihe

kaihe commented Apr 25, 2022

Done, closing. This repo works fine with newer versions of pytorch-lightning, such as pytorch-lightning==1.6.1.
