CUDA error: device-side assert triggered #6

wencc-ucas · 2022-04-08T10:51:33Z

Hi, authors,

Thanks a lot for your awesome work.

I ran into the error below. Have you seen it before?

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 438, in forward
feats = layer(feats, xyz, batch, neighbor_idx)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 357, in forward
feats = self.kpconv(xyz, xyz, neighbor_idx, feats)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch_points3d/modules/KPConv/kernels.py", line 83, in forward
new_feat = KPConv_ops(
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch_points3d/modules/KPConv/convolution_ops.py", line 95, in KPConv_ops
neighborhood_features = gather(features, neighbors_indices)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch_points3d/core/common_modules/gathering.py", line 10, in gather
idx[idx == -1] = x.shape[0] - 1 # Shadow point
RuntimeError: CUDA error: device-side assert triggered
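
A possible way to narrow this down: a device-side assert around an indexing operation such as the gather call above often points to an index outside the valid range, and because CUDA reports errors asynchronously, the Python line named in the traceback may not be the one that actually failed. The sketch below checks the index range on the CPU before the KPConv call. It assumes the neighbor index tensor and the number of feature rows are at hand; the helper name check_neighbor_indices is hypothetical and not part of torch-points3d.

```python
import torch


def check_neighbor_indices(idx: torch.Tensor, num_rows: int) -> None:
    """Raise ValueError if idx cannot safely index a tensor with num_rows rows.

    Hypothetical helper: -1 is tolerated because gather() remaps it to the
    last (shadow) row; anything else must lie in [0, num_rows).
    """
    idx = idx.detach().cpu()  # inspect on the CPU so the check itself cannot assert on the GPU
    if idx.numel() == 0:
        return
    lo, hi = int(idx.min()), int(idx.max())
    # Values below -1 would wrap around as negative indices, which is
    # almost certainly unintended; values >= num_rows are out of bounds.
    if lo < -1 or hi >= num_rows:
        raise ValueError(
            f"neighbor index range [{lo}, {hi}] is invalid for {num_rows} rows"
        )


# Example with made-up data: 3 feature rows, one shadow neighbor marked as -1.
check_neighbor_indices(torch.tensor([[0, 2, -1]]), num_rows=3)
```

Running the script with the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, which usually makes the traceback point at the call that really failed.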

X-Lai · 2022-04-08T13:26:59Z

Thank you for your interest in our work. Have you set the correct data path and the correct training GPUs?

wencc-ucas · 2022-04-08T14:44:43Z

Thank you for your prompt response. I have modified the data path and GPUs.

The error above occurs when I specify only one GPU. When I use multiple GPUs, I get the following error instead:

Traceback (most recent call last):
File "train.py", line 547, in <module>
main()
File "train.py", line 84, in main
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/mmvc/Congcong/Stratified-Transformer/train.py", line 308, in main_worker
loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch, scaler, scheduler)
File "/home/mmvc/Congcong/Stratified-Transformer/train.py", line 380, in train
output = model(feat, coord, offset, batch, neighbor_idx)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 449, in forward
feats, xyz, offset, feats_down, xyz_down, offset_down = layer(feats, xyz, offset)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 291, in forward
new_window_size = 2 * torch.tensor([self.window_size]*3).type_as(xyz).to(xyz.device)
RuntimeError: CUDA error: invalid device function
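
For context, the CUDA runtime documents "invalid device function" as a device function that does not exist or was not compiled for the proper device architecture, so one thing worth checking is whether the installed binaries and compiled extensions match the GPU. The sketch below only reports the compute capability of each visible GPU in the form accepted by the TORCH_CUDA_ARCH_LIST build variable. The helper names arch_flag and describe_devices are hypothetical, and whether rebuilding an extension with that value resolves this particular error is an assumption, not something confirmed in this thread.

```python
def arch_flag(capability):
    """Format a (major, minor) compute capability tuple as e.g. '8.6'."""
    major, minor = capability
    return f"{major}.{minor}"


def describe_devices():
    """Return one line per visible GPU; an empty list if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    lines = []
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)  # (major, minor) tuple
        name = torch.cuda.get_device_name(i)
        lines.append(f"{name}: TORCH_CUDA_ARCH_LIST={arch_flag(cap)}")
    return lines


for line in describe_devices():
    print(line)
```

As with the device-side assert, the line shown in the traceback may not be where the failing kernel was launched, since CUDA errors are reported asynchronously.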

X-Lai · 2022-04-08T14:51:33Z

Have you compiled pointops in /lib? And can you locate which line causes this error?

wencc-ucas · 2022-04-08T15:25:15Z

I only compiled pointops2, following your instructions.

For one GPU:
File "/***/Stratified-Transformer/model/stratified_transformer.py", line 357, in forward
feats = self.kpconv(xyz, xyz, neighbor_idx, feats)

For multiple GPUs:
File "/***/Stratified-Transformer/model/stratified_transformer.py", line 291, in forward
new_window_size = 2 * torch.tensor([self.window_size]*3).type_as(xyz).to(xyz.device)

X-Lai · 2022-04-08T16:50:10Z

The error may be caused by the KPConv provided by torch-points3d. Did you install it successfully? Can you double-check that torch-points3d works correctly?

wencc-ucas · 2022-04-08T16:57:59Z

Thanks, I have checked it. This error does not appear when I use multiple GPUs, and the KPConv provided by torch-points3d works well.

X-Lai · 2022-04-08T17:01:54Z

Does it run successfully now? If you use one GPU, remember to add CUDA_VISIBLE_DEVICES=0 before your python command.
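
The same restriction can also be applied from inside Python, as long as it happens before CUDA is initialized (the safest place is before torch is imported). This is a minimal sketch; the helper name restrict_to_gpus is hypothetical and not part of this repository.

```python
import os


def restrict_to_gpus(gpu_ids):
    """Expose only the given GPU ids to CUDA by setting CUDA_VISIBLE_DEVICES.

    Must run before the first CUDA initialization in the process;
    changing the variable afterwards has no effect on visible devices.
    """
    value = ",".join(str(g) for g in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0.
restrict_to_gpus([0])
```

After this call, the single visible GPU appears to PyTorch as device 0 regardless of its physical index.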

wencc-ucas · 2022-04-08T17:21:33Z

I have solved the one-GPU bug by modifying this.

But I still get the error at line 291 with both one GPU and multiple GPUs.

YangParky · 2022-05-27T04:15:58Z

> I have solved the one-GPU bug by modifying this.
>
> But I still get the error at line 291 with both one GPU and multiple GPUs.

Hi, how did you modify it?

basil-hayden · 2022-06-26T13:42:23Z

> > I have solved the one-GPU bug by modifying this.
> > But I still get the error at line 291 with both one GPU and multiple GPUs.
>
> Hi, how did you modify it?

model = torch.nn.DataParallel(model.cuda()) ----> model = model.cuda()

praj441 · 2022-12-30T06:03:08Z

> > > I have solved the one-GPU bug by modifying this.
> > > But I still get the error at line 291 with both one GPU and multiple GPUs.
> >
> > Hi, how did you modify it?
>
> model = torch.nn.DataParallel(model.cuda()) ----> model = model.cuda()

But that way we can't use multi-GPU training. I am also getting this error when using model = torch.nn.DataParallel(model.cuda()).

Any suggestions?
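
One way to keep a single code path, sketched under the assumption that the model is an ordinary nn.Module: wrap it in DataParallel only when more than one GPU is visible. The helper name wrap_for_devices is hypothetical and this is not the repository's own code. It mirrors the workaround quoted above for the single-GPU case and does not by itself explain or fix the multi-GPU failure.

```python
import torch
from torch import nn


def wrap_for_devices(model: nn.Module) -> nn.Module:
    """Move the model to the GPU and add DataParallel only if it can help.

    Hypothetical helper:
      0 visible GPUs  -> model is returned unchanged (CPU)
      1 visible GPU   -> model.cuda(), no DataParallel wrapper
      2+ visible GPUs -> nn.DataParallel(model.cuda())
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return model
    model = model.cuda()
    if n_gpus > 1:
        model = nn.DataParallel(model)
    return model


# Example with a stand-in module; the real model would be passed instead.
model = wrap_for_devices(nn.Linear(4, 2))
```

Note that the multi-GPU traceback earlier in this thread goes through torch/nn/parallel/distributed.py, so that launch path uses DistributedDataParallel rather than DataParallel.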

wencc-ucas closed this as completed Apr 9, 2022

leenamx mentioned this issue Apr 11, 2024

Error when Training s3dis #98

Open
