CUDA error: device-side assert triggered #6

Closed
wencc-ucas opened this issue Apr 8, 2022 · 11 comments

Comments

@wencc-ucas

Hi, authors,

Thanks a lot for your awesome work.

I ran into the following error. Have you seen it before?

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 438, in forward
feats = layer(feats, xyz, batch, neighbor_idx)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 357, in forward
feats = self.kpconv(xyz, xyz, neighbor_idx, feats)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch_points3d/modules/KPConv/kernels.py", line 83, in forward
new_feat = KPConv_ops(
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch_points3d/modules/KPConv/convolution_ops.py", line 95, in KPConv_ops
neighborhood_features = gather(features, neighbors_indices)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch_points3d/core/common_modules/gathering.py", line 10, in gather
idx[idx == -1] = x.shape[0] - 1 # Shadow point
RuntimeError: CUDA error: device-side assert triggered
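
A note for anyone debugging the same trace: CUDA kernels run asynchronously, so the Python line reported with a device-side assert is not necessarily the one that failed. Rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 usually gives a more accurate location. Such asserts often come from an out-of-range index, so a bounds check on the neighbor indices before they reach gather may help. Below is a minimal sketch of one. The helper name find_bad_indices is hypothetical (it is not part of torch_points3d), and it assumes, based only on the traceback line above, that -1 marks a shadow point and that every other index must address a row of x.

```python
def find_bad_indices(neighbor_idx, num_rows):
    """List (row, col, value) for every index that cannot address a row of x.

    Hypothetical debugging helper, not part of torch_points3d. Assumes -1 is
    the shadow-point marker that gather() remaps, as the traceback suggests.
    """
    # Accept a tensor/array (anything with .tolist()) or plain nested lists.
    rows = neighbor_idx.tolist() if hasattr(neighbor_idx, "tolist") else neighbor_idx
    bad = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            # Anything other than -1 must lie in [0, num_rows).
            if value != -1 and not (0 <= value < num_rows):
                bad.append((r, c, value))
    return bad


# Example with 4 feature rows: index 7 and index -3 are out of range.
neighbors = [[0, 1, -1], [3, 7, 2], [-3, 0, 1]]
print(find_bad_indices(neighbors, num_rows=4))  # [(1, 1, 7), (2, 0, -3)]
```

On a tensor, this could be called as find_bad_indices(neighbor_idx.cpu(), x.shape[0]) just before the gather call. An empty list means the indices are in range. The loop is slow on large inputs, so it is meant for debugging only.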

@X-Lai
Collaborator

X-Lai commented Apr 8, 2022

Thank you for your interest in our work. Have you set the correct data path and the correct training GPUs?

@wencc-ucas
Author

Thank you for your prompt response. I have modified the data path and GPUs.

The error above occurs when I specify only one GPU. When I use multiple GPUs, I get the following error instead:

Traceback (most recent call last):
File "train.py", line 547, in <module>
main()
File "train.py", line 84, in main
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/mmvc/Congcong/Stratified-Transformer/train.py", line 308, in main_worker
loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch, scaler, scheduler)
File "/home/mmvc/Congcong/Stratified-Transformer/train.py", line 380, in train
output = model(feat, coord, offset, batch, neighbor_idx)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 449, in forward
feats, xyz, offset, feats_down, xyz_down, offset_down = layer(feats, xyz, offset)
File "/home/mmvc/anaconda3/envs/pytorch19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmvc/Congcong/Stratified-Transformer/model/stratified_transformer.py", line 291, in forward
new_window_size = 2 * torch.tensor([self.window_size]*3).type_as(xyz).to(xyz.device)
RuntimeError: CUDA error: invalid device function

@X-Lai
Collaborator

X-Lai commented Apr 8, 2022

Have you compiled the pointops in /lib? Also, can you locate which line causes this error?
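
Some background on the "invalid device function" message: the CUDA runtime documents it as the requested device function not existing or not being compiled for the proper device architecture, which is why recompiling the extension is a reasonable first check. Below is a minimal sketch for comparing a GPU's compute capability against a list of compiled architectures. The function names parse_arch and can_run are hypothetical. The inputs are assumed to have the shapes returned by torch.cuda.get_device_capability() (a (major, minor) tuple) and torch.cuda.get_arch_list() (strings such as sm_75 or compute_80). Note that get_arch_list() describes the PyTorch build itself, not a separately compiled extension such as pointops2, so the result is only a hint.

```python
def parse_arch(arch):
    """Split 'sm_86' or 'compute_80' into (kind, major, minor).

    Assumes plain numeric names; the last digit is the minor version.
    """
    kind, digits = arch.split("_", 1)
    return kind, int(digits[:-1]), int(digits[-1])


def can_run(capability, arch_list):
    """Return True if a device with this (major, minor) capability can run
    code built for at least one entry of arch_list."""
    major, minor = capability
    for arch in arch_list:
        kind, a_major, a_minor = parse_arch(arch)
        if kind == "sm":
            # Binary code runs on the same major version, equal or higher minor.
            if a_major == major and a_minor <= minor:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any equal or newer capability.
            if (a_major, a_minor) <= (major, minor):
                return True
    return False


print(can_run((8, 6), ["sm_70", "sm_75"]))       # False
print(can_run((8, 6), ["sm_80"]))                # True
print(can_run((8, 6), ["sm_70", "compute_70"]))  # True
```

When rebuilding an extension with torch.utils.cpp_extension, the TORCH_CUDA_ARCH_LIST environment variable controls which architectures are targeted.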

@wencc-ucas
Author

I only compiled pointops2, following your instructions.

With one GPU:
File "/***/Stratified-Transformer/model/stratified_transformer.py", line 357, in forward
feats = self.kpconv(xyz, xyz, neighbor_idx, feats)

With multiple GPUs:
File "/***/Stratified-Transformer/model/stratified_transformer.py", line 291, in forward
new_window_size = 2 * torch.tensor([self.window_size]*3).type_as(xyz).to(xyz.device)

@X-Lai
Collaborator

X-Lai commented Apr 8, 2022

The error may be caused by the KPConv provided by torch-points3d. Did you install it successfully? Can you double-check that torch-points3d works correctly on its own?

@wencc-ucas
Author

Thanks, I have checked it. When I use multiple GPUs, this error does not appear, and the KPConv provided by torch-points3d works fine.

@X-Lai
Collaborator

X-Lai commented Apr 8, 2022

Can you run it successfully now? If you use one GPU, remember to add CUDA_VISIBLE_DEVICES=0 before your python command.
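
The CUDA_VISIBLE_DEVICES advice above can also be applied from Python. Below is a minimal sketch, assuming the training script is launched through subprocess. The helper name single_gpu_env is hypothetical, and any train.py arguments are left out because the thread does not give them.

```python
import os


def single_gpu_env(gpu_id=0, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be set before CUDA is initialized in the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


env = single_gpu_env(0)
print(env["CUDA_VISIBLE_DEVICES"])  # prints 0
# To launch training (arguments omitted, not given in the thread):
# subprocess.run([sys.executable, "train.py"], env=env)
```

Inside the child process, the selected GPU is then the only visible device and appears as device 0.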

@wencc-ucas
Author

I have solved the single-GPU bug by modifying this.

But now I still get the error at line 291 with both one GPU and multiple GPUs.

@YangParky

> I have solved the single-GPU bug by modifying this.
>
> But now I still get the error at line 291 with both one GPU and multiple GPUs.

Hi, how did you modify it?

@basil-hayden

> > I have solved the single-GPU bug by modifying this.
> > But now I still get the error at line 291 with both one GPU and multiple GPUs.
>
> Hi, how did you modify it?

Change model = torch.nn.DataParallel(model.cuda()) to model = model.cuda()

@praj441

praj441 commented Dec 30, 2022

> > > I have solved the single-GPU bug by modifying this.
> > > But now I still get the error at line 291 with both one GPU and multiple GPUs.
> >
> > Hi, how did you modify it?
>
> Change model = torch.nn.DataParallel(model.cuda()) to model = model.cuda()

But that way we can't use multi-GPU training. I am also getting this error when using model = torch.nn.DataParallel(model.cuda()).

Any suggestions?
