
Error when trying to train: RuntimeError: cuda runtime error (59) : device-side assert triggered #243

Closed
Nacho114 opened this issue Dec 3, 2018 · 4 comments

Nacho114 commented Dec 3, 2018

❓ Questions and Help

I'm trying to train on a custom dataset with train.py from the tools folder. I followed the instructions to build a compatible dataloader (cross-checking against the COCO dataloader, all of the types/dims seem to be in order).

Running with CUDA_LAUNCH_BLOCKING=1 before python3 train.py (to get more informative output), I get the following error:

2018-12-03 17:06:30,676 maskrcnn_benchmark.trainer INFO: Start training
/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=111 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "relational_rxn_graphs/detector/train.py", line 228, in <module>
    main()
  File "relational_rxn_graphs/detector/train.py", line 221, in main
    model = train(cfg, data_cfg, args.local_rank, args.distributed)
  File "relational_rxn_graphs/detector/train.py", line 71, in train
    arguments,
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 23, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py", line 55, in forward
    [class_logits], [box_regression]
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py", line 139, in __call__
    classification_loss = F.cross_entropy(class_logits, labels)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/functional.py", line 1928, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/functional.py", line 1771, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111
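
From the assert message `t >= 0 && t < n_classes`, the failure looks like a target label falling outside the valid class range. A minimal sketch of that failure mode (illustrative only, not my actual model):

import torch
import torch.nn.functional as F

# Illustrative only: a target equal to (or above) the number of classes
# trips the `t >= 0 && t < n_classes` device-side assert on CUDA.
logits = torch.randn(4, 3, device="cuda")           # scores for 3 classes
labels = torch.tensor([0, 1, 2, 3], device="cuda")  # 3 is out of range for 3 classes
loss = F.cross_entropy(logits, labels)              # -> cuda runtime error (59)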

I have the following configuration:

PyTorch version: 1.0.0a0+5c89190
Is debug build: No
CUDA used to build PyTorch: 9.2.148

OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
GCC version: (GCC) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10 branch, based on subversion id 250395.
CMake version: version 2.8.12.2

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration:
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Nvidia driver version: 396.37
cuDNN version: Probably one of the following:
/usr/local/cudnn-8.0-v5.1/lib64/libcudnn.so.5.1.10
/usr/local/cudnn-8.0-v5.1/lib64/libcudnn_static.a
/usr/local/cudnn-8.0-v6.0/lib64/libcudnn.so.6.0.20
/usr/local/cudnn-8.0-v6.0/lib64/libcudnn_static.a
/usr/local/cudnn-9.0-v7.0/lib64/libcudnn.so.7.0.5
/usr/local/cudnn-9.0-v7.0/lib64/libcudnn_static.a
/usr/local/cudnn-9.1-v7.1.2/lib64/libcudnn.so.7.1.2
/usr/local/cudnn-9.1-v7.1.2/lib64/libcudnn_static.a
/usr/local/cudnn-9.2-v7.1.3/lib64/libcudnn.so.7.1.3
/usr/local/cudnn-9.2-v7.1.3/lib64/libcudnn_static.a
/usr/local/cudnn-9.2-v7.2.1/lib64/libcudnn.so.7.2.1
/usr/local/cudnn-9.2-v7.2.1/lib64/libcudnn_static.a

This is a continuation of #230; the initial bug there appears to have been fixed by reinstalling the library and torch.

fmassa commented Dec 3, 2018

This probably means that your class labels are larger than the number of outputs from the model.
Could you check that?
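
Something like this could verify it (just a sketch; it assumes a maskrcnn-benchmark-style dataset whose __getitem__ returns (image, target, idx) with a BoxList target carrying a "labels" field, and `dataset` is a placeholder for your dataset instance):

from maskrcnn_benchmark.config import cfg

# Sketch: verify every ground-truth label fits in [0, num_classes).
num_classes = cfg.MODEL.ROI_BOX_HEAD.NUM_CLASSES  # background included
for idx in range(len(dataset)):                   # `dataset` is your dataset instance
    _, target, _ = dataset[idx]
    labels = target.get_field("labels")
    if labels.numel() > 0 and labels.max().item() >= num_classes:
        print("out-of-range label at index", idx, ":", labels)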

Nacho114 commented Dec 4, 2018

It's working now.
Yeah, that is what I thought: I was using a dataloader that had a function returning the number of classes, and I did not realize you also have to set it in the config. I assumed the dataloader function was the standard way to get the number of classes... big mistake.
Thanks a lot!
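
For anyone else hitting this, the fix is to set the class count in the config before the model is built, e.g. (a sketch; the file name and value are just examples, and note NUM_CLASSES counts the background class):

from maskrcnn_benchmark.config import cfg

# Sketch: override the class count on the yacs config before building the model.
# NUM_CLASSES includes background, so N foreground classes -> N + 1.
cfg.merge_from_file("configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")  # example base config
cfg.merge_from_list(["MODEL.ROI_BOX_HEAD.NUM_CLASSES", 3])       # 2 classes + background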

Nacho114 closed this as completed Dec 4, 2018

jbitton commented Dec 14, 2018

@Nacho114 @fmassa I'm running into this error as well. In my config, I have ROI_BOX_HEAD.NUM_CLASSES set to 3. I'm using the R-50.pkl pretrained weights, so I figured I did not need to follow the instructions in #15. What else is missing? Do you need to set the number of classes within the dataset class? Any help would be appreciated.

fmassa commented Dec 14, 2018

Hi @jbitton,

I've replied to your other issue in #273. Let's continue the discussion there.
