❓ Questions and Help

I'm trying to train a custom dataset with `train.py` from the `tools` folder. I followed the instructions for building a compatible dataloader, and cross-checking against the COCO dataloader, all the types and dimensions appear to be in order.

Running with `CUDA_LAUNCH_BLOCKING=1 python3 train.py` (to get more informative output), I get the following error:
2018-12-03 17:06:30,676 maskrcnn_benchmark.trainer INFO: Start training
/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=111 error=59 : device-side assert triggered
Traceback (most recent call last):
File "relational_rxn_graphs/detector/train.py", line 228, in <module>
main()
File "relational_rxn_graphs/detector/train.py", line 221, in main
model = train(cfg, data_cfg, args.local_rank, args.distributed)
File "relational_rxn_graphs/detector/train.py", line 71, in train
arguments,
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
loss_dict = model(images, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
x, result, detector_losses = self.roi_heads(features, proposals, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 23, in forward
x, detections, loss_box = self.box(features, proposals, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py", line 55, in forward
[class_logits], [box_regression]
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py", line 139, in __call__
classification_loss = F.cross_entropy(class_logits, labels)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/functional.py", line 1928, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/functional.py", line 1771, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111
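The assertion `t >= 0 && t < n_classes` suggests a target label outside the range the classifier head expects, so by "cross-checking" above I mean something like the following target sanity check (a rough sketch, not my exact code; it assumes the dataset returns `(img, target, idx)` tuples where `target` is a `BoxList` with a `labels` field, as the COCO dataloader does):

```python
import torch

def check_targets(dataset, num_classes):
    # num_classes should match MODEL.ROI_BOX_HEAD.NUM_CLASSES (background included).
    for idx in range(len(dataset)):
        _, target, _ = dataset[idx]          # COCO-style datasets return (img, target, idx)
        labels = target.get_field("labels")  # 1D int64 tensor of per-box class ids
        assert labels.dtype == torch.int64, \
            "bad label dtype at {}: {}".format(idx, labels.dtype)
        assert labels.numel() == 0 or (labels.min() >= 0 and labels.max() < num_classes), \
            "label out of range at {}: min={}, max={}".format(idx, labels.min(), labels.max())
```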
I have the following configuration:
PyTorch version: 1.0.0a0+5c89190
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
GCC version: (GCC) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10 branch, based on subversion id 250395.
CMake version: version 2.8.12.2
Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration:
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB
Nvidia driver version: 396.37
cuDNN version: Probably one of the following:
/usr/local/cudnn-8.0-v5.1/lib64/libcudnn.so.5.1.10
/usr/local/cudnn-8.0-v5.1/lib64/libcudnn_static.a
/usr/local/cudnn-8.0-v6.0/lib64/libcudnn.so.6.0.20
/usr/local/cudnn-8.0-v6.0/lib64/libcudnn_static.a
/usr/local/cudnn-9.0-v7.0/lib64/libcudnn.so.7.0.5
/usr/local/cudnn-9.0-v7.0/lib64/libcudnn_static.a
/usr/local/cudnn-9.1-v7.1.2/lib64/libcudnn.so.7.1.2
/usr/local/cudnn-9.1-v7.1.2/lib64/libcudnn_static.a
/usr/local/cudnn-9.2-v7.1.3/lib64/libcudnn.so.7.1.3
/usr/local/cudnn-9.2-v7.1.3/lib64/libcudnn_static.a
/usr/local/cudnn-9.2-v7.2.1/lib64/libcudnn.so.7.2.1
/usr/local/cudnn-9.2-v7.2.1/lib64/libcudnn_static.a
This is a continuation of #230; the initial bug there seems to have been fixed by reinstalling the library and torch.
It's working now.
Yeah, that is what I thought. I was using a dataloader that had a function to return `nb_classes`, and I did not realize you also had to add it to the config; I thought the dataloader was the standard way to get `nb_classes`... big mistake.
Thanks a lot!
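For anyone else hitting this: the class count lives in the config as `MODEL.ROI_BOX_HEAD.NUM_CLASSES`, and as far as I understand it counts the background class too (the COCO default is 81 = 80 categories + background). A minimal sketch of overriding it in code instead of editing the YAML (the config path and the value 4 are just examples):

```python
from maskrcnn_benchmark.config import cfg

# Example only: 3 foreground classes + 1 background => NUM_CLASSES = 4.
cfg.merge_from_file("configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")  # whichever base config you train from
cfg.merge_from_list(["MODEL.ROI_BOX_HEAD.NUM_CLASSES", 4])
cfg.freeze()
```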
@Nacho114 @fmassa I'm running into this error as well. In my config, I have `ROI_BOX_HEAD.NUM_CLASSES` set to 3. I'm using the R-50.pkl pretrained weights, so I figured I do not need to follow the instructions in #15. What else is missing? Do you need to set the number of classes within the dataset class as well? Any help would be appreciated.
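One thing I'm double-checking on my side is the label convention: if I read the COCO dataloader right, 0 is reserved for background and foreground categories get contiguous ids starting at 1, roughly like this (a sketch with made-up category ids, not the repo's exact code):

```python
# Raw annotation category ids -> contiguous ids 1..N, leaving 0 for background.
category_ids = [7, 11, 23]  # hypothetical ids from my annotations
id_to_contiguous = {cat_id: i + 1 for i, cat_id in enumerate(sorted(category_ids))}
# Labels fed to the loss are then in {1, 2, 3}, and NUM_CLASSES should be
# len(category_ids) + 1 (the +1 is the background class).
```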