Meet RuntimeError in mmdet3d/ops/spconv/src/indice_cuda.cu #37

Closed
happinesslz opened this issue Jul 22, 2020 · 11 comments

@happinesslz

I am running into the same bug as #21.

@Tai-Wang
Member

Please follow the template for "error report" to describe your issue. We need more information to debug.

BTW, the referenced issue was simply an "out of memory" error caused by added code unrelated to the codebase. Please check whether something similar could be triggering this runtime error in your case.

@happinesslz
Author

@Tai-Wang Thanks for your reply.

Describe the bug
The only model I can run successfully is PointPillars, which does not use spconv, so spconv may be causing the error. I also tried reducing "samples_per_gpu" to 1, but I still get the same error. I am running the code on Titan V + CUDA 10.1 + PyTorch 1.5.1/1.3.1 + mmcv-full 1.0.3/1.0.2.

Reproduction
Did you make any modifications on the code or config? Did you understand what you have modified?
No.

Environment
sys.platform: linux
Python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-10.1
NVCC: Cuda compilation tools, release 10.1, V10.1.168
GPU 0,1: TITAN V
GCC: gcc (GCC) 5.4.0
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:

  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0a0+35d732a
OpenCV: 4.3.0
MMCV: 1.0.2
MMDetection: 2.3.0rc0+3c21dd0
MMDetection3D: 0.5.0+unknown
MMDetection3D Compiler: GCC 5.4
MMDetection3D CUDA Compiler: 10.1

Error traceback

2020-07-23 09:57:07,433 - mmdet - INFO - Start running, host: zliu@1035, work_dir: /home/zliu/mmdetection3d_1035/work_dirs/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class
2020-07-23 09:57:07,433 - mmdet - INFO - workflow: [('train', 1)], max: 40 epochs
Traceback (most recent call last):
  File "./tools/train.py", line 166, in <module>
    main()
  File "./tools/train.py", line 162, in main
    meta=meta)
  File "/home/zliu/mmdetection3d_1035/mmdetection/mmdet/apis/train.py", line 128, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 32, in train
    **kwargs)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 31, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/zliu/mmdetection3d_1035/mmdetection/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/models/detectors/base.py", line 60, in forward
    return self.forward_train(**kwargs)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/models/detectors/mvx_two_stage.py", line 266, in forward_train
    points, img=img, img_metas=img_metas)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/models/detectors/mvx_two_stage.py", line 201, in extract_feat
    pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/models/detectors/mvx_faster_rcnn.py", line 54, in extract_pts_feat
    x = self.pts_middle_encoder(voxel_features, feature_coors, batch_size)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/models/middle_encoders/sparse_encoder.py", line 97, in forward
    x = self.conv_input(input_sp_tensor)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/ops/spconv/modules.py", line 130, in forward
    input = module(input)
  File "/home/zliu/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/ops/spconv/conv.py", line 168, in forward
    grid=input.grid)
  File "/home/zliu/mmdetection3d_1035/mmdet3d/ops/spconv/ops.py", line 94, in get_indice_pairs
    int(transpose))
RuntimeError: /home/zliu/mmdetection3d_1035/mmdet3d/ops/spconv/src/indice_cuda.cu 124
cuda execution failed with error 2
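
A note on the last line of this traceback: in the CUDA runtime API, error code 2 is `cudaErrorMemoryAllocation` (out of memory), which is presumably why this report gets treated as a memory problem. The lookup below is only a minimal sketch. The table is a hand-picked subset of `cudaError_t` values, and `describe_cuda_error` is a hypothetical helper that is not part of mmdet3d or spconv.

```python
# Hand-picked subset of CUDA runtime error codes (cudaError_t).
# 700 is the value used by recent CUDA toolkits; older releases
# used a different number for the illegal-address error.
CUDA_ERRORS = {
    0: ("cudaSuccess", "no error"),
    2: ("cudaErrorMemoryAllocation", "out of memory"),
    700: ("cudaErrorIllegalAddress", "an illegal memory access was encountered"),
}


def describe_cuda_error(code):
    """Map a numeric CUDA runtime error code to its name and message."""
    name, message = CUDA_ERRORS.get(code, ("unknown", "not in this subset"))
    return f"{code}: {name} ({message})"


print(describe_cuda_error(2))
```

An allocation failure can also occur when reported usage looks moderate, for example if a single request is very large, so the code alone does not pin down the cause.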

@Tai-Wang
Member

Tai-Wang commented Jul 23, 2020

I do not see any problems in the information you provided. Could you monitor the CPU/GPU memory usage while the code runs, just to confirm whether this is an out-of-memory error? Please also tell us which config you are using.
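
One way to watch GPU memory from inside the training process is sketched below, assuming PyTorch is installed. `format_gib` and `gpu_memory_report` are hypothetical helper names; the `torch.cuda` calls are standard PyTorch functions. These counters only cover memory managed by PyTorch's caching allocator, so `nvidia-smi` remains the better view of total usage.

```python
def format_gib(num_bytes):
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024 ** 3:.2f} GiB"


def gpu_memory_report(device=0):
    """Return current, peak and total GPU memory, or None if CUDA is unusable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": format_gib(torch.cuda.memory_allocated(device)),
        "peak": format_gib(torch.cuda.max_memory_allocated(device)),
        "total": format_gib(torch.cuda.get_device_properties(device).total_memory),
    }


print(gpu_memory_report())
```

Calling `gpu_memory_report()` just before the failing forward pass, and again in an `except RuntimeError` handler, would show whether usage is close to the card's limit when the error is raised.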

@happinesslz
Author

@Tai-Wang
The config file is dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py in configs/mvxnet.

The CPU/GPU memory usage looks normal: the run uses only about 6G of GPU memory before the error occurs. I think 12G of GPU memory should be enough for this config. I also reduced "samples_per_gpu" to 1, but I still get the same error.

@Tai-Wang
Member

I have just checked this config, and it runs well on my 2080Ti using about 8.5G of GPU memory.

Maybe you can first check whether the process still holds GPU memory after the error happens. Then please try other configs that use spconv, such as the hard-voxelization config hv_second_secfpn_6x8_80e_kitti-3d-3class.py in configs/second.

@happinesslz
Author

Other configs that use spconv hit similar errors. Can you share the details of your conda environment? Thanks.

@Tai-Wang
Member

> Other configs that use spconv hit similar errors. Can you share the details of your conda environment? Thanks.

You can refer to this comment for my environment. In my experience, CUDA 10.0-10.2, PyTorch 1.4.0-1.5.1, Python 3.7 and GCC 5.4/5.5 should work.
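
The ranges quoted above can be checked mechanically. The sketch below is only an illustration: `parse_version`, `SUPPORTED` and `in_range` are hypothetical names, and the bounds are copied from this comment rather than from any official compatibility table.

```python
def parse_version(text):
    """Turn '1.5.1' or '1.8.0+cu111' into a tuple of ints such as (1, 5, 1)."""
    core = text.split("+")[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


# Ranges quoted in the comment above (inclusive on both ends).
SUPPORTED = {
    "cuda": ((10, 0), (10, 2)),
    "pytorch": ((1, 4, 0), (1, 5, 1)),
}


def in_range(name, version_text):
    """Check a version string against the quoted range for `name`."""
    low, high = SUPPORTED[name]
    # Compare only as many components as the bounds specify.
    version = parse_version(version_text)[:len(low)]
    return low <= version <= high


print(in_range("pytorch", "1.5.1"), in_range("cuda", "10.1"))
print(in_range("pytorch", "1.8.0+cu111"), in_range("cuda", "11.1"))
```

By this check, the first environment in this thread (PyTorch 1.5.1, CUDA 10.1) falls inside the quoted ranges, while the later one (PyTorch 1.8.0+cu111, CUDA 11.1) falls outside them.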

@happinesslz
Author

Thanks, I will try it on an RTX 2080 Ti.

@niezhongliang

> Thanks, I will try it on an RTX 2080 Ti.

Hi, I ran into the same problem when running pcd_demo.py on a server. However, it runs fine on my personal desktop. Have you solved it?

@WWW2323

WWW2323 commented Jun 13, 2021

> Thanks, I will try it on an RTX 2080 Ti.

Hi, did you run this project on a Titan V or 1080Ti before? I want to figure out whether this is caused by the GPU type, because I can run this project on a 2080Ti but it fails on a Titan V.
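
One thing that differs between these cards is compute capability: the Titan V is 7.0 (Volta) and the RTX 2080 Ti is 7.5 (Turing). The sketch below parses an "NVCC architecture flags" string like the one in the environment reports and checks whether a given capability has a matching `sm_XX` target. `compiled_sm_archs` and `is_covered` are hypothetical helpers, and the flags describe the PyTorch binary only; the targets the mmdet3d ops were compiled for depend on how the extension was built, so this is a starting point rather than a diagnosis.

```python
import re


def compiled_sm_archs(nvcc_flags):
    """Extract the sm_XX targets from an 'NVCC architecture flags' string."""
    return {int(match) for match in re.findall(r"code=sm_(\d+)", nvcc_flags)}


def is_covered(capability, nvcc_flags):
    """True if a (major, minor) compute capability has a matching sm_XX target."""
    major, minor = capability
    return major * 10 + minor in compiled_sm_archs(nvcc_flags)


# Flags abbreviated from the first environment report in this thread.
FLAGS = (
    "-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;"
    "-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75"
)

TITAN_V = (7, 0)      # compute capability 7.0 (Volta)
RTX_2080_TI = (7, 5)  # compute capability 7.5 (Turing)

print(is_covered(TITAN_V, FLAGS), is_covered(RTX_2080_TI, FLAGS))
```

On a live machine, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current GPU, which can be passed in place of the constants above.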

@ArthDh

ArthDh commented Aug 26, 2021

Hi,
Has there been any fix for this? I am facing a similar issue. I am able to train configs/pointpillars/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car.py
However, I run into:

Traceback (most recent call last):
  File "tools/train.py", line 233, in <module>
    main()
  File "tools/train.py", line 222, in main
    train_model(
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/apis/train.py", line 27, in train_model
    train_detector(
  File "/opt/conda/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 97, in new_func
    return old_func(*args, **kwargs)
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/models/detectors/base.py", line 58, in forward
    return self.forward_train(**kwargs)
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 272, in forward_train
    img_feats, pts_feats = self.extract_feat(
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 207, in extract_feat
    pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/models/detectors/mvx_faster_rcnn.py", line 56, in extract_pts_feat
    x = self.pts_middle_encoder(voxel_features, feature_coors, batch_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 97, in new_func
    return old_func(*args, **kwargs)
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/models/middle_encoders/sparse_encoder.py", line 112, in forward
    x = self.conv_input(input_sp_tensor)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/ops/spconv/modules.py", line 130, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/ops/spconv/conv.py", line 183, in forward
    out_features = Fsp.indice_subm_conv(features, self.weight,
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/ops/spconv/functional.py", line 64, in forward
    return ops.indice_conv(features, filters, indice_pairs,
  File "/home/svcl-oowl/arth/SVCL/mmdetection3d/mmdet3d/ops/spconv/ops.py", line 116, in indice_conv
    return sparse_conv_ext.indice_conv_fp32(features, filters,
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdca365a2f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fdca365767b in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fdca38b21f9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fdca36423a4 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e473a (0x7fdd1746673a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e47d1 (0x7fdd174667d1 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: __libc_start_main + 0xe7 (0x7fdd28f9cbf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

When trying to run:
configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py

My environment info is as follows:

2021-08-26 03:17:52,042 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0,1: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.0+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.9.0+cu111
OpenCV: 4.5.3
MMCV: 1.3.10
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.15.0
MMSegmentation: 0.16.0
MMDetection3D: 0.15.0+1b2e64c

Any help would be appreciated!
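
Because CUDA kernels launch asynchronously, an illegal memory access is often reported at a later call than the kernel that caused it. A common first step, sketched below, is to rerun with the `CUDA_LAUNCH_BLOCKING=1` environment variable so that the Python traceback points at the failing op. `blocking_env` is a hypothetical helper name, and the command is simply the failing config named above.

```python
import os


def blocking_env(base_env=None):
    """Copy an environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Command for the failing run; it can be launched with
# subprocess.run(TRAIN_CMD, env=blocking_env()).
TRAIN_CMD = [
    "python",
    "tools/train.py",
    "configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py",
]

print(blocking_env({})["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down considerably, so this setting is best kept to debugging runs.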
