iou3d failed when inference with gpu:1 #65

Closed
YeungLy opened this issue Aug 17, 2020 · 6 comments · Fixed by #69
Labels: bug (Something isn't working), P0

Comments

@YeungLy

YeungLy commented Aug 17, 2020

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
Training on a single GPU with the default gpu (gpu:0), everything is ok.
After switching to gpu:1, training is still ok, but inference reports an illegal memory access was encountered mmdet3d/ops/iou3d/src/iou3d.cpp 121.

Reproduction

  1. What command or script did you run?
python tools/train.py CONFIG_PATH --gpu-ids 1
  2. Did you make any modifications to the code or config? Did you understand what you modified?
  3. What dataset did you use?
  • kitti

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect the necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback
If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@Tai-Wang
Member

How did you switch to gpu:1? From your description, it seems that either your model or some part of the data (some tensors) was loaded onto gpu:0. Perhaps you can set CUDA_VISIBLE_DEVICES when running inference on different gpus.
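
For reference, a minimal sketch of how that remapping behaves (assuming the variable is set before the process initializes CUDA):

```python
import os

# Must be set before torch initializes CUDA, otherwise it has no effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

# Only physical GPU 1 is visible now, and it is reported as cuda:0:
# device indices inside the process are remapped, not absolute.
print(torch.cuda.device_count())  # 1
x = torch.zeros(1).cuda()         # allocated on physical GPU 1
print(x.device)                   # cuda:0
```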

@YeungLy
Author

YeungLy commented Aug 17, 2020

I trained with the command
python tools/train.py configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-car.py --gpu-ids 1
to use gpu:1 instead of gpu:0.

Setting CUDA_VISIBLE_DEVICES inside tools/train.py via os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_ids) works, as sketched below.
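
A minimal sketch of that workaround (hypothetical placement; set_visible_devices and the argument name gpu_ids are illustrative, not the real script's API):

```python
import os

def set_visible_devices(gpu_ids):
    # Restrict the process to the requested physical GPUs before any CUDA
    # context is created; torch then sees them as cuda:0, cuda:1, ...
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)

# e.g. early in tools/train.py, before the model or dataloaders are built:
# set_visible_devices(args.gpu_ids)
```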

@Tai-Wang
Member

Tai-Wang commented Aug 17, 2020

I have just tried to reproduce your case, but everything is ok on my machine. I tried both CUDA_VISIBLE_DEVICES=1 python tools/test.py ... and os.environ['CUDA_VISIBLE_DEVICES'] = str(1) in test.py, with my model trained on cuda:0.

Have you ever succeeded in running inference on the same gpu? Besides, your environment information and a detailed error traceback may help us debug.

@YeungLy
Author

YeungLy commented Aug 17, 2020

When I train a model with the command python tools/train.py ... --gpu-ids 0, both training and the evaluation intervals run normally, so training and inference on the same gpu succeed.

Following your suggestion, I ran export CUDA_VISIBLE_DEVICES='1' and then python tools/train.py ... --gpu-ids 0, and the problem was solved.

I also tried your way: export CUDA_VISIBLE_DEVICES=1, then python tools/test.py ... with a model trained on cuda:0. It also worked on my machine. However, in that case training and testing are separate processes; maybe you can reproduce by just running python tools/train.py ... --gpu-ids 1.

I'm not sure whether something goes wrong during training that makes it train the model on gpu:1 but then evaluate it on gpu:0 (the default gpu when CUDA_VISIBLE_DEVICES is not modified).

Environment:

sys.platform: linux
Python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GPU 0,1,2,3: TITAN RTX
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.6.1
OpenCV: 4.1.2
MMCV: 1.0.4
MMDetection: 2.3.0rc0+d06b352
MMDetection3D: 0.5.0+6526459
MMDetection3D Compiler: GCC 5.4
MMDetection3D CUDA Compiler: 10.0

Error traceback:
This error occurred when I did not modify CUDA_VISIBLE_DEVICES and just ran python tools/train.py ... --gpu-ids 1:
.....
2020-08-17 15:29:03,741 - mmdet - INFO - Epoch [2][900/1238] lr: 1.064e-03, eta: 8:39:17, time: 0.313, data_time: 0.010, memory: 5390, loss_cls: 0.1929, loss_bbox: 0.4137, loss_dir: 0.0672, loss: 0.6737, grad_norm: 5.6379
2020-08-17 15:29:19,667 - mmdet - INFO - Epoch [2][950/1238] lr: 1.068e-03, eta: 8:38:55, time: 0.318, data_time: 0.010, memory: 5390, loss_cls: 0.1989, loss_bbox: 0.4038, loss_dir: 0.0631, loss: 0.6658, grad_norm: 5.4268
2020-08-17 15:29:35,242 - mmdet - INFO - Epoch [2][1000/1238] lr: 1.071e-03, eta: 8:38:17, time: 0.312, data_time: 0.010, memory: 5390, loss_cls: 0.1826, loss_bbox: 0.3878, loss_dir: 0.0632, loss: 0.6336, grad_norm: 5.7179
2020-08-17 15:29:51,103 - mmdet - INFO - Epoch [2][1050/1238] lr: 1.074e-03, eta: 8:37:52, time: 0.317, data_time: 0.010, memory: 5390, loss_cls: 0.1891, loss_bbox: 0.3888, loss_dir: 0.0657, loss: 0.6436, grad_norm: 5.7038
2020-08-17 15:30:07,046 - mmdet - INFO - Epoch [2][1100/1238] lr: 1.077e-03, eta: 8:37:32, time: 0.319, data_time: 0.011, memory: 5390, loss_cls: 0.1867, loss_bbox: 0.3993, loss_dir: 0.0631, loss: 0.6491, grad_norm: 5.2929
2020-08-17 15:30:22,746 - mmdet - INFO - Epoch [2][1150/1238] lr: 1.080e-03, eta: 8:37:01, time: 0.314, data_time: 0.009, memory: 5390, loss_cls: 0.1909, loss_bbox: 0.3978, loss_dir: 0.0609, loss: 0.6497, grad_norm: 5.5580
2020-08-17 15:30:38,672 - mmdet - INFO - Epoch [2][1200/1238] lr: 1.084e-03, eta: 8:36:40, time: 0.319, data_time: 0.010, memory: 5390, loss_cls: 0.1796, loss_bbox: 0.3819, loss_dir: 0.0587, loss: 0.6202, grad_norm: 4.9306
2020-08-17 15:30:50,774 - mmdet - INFO - Saving checkpoint at 2 epochs
[ ] 0/3769, elapsed: 0s, ETA:GPUassert: an illegal memory access was encountered mmdet3d/ops/iou3d/src/iou3d.cpp 121

Config:
I was using the default config file provided in this repo: configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-car.py

@ZwwWayne
Collaborator

This indicates a bug in code that is not device agnostic. We will create a PR to fix it ASAP.

@ZwwWayne added the bug (Something isn't working) and P0 labels on Aug 17, 2020
@Tai-Wang
Member


This bug is a little subtle. It is caused by incorrect memory allocation when creating new tensors in the iou3d.cpp file, an operation borrowed from another codebase. We have fixed it by setting specific cuda device ids in the procedure. You can refer to the commit for more details. Please feel free to reopen this issue if you have any other questions.
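
The actual fix lives in the C++ extension, but the device-agnostic pattern it applies can be sketched in Python (illustrative only; boxes_iou3d and its shapes are placeholders, not the real op's signature):

```python
import torch

def boxes_iou3d(boxes_a: torch.Tensor, boxes_b: torch.Tensor) -> torch.Tensor:
    # Pin intermediate allocations and kernel launches to the *inputs'*
    # device instead of the current default (cuda:0). Allocating the output
    # on cuda:0 while the inputs live on cuda:1 is exactly the kind of
    # mismatch that triggers the illegal-memory-access assert.
    assert boxes_a.device == boxes_b.device
    with torch.cuda.device(boxes_a.device):
        # new_zeros inherits device and dtype from boxes_a.
        iou = boxes_a.new_zeros((boxes_a.size(0), boxes_b.size(0)))
        # ... the overlap kernel would be launched here ...
    return iou
```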
