iou3d failed when inference with gpu:1 #65

Closed
YeungLy opened this issue Aug 17, 2020 · 6 comments · Fixed by #69
Labels: bug (Something isn't working), P0

Comments

@YeungLy

YeungLy commented Aug 17, 2020

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
Training on a single GPU with the default gpu (gpu:0), everything is ok.
After switching to gpu:1, training is still ok, but inference reports an illegal memory access was encountered mmdet3d/ops/iou3d/src/iou3d.cpp 121.

Reproduction

  1. What command or script did you run?
python tools/train.py CONFIG_PATH --gpu-ids 1
  2. Did you make any modifications to the code or config? Did you understand what you modified?
  3. What dataset did you use?
  • kitti

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect the necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback
If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@Tai-Wang
Member

How did you switch to gpu:1? From your description, it seems that either your model or some part of the data (some tensors) was loaded onto gpu:0. Perhaps you can set CUDA_VISIBLE_DEVICES when running inference on different gpus.
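
For reference, a minimal sketch of how that remapping behaves (assuming the variable is set before the process initializes CUDA):

```python
import os

# Must be set before torch initializes CUDA, otherwise it has no effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

# Only physical GPU 1 is visible now, and it is reported as cuda:0:
# device indices inside the process are remapped, not absolute.
print(torch.cuda.device_count())  # 1
x = torch.zeros(1).cuda()         # allocated on physical GPU 1
print(x.device)                   # cuda:0
```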

@YeungLy
Author

YeungLy commented Aug 17, 2020

I trained with the command
python tools/train.py configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-car.py --gpu-ids 1
to use gpu:1 instead of gpu:0.

Setting CUDA_VISIBLE_DEVICES inside tools/train.py via os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_ids) works, as sketched below.
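
A minimal sketch of that workaround (hypothetical placement; set_visible_devices and the argument name gpu_ids are illustrative, not the real script's API):

```python
import os

def set_visible_devices(gpu_ids):
    # Restrict the process to the requested physical GPUs before any CUDA
    # context is created; torch then sees them as cuda:0, cuda:1, ...
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)

# e.g. early in tools/train.py, before the model or dataloaders are built:
# set_visible_devices(args.gpu_ids)
```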

@Tai-Wang
Member

Tai-Wang commented Aug 17, 2020

I have just tried to reproduce your case, but everything is ok on my machine. I tried both CUDA_VISIBLE_DEVICES=1 python tools/test.py ... and os.environ['CUDA_VISIBLE_DEVICES'] = str(1) in test.py, with my model trained on cuda:0.

Have you ever succeeded in running inference on the same gpu? Besides, your environment information and a detailed error traceback may help us debug.

@YeungLy
Author

YeungLy commented Aug 17, 2020

When I train a model with the command python tools/train.py ... --gpu-ids 0, both training and the evaluation intervals run normally, so training and inference on the same gpu succeed.

Following your suggestion, I ran export CUDA_VISIBLE_DEVICES='1' and then python tools/train.py ... --gpu-ids 0, and the problem was solved.

I also tried your way: export CUDA_VISIBLE_DEVICES=1, then python tools/test.py ... with a model trained on cuda:0. It also worked on my machine. However, in that case training and testing are separate processes; maybe you can reproduce by just running python tools/train.py ... --gpu-ids 1.

I'm not sure whether something goes wrong during training that makes it train the model on gpu:1 but then evaluate it on gpu:0 (the default gpu when CUDA_VISIBLE_DEVICES is not modified).

Environment:

sys.platform: linux
Python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GPU 0,1,2,3: TITAN RTX
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.6.1
OpenCV: 4.1.2
MMCV: 1.0.4
MMDetection: 2.3.0rc0+d06b352
MMDetection3D: 0.5.0+6526459
MMDetection3D Compiler: GCC 5.4
MMDetection3D CUDA Compiler: 10.0

Error traceback:
This error occurred when I did not modify CUDA_VISIBLE_DEVICES and just ran python tools/train.py ... --gpu-ids 1:
.....
2020-08-17 15:29:03,741 - mmdet - INFO - Epoch [2][900/1238] lr: 1.064e-03, eta: 8:39:17, time: 0.313, data_time: 0.010, memory: 5390, loss_cls: 0.1929, loss_bbox: 0.4137, loss_dir: 0.0672, loss: 0.6737, grad_norm: 5.6379
2020-08-17 15:29:19,667 - mmdet - INFO - Epoch [2][950/1238] lr: 1.068e-03, eta: 8:38:55, time: 0.318, data_time: 0.010, memory: 5390, loss_cls: 0.1989, loss_bbox: 0.4038, loss_dir: 0.0631, loss: 0.6658, grad_norm: 5.4268
2020-08-17 15:29:35,242 - mmdet - INFO - Epoch [2][1000/1238] lr: 1.071e-03, eta: 8:38:17, time: 0.312, data_time: 0.010, memory: 5390, loss_cls: 0.1826, loss_bbox: 0.3878, loss_dir: 0.0632, loss: 0.6336, grad_norm: 5.7179
2020-08-17 15:29:51,103 - mmdet - INFO - Epoch [2][1050/1238] lr: 1.074e-03, eta: 8:37:52, time: 0.317, data_time: 0.010, memory: 5390, loss_cls: 0.1891, loss_bbox: 0.3888, loss_dir: 0.0657, loss: 0.6436, grad_norm: 5.7038
2020-08-17 15:30:07,046 - mmdet - INFO - Epoch [2][1100/1238] lr: 1.077e-03, eta: 8:37:32, time: 0.319, data_time: 0.011, memory: 5390, loss_cls: 0.1867, loss_bbox: 0.3993, loss_dir: 0.0631, loss: 0.6491, grad_norm: 5.2929
2020-08-17 15:30:22,746 - mmdet - INFO - Epoch [2][1150/1238] lr: 1.080e-03, eta: 8:37:01, time: 0.314, data_time: 0.009, memory: 5390, loss_cls: 0.1909, loss_bbox: 0.3978, loss_dir: 0.0609, loss: 0.6497, grad_norm: 5.5580
2020-08-17 15:30:38,672 - mmdet - INFO - Epoch [2][1200/1238] lr: 1.084e-03, eta: 8:36:40, time: 0.319, data_time: 0.010, memory: 5390, loss_cls: 0.1796, loss_bbox: 0.3819, loss_dir: 0.0587, loss: 0.6202, grad_norm: 4.9306
2020-08-17 15:30:50,774 - mmdet - INFO - Saving checkpoint at 2 epochs
[ ] 0/3769, elapsed: 0s, ETA:GPUassert: an illegal memory access was encountered mmdet3d/ops/iou3d/src/iou3d.cpp 121

Config:
I was using the default config file provided in this repo: configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-car.py

@ZwwWayne
Collaborator

This indicates a bug in code that is not device agnostic. We will create a PR to fix it ASAP.

@ZwwWayne added the bug (Something isn't working) and P0 labels on Aug 17, 2020
@Tai-Wang
Member


This bug is a little subtle. It is caused by incorrect memory allocation when creating new tensors in the iou3d.cpp file, an operation borrowed from another codebase. We have fixed it by setting specific cuda device ids in the procedure. You can refer to the commit for more details. Please feel free to reopen this issue if you have any other questions.
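
The actual fix lives in the C++ extension, but the device-agnostic pattern it applies can be sketched in Python (illustrative only; boxes_iou3d and its shapes are placeholders, not the real op's signature):

```python
import torch

def boxes_iou3d(boxes_a: torch.Tensor, boxes_b: torch.Tensor) -> torch.Tensor:
    # Pin intermediate allocations and kernel launches to the *inputs'*
    # device instead of the current default (cuda:0). Allocating the output
    # on cuda:0 while the inputs live on cuda:1 is exactly the kind of
    # mismatch that triggers the illegal-memory-access assert.
    assert boxes_a.device == boxes_b.device
    with torch.cuda.device(boxes_a.device):
        # new_zeros inherits device and dtype from boxes_a.
        iou = boxes_a.new_zeros((boxes_a.size(0), boxes_b.size(0)))
        # ... the overlap kernel would be launched here ...
    return iou
```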
