Using apex in a Docker container with CUDA 10.1, cuDNN 7.6.5.32, Ubuntu 18.04, and PyTorch 1.4.0 (derived from nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04), I hit the error in the title during the scaled_loss.backward() call. The container is scheduled via Kubernetes on a DGX-1. Following the advice in #528, I set the TORCH_CUDA_ARCH_LIST environment variable to include compute capability 7.0 before installing.
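For context, the failing call follows the standard apex amp pattern; here is a minimal sketch of what segFp16.py presumably does around line 300 (the model, data, and names below are hypothetical stand-ins, not the actual script):

```python
import torch
from apex import amp

# Hypothetical stand-ins for the real model and data in segFp16.py.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(32, 128, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(inputs), targets)

# The RuntimeError surfaces when this context manager exits: amp unscales
# the gradients through the fused multi_tensor_apply kernel in amp_C, and
# that kernel apparently has no compiled image for the device it runs on.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```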
From the Dockerfile:
WORKDIR /apex-master
RUN export TORCH_CUDA_ARCH_LIST="6.0;7.0"
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
I'm including 6.0 here as well, since we have some P100 nodes available. The job I'm running appears to do fine with opt_level O0 on the DGX in the container. It also ran fine in the local host environment (non-Docker, P100 node) with both opt_level O0 and O1.
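One caveat about the snippet above (a Docker detail I haven't verified against this particular build): `RUN export TORCH_CUDA_ARCH_LIST=...` only lives for that single RUN layer, so the following `RUN pip install` may never see it, whereas `ENV TORCH_CUDA_ARCH_LIST="6.0;7.0"` persists across layers. A quick sanity check inside the running container, using only standard PyTorch calls:

```python
import os
import torch

# Whether the arch list is visible in this layer at all; if the Dockerfile
# only used RUN export, this typically prints None at runtime.
print("TORCH_CUDA_ARCH_LIST:", os.environ.get("TORCH_CUDA_ARCH_LIST"))

# CUDA toolkit version PyTorch was built against, as seen in the container.
print("torch.version.cuda:", torch.version.cuda)

# Compute capability of the GPU the job landed on: the DGX-1's V100s report
# (7, 0), the P100 nodes report (6, 0).
print("device capability:", torch.cuda.get_device_capability(0))
```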
Full stack trace:
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Traceback (most recent call last):
File "segFp16.py", line 300, in
scaled_loss.backward()
File "/usr/lib/python3.6/contextlib.py", line 88, in exit
next(self.gen)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/_process_optimizer.py", line 128, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/scaler.py", line 117, in unscale
1./scale)
File "/usr/local/lib/python3.6/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in call
*args)
RuntimeError: CUDA error: no kernel image is available for execution on the device (multi_tensor_apply at csrc/multi_tensor_apply.cuh:108)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fbff3dbe193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0xf83 (0x7fbfc7ae9fe3 in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0xcfe (0x7fbfc7ae7a8e in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x20627 (0x7fbfc7adc627 in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1af8c (0x7fbfc7ad6f8c in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: python3() [0x5081d5]
frame #9: python3() [0x5951c1]
frame #10: python3() [0x54ac01]
frame #12: python3() [0x50ab53]
frame #14: python3() [0x5081d5]
frame #15: python3() [0x50a020]
frame #16: python3() [0x50aa1d]
frame #18: python3() [0x5081d5]
frame #19: python3() [0x50a020]
frame #20: python3() [0x50aa1d]
frame #22: python3() [0x509ce8]
frame #23: python3() [0x50aa1d]
frame #25: python3() [0x58ee33]
frame #26: python3() [0x51412f]
frame #27: python3() [0x50a84f]
frame #30: python3() [0x5951c1]
frame #34: python3() [0x5081d5]
frame #36: python3() [0x635082]
frame #41: __libc_start_main + 0xe7 (0x7fbff96f2b97 in /lib/x86_64-linux-gnu/libc.so.6)
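To check whether the amp_C extension in this image actually contains an sm_70 kernel image, the CUDA toolkit's cuobjdump (which should be available in the -devel base image) can list the architectures embedded in the .so named in the traceback above; a sketch:

```python
import subprocess
import amp_C  # the compiled apex extension from the traceback

# cuobjdump lists every cubin/PTX image embedded in the shared object;
# "sm_70" needs to appear for the kernels to launch on the DGX-1's V100s.
result = subprocess.run(
    ["cuobjdump", "--list-elf", amp_C.__file__],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,  # Python 3.6-compatible alternative to text=True
)
print(result.stdout)
```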