Using opt_level O1 leads to RuntimeError: CUDA error: no kernel image is available for execution on the device #842

Closed · matlabninja opened this issue May 20, 2020 · 1 comment


matlabninja commented May 20, 2020

Using apex in a Docker container with CUDA 10.1, cuDNN 7.6.5.32, Ubuntu 18.04, and PyTorch 1.4.0 (derived from nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04), I'm met with the error in the title during the scaled_loss.backward() call. The container is scheduled via Kubernetes on a DGX-1. Following the advice in #528, I set the TORCH_CUDA_ARCH_LIST environment variable to include compute capability 7.0 before installing.
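
The relevant part of the script follows the standard apex AMP pattern, roughly like the following minimal sketch (placeholder model and data, not the actual segFp16.py):

import torch
from apex import amp

# Placeholder model and data; the real script uses its own network and loader.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(8, 16, device="cuda")
targets = torch.randint(0, 4, (8,), device="cuda")
loss = torch.nn.functional.cross_entropy(model(inputs), targets)

# The error is raised while exiting this context, where amp unscales the
# gradients with its fused multi_tensor_apply CUDA kernel.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()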

From the Dockerfile:

WORKDIR /apex-master
RUN export TORCH_CUDA_ARCH_LIST="6.0;7.0"
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

I'm including 6.0 here because we also have some P100 nodes available. The job I'm running works fine with opt_level O0 on the DGX in the container, and it also ran fine on the localhost environment (non-Docker, P100 node) with both opt_level O0 and O1.
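
As a quick sanity check inside the container, a minimal snippet like this prints the GPU's compute capability and whether TORCH_CUDA_ARCH_LIST is visible in the current environment (the value that actually matters is the one set while pip was building the apex extensions):

import os
import torch

# Compute capability of the first visible GPU: (7, 0) for V100, (6, 0) for P100.
major, minor = torch.cuda.get_device_capability(0)
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability: %d.%d" % (major, minor))

# Only shows the variable in the current environment; the build-time value is
# what determines which kernels were compiled into amp_C.
print("TORCH_CUDA_ARCH_LIST:", os.environ.get("TORCH_CUDA_ARCH_LIST", "<not set>"))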

Full stack trace:
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Traceback (most recent call last):
File "segFp16.py", line 300, in
scaled_loss.backward()
File "/usr/lib/python3.6/contextlib.py", line 88, in exit
next(self.gen)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/_process_optimizer.py", line 128, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/scaler.py", line 117, in unscale
1./scale)
File "/usr/local/lib/python3.6/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in call
*args)
RuntimeError: CUDA error: no kernel image is available for execution on the device (multi_tensor_apply at csrc/multi_tensor_apply.cuh:108)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fbff3dbe193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0xf83 (0x7fbfc7ae9fe3 in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0xcfe (0x7fbfc7ae7a8e in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x20627 (0x7fbfc7adc627 in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x1af8c (0x7fbfc7ad6f8c in /usr/local/lib/python3.6/dist-packages/amp_C.cpython-36m-x86_64-linux-gnu.so)

frame #7: python3() [0x5081d5]
frame #9: python3() [0x5951c1]
frame #10: python3() [0x54ac01]
frame #12: python3() [0x50ab53]
frame #14: python3() [0x5081d5]
frame #15: python3() [0x50a020]
frame #16: python3() [0x50aa1d]
frame #18: python3() [0x5081d5]
frame #19: python3() [0x50a020]
frame #20: python3() [0x50aa1d]
frame #22: python3() [0x509ce8]
frame #23: python3() [0x50aa1d]
frame #25: python3() [0x58ee33]
frame #26: python3() [0x51412f]
frame #27: python3() [0x50a84f]
frame #30: python3() [0x5951c1]
frame #34: python3() [0x5081d5]
frame #36: python3() [0x635082]
frame #41: __libc_start_main + 0xe7 (0x7fbff96f2b97 in /lib/x86_64-linux-gnu/libc.so.6)

matlabninja (Author) commented:

Update: I realized my mistake. Each RUN instruction in a Dockerfile runs in its own shell, so

RUN export TORCH_CUDA_ARCH_LIST="6.0;7.0"

only sets the variable for that single build step; it was gone by the time the pip install step built apex, so the extensions were not compiled for compute capability 7.0. I have updated my Dockerfile to

ENV TORCH_CUDA_ARCH_LIST 6.0;7.0

which persists across build steps and into the running container. Now AMP works as expected.
