
FusedLayerNorm leads to RuntimeError: CUDA error: no kernel image is available for execution on the device #605

Open
yangkky opened this issue Nov 15, 2019 · 4 comments


yangkky commented Nov 15, 2019

After a GPU tensor goes through FusedLayerNorm, the next time that memory is accessed I get a RuntimeError: CUDA error: no kernel image is available for execution on the device.

To reproduce:

import torch
from apex.normalization import FusedLayerNorm

norm = FusedLayerNorm(16)
device = torch.device('cuda:0')
norm = norm.to(device)

x = torch.randn(3, 4, 16)
x = x.to(device)
attended = norm(x)
print(x)  # RuntimeError: CUDA error: no kernel image is available for execution on the device

Other operations on attended or x will also raise the error. However, if I move x to the CPU, I can then proceed to use it without any problems.
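As a sanity check, here is a minimal sketch in a fresh process that never imports apex; native PyTorch kernels run fine on the same device, which isolates the failure to the compiled Apex extension:

import torch

# Hypothetical sanity check: apex is never imported here, so only native
# PyTorch kernels run. These succeed on the same device, pointing at the
# compiled Apex extension rather than the CUDA installation itself.
device = torch.device('cuda:0')
x = torch.randn(3, 4, 16, device=device)
print(torch.nn.functional.layer_norm(x, (16,)).sum())  # works
print(torch.cuda.get_device_name(0))  # e.g. 'Tesla V100-SXM2-16GB' on a p3.2xlarge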

I'm running this on an AWS p3.2xlarge instance based on the AWS Deep Learning AMI (Ubuntu 18.04) Version 25.0. We've updated PyTorch to 1.3.0 and installed GPUtil, Apex, and gpustat using the following commands:

source activate pytorch_p36

# Update to the latest PyTorch 1.3 (with CUDA 10.0 instead of 10.1, because the AMI/env doesn't have 10.1 installed)
conda install pytorch==1.3.0 torchvision==0.4.1 cudatoolkit=10.0 -c pytorch -y

# Install GPUtil
pip install GPUtil

# Install NVIDIA Apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

# Install gpustat
pip install gpustat

Doing the same thing on an AWS p2.xlarge instance (which has a K80 rather than the p3's V100) with the same changes to the environment does not cause the error.

mcarilli (Contributor) commented

Recent changes to PyTorch's built-in extension builder sometimes lead to it compiling for the wrong architecture. Try explicitly setting the list of compute capabilities you want to target, e.g.:

$ export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.5"
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The 6.0...7.5 list may be trimmed down to only the compute capabilities you know you want to target. For example, if you will only run on Voltas, export TORCH_CUDA_ARCH_LIST="7.0".
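After rebuilding, a quick way to confirm the fix is to re-run the original reproduction (a sketch, using the same shapes as the report above):

import torch
from apex.normalization import FusedLayerNorm

# Post-rebuild check: with the extension compiled for the right compute
# capability, both the fused forward pass and later use of the input
# tensor should now succeed.
device = torch.device('cuda:0')
norm = FusedLayerNorm(16).to(device)
x = torch.randn(3, 4, 16, device=device)
print(norm(x))  # no longer raises "no kernel image is available"
print(x)        # the input tensor stays usable afterwards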

mcarilli (Contributor) commented

Also, we are in the process of evaluating PyTorch's native layer norm and upstreaming Apex's implementation if necessary, so for future-proofing I recommend just using the native PyTorch layer norm.
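For the reproduction above, the swap is one line; torch.nn.LayerNorm is the built-in equivalent:

import torch

# torch.nn.LayerNorm ships with PyTorch itself, so no custom CUDA
# extension (and no architecture-targeting rebuild) is involved.
device = torch.device('cuda:0')
norm = torch.nn.LayerNorm(16).to(device)
x = torch.randn(3, 4, 16, device=device)
print(norm(x).shape)  # torch.Size([3, 4, 16])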


yangkky commented Nov 15, 2019

Explicitly setting the architectures seems to fix it.

Out of curiosity, is there a place that lists what each of those architectures is?


wseaton commented Nov 27, 2019

@yangkky and for anyone from the future, the CUDA Wikipedia page has a good feature table that can help you figure out how to pin TORCH_CUDA_ARCH_LIST.
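Alternatively, here is a short sketch that skips the table lookup and asks PyTorch directly for the local device's compute capability:

import torch

# The (major, minor) compute capability reported here is exactly the
# value TORCH_CUDA_ARCH_LIST expects, e.g. (7, 0) -> "7.0" on a V100.
major, minor = torch.cuda.get_device_capability(0)
print(f"TORCH_CUDA_ARCH_LIST={major}.{minor}")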
