upfirdn2d_plugin Problem #39

Closed
ghost opened this issue Feb 17, 2021 · 46 comments

@ghost

ghost commented Feb 17, 2021

Describe the bug
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Please stop closing people's issues without a confirmed fix for this problem. #2 (comment) does not work, and that issue was closed without a confirmed fix.

Please be serious about it and let's work together on a fix instead of ignoring the problem and referring people to a closed topic that does not offer any solution to their problem.

We tried everything proposed. We also tried both CUDA 11.0 and 11.1, with different versions of PyTorch, just in case.
We are a team of 5 people and we all had the same problem on both Windows and Linux machines, and even in Google Colab, which tells me that this is more than just a configuration problem.

And no, %pip install ninja did not solve the problem on any of the machines we have in our lab.
Also, using verbosity = 'full' does not seem to include any additional helpful information.

Desktop (please complete the following information):

These are the two machines I used:

Machine 1

  • Ubuntu 20.04.1
  • PyTorch 1.7.1
  • CUDA 11.1
  • RTX 3090

Machine 2

  • Windows 10
  • PyTorch 1.7.1
  • CUDA 11.1, also tried with CUDA 11.0
  • CUDA toolkit version (e.g., CUDA 11.0)
  • NVIDIA driver version 461.40
  • RTX 3090
@nurpax
Contributor

nurpax commented Feb 17, 2021

Ok, thanks for filing a separate bug. I’ll keep this one open. There are multiple different problems filed into separate bugs with comments about separate issues added into the same bug. So it gets messy.

Trying to use class AugmentPipe in my project.
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Can you give a bit of detail about your project structure? How are you making use of AugmentPipe in your project?

Is unmodified stylegan2-ada-pytorch project working for you?

@ghost
Author

ghost commented Feb 18, 2021

Hi @nurpax, disregard the first line; it was written for something else. I updated my issue with more details.

@nurpax
Contributor

nurpax commented Feb 18, 2021

Just double checking: your version of stylegan2-ada-pytorch is unmodified and it still does not work?

If you run it in Docker, does it work then? Most users have no issue when running in Docker so you should check if that works and report here. (I understand some people don’t like using Docker but it’s good debug info to check if it works or not.)

Clearly one of the key problems with these custom extensions is that when something goes wrong in their build or first use, the error message throws away too much information about what exactly went wrong.

@ghost
Author

ghost commented Feb 18, 2021

Yes, correct, I haven't made any changes to it. Just this morning I cleaned out my driver and did a fresh install, created a new Anaconda env, and downloaded a fresh copy from this repo, but the same problem happens. I don't know why.

@nurpax
Contributor

nurpax commented Feb 18, 2021

I think you've done this step but I'm adding it here for completeness, even if it may sound like I'm just repeating the same thing over and over.

The simplest form of getting this error:

"Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!"

is when there's no ninja installed with pip install ninja or conda install ninja. The error message unfortunately doesn't give any indication that ninja is missing.
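
A quick way to check whether ninja is visible to the interpreter that runs the project is sketched below. It uses only the standard library; the helper name `ninja_version` is made up for this example, and it only tests that a `ninja` executable on PATH can be run, which is an assumption about what the build needs rather than a copy of PyTorch's own check.

```python
import shutil
import subprocess


def ninja_version():
    """Return the ninja version string, or None if ninja cannot be run."""
    exe = shutil.which("ninja")
    if exe is None:
        return None  # not on PATH for this interpreter/environment
    try:
        out = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # found on PATH but not runnable
    return out.stdout.strip()


if __name__ == "__main__":
    version = ninja_version()
    print("ninja:", version if version else "NOT FOUND (try `pip install ninja`)")
```

Running this from the same environment (for example the same Jupyter kernel) that fails is the point: a ninja installed into a different environment would not be found.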

I'm mentioning this here as %pip install ninja from the bug description seems to refer to Colab.

Also: can you please confirm that it works for you in Docker?

@ghost
Author

ghost commented Feb 18, 2021

I actually tried both pip install ninja and conda install ninja, with the same outcome.
As for Docker, no, I haven't tried it.

@SofianeBenkara

SofianeBenkara commented Feb 18, 2021

@nurpax

I have been dealing with the same problem.

When I try to generate, it works fine but it is slow, and this is my output:

Loading networks from "../../Data/ffhq.pkl"...
Generating image for seed 8201 (0/1) ...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Then at the end it does generate an image successfully.

For training and projecting, I get this:

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Evaluating metrics...

It then gets stuck at Evaluating metrics... and then the kernel dies.

When I try to project, I get this:

Loading networks from "..\..\Data\ffhq.pkl"...
Computing W midpoint and stddev using 10000 samples...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metrics/vgg16.pt ... done
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

This continues for a few minutes, then the kernel also dies.

I hope this helps. I have tried all the solutions proposed in the other issues that were opened and was not able to get this working. I have read on Reddit that other people are having the same problem, and no one is sure what the cause is.

@nurpax
Contributor

nurpax commented Feb 18, 2021

What seems to be happening is that either the extension build fails or the built extension is not able to run. The PyTorch code will then try to fall back to a reference implementation that is slower. It looks like this fallback mechanism is not working too well, as it's trying to build on every invocation. This probably explains why it's so slow.
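
To illustrate what "building on every invocation" means, here is a minimal sketch of an init-once pattern. It is not the repository's code: `init_once` and `failing_build` are hypothetical names, and the build step is simulated. The only idea shown is that setting the flag before the build attempt makes a failure stick, so the build is not retried on every call.

```python
_inited = False
_plugin = None


def init_once(build_plugin):
    """Try to build the plugin a single time and remember the outcome."""
    global _inited, _plugin
    if not _inited:
        _inited = True  # set before building so a failure is not retried
        try:
            _plugin = build_plugin()
        except Exception as exc:
            print(f"build failed, using reference implementation: {exc}")
    return _plugin is not None


attempts = []


def failing_build():
    # Hypothetical build step that always fails, for demonstration only.
    attempts.append(1)
    raise RuntimeError("simulated build failure")


results = [init_once(failing_build) for _ in range(3)]
print(results, "build attempts:", len(attempts))
```

If the flag were never set, each of the three calls would attempt the build again, which would match the repeated "Failed!" lines in the logs above.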

I'd prefer if we'd find a real fix for this, of course, but here's one thing you could try. You could force the custom ops to always use the slower reference path. This will be slower but it should work.

I haven't tried this in a while, but I think you can force the reference implementation by editing the below function (and all the other similar _init functions in that folder):

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/torch_utils/ops/bias_act.py#L41

def _init():
    global _inited, _plugin
    if not _inited:
        _inited = True
        sources = ['bias_act.cpp', 'bias_act.cu']
        sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
        try:
            _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
        except:
            warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
    return _plugin is not None

to just:

def _init():
    return False

@SBenkara is your repro on Docker or native installation of PyTorch and CUDA? What about the folks on Reddit?

@SofianeBenkara

@nurpax
I haven't tried Linux or Docker. I am using Windows 10 with an RTX 3090 GPU and a native installation of PyTorch and CUDA 11.1, and I followed all the steps in the README.

I will try to find the Reddit post and link it, but most people there were using Windows/Linux and I don't remember seeing any Docker-related issues.

Any idea why the extension build is failing? Are there any logs I can get that would help?

I will make the changes you suggested for now, until we fix this issue.

@SofianeBenkara

SofianeBenkara commented Feb 18, 2021

I have some updates; hopefully they can help in pinpointing the problem.

I forgot to mention that I was using Jupyter Notebook. I am not sure what difference it makes, but I didn't have any of those issues when I tried using a command line or PyCharm. I just did a pip install and everything started working flawlessly.

The problem might be related to either Jupyter Notebook or Anaconda. I created more environments to make sure it was not a problem with my Anaconda env, but they all failed.

So I made the changes you suggested. It printed fewer lines of Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!, but I still could not do a projection or training: the kernel continued to stop, and I was still getting the same error from other parts of the code, mainly from \upfirdn2d.py.

No module named 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Edit:
After working just fine from my command line for a few minutes, it's now back to throwing the same error message without me making any changes.

UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools fir



No module named 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
C:\Users\admin\Google Drive\stylegan2-ada-pytorch-main\stylegan2-ada-pytorch-main\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

@nurpax
Contributor

nurpax commented Feb 18, 2021

@SBenkara @DarXT3mpla4 can you try patching your stylegan2-ada-pytorch code as follows:

diff --git a/torch_utils/ops/bias_act.py b/torch_utils/ops/bias_act.py
index b092c7f..b6190f8 100755
--- a/torch_utils/ops/bias_act.py
+++ b/torch_utils/ops/bias_act.py
@@ -44,10 +44,7 @@ def _init():
         _inited = True
         sources = ['bias_act.cpp', 'bias_act.cu']
         sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
-        try:
-            _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
-        except:
-            warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
+        _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
     return _plugin is not None
 
 #----------------------------------------------------------------------------
diff --git a/torch_utils/ops/upfirdn2d.py b/torch_utils/ops/upfirdn2d.py
index f768b2c..76ac2d6 100755
--- a/torch_utils/ops/upfirdn2d.py
+++ b/torch_utils/ops/upfirdn2d.py
@@ -28,10 +28,7 @@ def _init():
     if not _inited:
         sources = ['upfirdn2d.cpp', 'upfirdn2d.cu']
         sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
-        try:
-            _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
-        except:
-            warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
+        _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
     return _plugin is not None
 
 def _parse_scaling(scaling):

I.e., remove try/excepts from around the custom_ops.get_plugin() call.

It looks like some exception info is getting lost with the way try/except is written. For example, if I rename my ninja executable in my anaconda3 dirs and rerun with this change, I get a more informative stacktrace. With some luck, maybe this will reveal some new information about the error you are seeing.

Generating image for seed 85 (0/4) ...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py:50: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:

Ninja is required to load C++ extensions
  warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Traceback (most recent call last):
  File "generate.py", line 127, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "generate.py", line 119, in generate_images
    img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 491, in forward
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 463, in forward
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 397, in forward
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 291, in forward
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "<string>", line 72, in modulated_conv2d
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/conv2d_resample.py", line 139, in conv2d_resample
    x = upfirdn2d.upfirdn2d(x=x, f=f, padding=[px0+pxt,px1+pxt,py0+pyt,py1+pyt], gain=up**2, flip_filter=flip_filter)
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 159, in upfirdn2d
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 31, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 986, in load
    return _jit_compile(
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1193, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1268, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1323, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
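
The difference between the two reports can be reproduced in isolation. The sketch below uses only the standard library; `failing_build` is a hypothetical stand-in for the extension build, and the error message is borrowed from the traceback above. It compares `str(sys.exc_info()[1])`, which the warning uses, with `traceback.format_exc()`.

```python
import sys
import traceback


def failing_build():
    # Hypothetical stand-in for the extension build.
    raise RuntimeError("Ninja is required to load C++ extensions")


try:
    failing_build()
except Exception:
    short_msg = str(sys.exc_info()[1])  # message only, no call stack
    full_msg = traceback.format_exc()   # message plus the call stack

print(short_msg)
print(full_msg)
```

The short form keeps only the exception's message, so for exceptions whose message is terse (such as a failed import) almost nothing useful survives; the full form also shows where the failure was raised.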

@SofianeBenkara

This is what I am getting now. Also, it just crashes without any output.

C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Error building extension 'upfirdn2d_plugin': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=upfirdn2d_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --use_fast_math -c "C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.cu" -o upfirdn2d.cuda.o 
FAILED: upfirdn2d.cuda.o 
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=upfirdn2d_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --use_fast_math -c "C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.cu" -o upfirdn2d.cuda.o 
nvcc fatal   : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

No module named 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))

@nurpax
Contributor

nurpax commented Feb 19, 2021

@SBenkara I guess you left the warnings.warn line there? My patch above had that taken out too.

Nevertheless, the error is a little more apparent now (emphasis mine):

Error building extension 'upfirdn2d_plugin': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\ v11.0 \bin
nvcc fatal : Unsupported gpu architecture 'compute_86'

CUDA 11.0 does not support compiling for the compute_86 arch; to build for compute_86, you need CUDA 11.1. You can see from the above that it's building with the CUDA 11.0 nvcc.

Another way to verify which compiler versions and flags are actually used is to check the build.ninja files under ~/.cache/torch_extensions/ (e.g., bias_act_plugin/build.ninja). I'm not sure where exactly this file resides on Windows. Please attach or copy&paste the full contents of one of these files here.
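
For anyone who prefers not to read the file by eye, a small sketch for pulling the relevant facts out of a build.ninja file follows. It assumes the file contains an `nvcc = <path>` variable line and `-gencode=arch=compute_NN` flags like the ones in the build command quoted above; the function name and the sample text (including its nvcc path) are made up for the example.

```python
import re


def summarize_build_ninja(text):
    """Extract the nvcc path and target architectures from build.ninja text."""
    nvcc = None
    for line in text.splitlines():
        if line.startswith("nvcc ="):
            nvcc = line.split("=", 1)[1].strip()
            break
    archs = sorted(set(re.findall(r"-gencode=arch=(compute_\d+)", text)))
    return nvcc, archs


# Made-up sample; in practice, read the real file's contents instead.
sample = (
    "ninja_required_version = 1.3\n"
    "nvcc = /usr/local/cuda-11.0/bin/nvcc\n"
    "cuda_cflags = -gencode=arch=compute_86,code=sm_86 --use_fast_math\n"
)
print(summarize_build_ninja(sample))
```

A CUDA 11.0 path combined with compute_86 in the result would point to the mismatch described above.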

@SofianeBenkara

ninja_required_version = 1.3
cxx = cl
nvcc = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc

cflags = -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc
post_cflags = 
cuda_cflags = -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --use_fast_math
cuda_post_cflags = 
ldflags = /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\anaconda3\envs\ptx\libs /LIBPATH:C:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib/x64" cudart.lib

rule compile
  command = cl /showIncludes $cflags -c $in /Fo$out $post_cflags
  deps = msvc

rule cuda_compile
  command = $nvcc $cuda_cflags -c $in -o $out $cuda_post_cflags

rule link
  command = "C$:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\Hostx64\x64/link.exe" $in /nologo $ldflags /out:$out

build bias_act.o: compile C$:\Users\admin\Google$ Drive\stylegan2-ada-pytorch-main\torch_utils\ops\bias_act.cpp
build bias_act.cuda.o: cuda_compile C$:\Users\admin\Google$ Drive\stylegan2-ada-pytorch-main\torch_utils\ops\bias_act.cu

build bias_act_plugin.pyd: link bias_act.o bias_act.cuda.o

default bias_act_plugin.pyd

@nurpax
Contributor

nurpax commented Feb 19, 2021

Yes, definitely confirms that 11.0 is being used instead of 11.1.

What you will need to do is install the CUDA 11.1 toolkit from NVIDIA and make sure that you set it up so that the 11.1 version comes first in PATH. E.g., try running "nvcc --version" and check that it's the right version. On my computer this reports something like this:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
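
Since `nvcc --version` only reports the first nvcc found on PATH, it can help to list every copy in search order. The sketch below does that with the standard library only; `executables_on_path` is a name made up for this example. It also prints CUDA_HOME and CUDA_PATH, on the assumption (not confirmed in this thread) that the extension build may consult those variables as well as PATH.

```python
import os


def executables_on_path(names=("nvcc", "nvcc.exe"), path=None):
    """List every match for the given file names, in PATH search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and candidate not in found:
                found.append(candidate)
    return found


if __name__ == "__main__":
    # Assumption: these variables may influence which toolkit gets used.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        print(f"{var} = {os.environ.get(var)}")
    for exe in executables_on_path():
        print(exe)
```

More than one line of output, or an environment variable pointing at a different toolkit than the first nvcc listed, would suggest that several CUDA versions are competing.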

@SofianeBenkara

My nvcc --version returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:12:04_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.relgpu_drvr455TC455_06.29069683_0

All my environment variables are pointing to cuda_11.1.

I don't understand where 11.0 is coming from. I used it before but then switched to 11.1.

I deleted bias_act_plugin/build.ninja and tried again, and indeed it shows 11.0.

I will keep you posted.

@SofianeBenkara

@nurpax you were 100% right. Even though my nvcc --version was returning 11.1, somehow 11.0 was being used.

I had both versions installed on my computer, but my environment was only pointing to 11.1.
After uninstalling the 11.0 version and rebooting the computer, everything is working great, without any issues!

Thank you so much!

nurpax pushed a commit that referenced this issue Feb 19, 2021
Print full traceback when custom extension build fails.

Also allow pytorch 1.9 so that this runs against pytorch upstream
devel builds.

issues #2, #28, #35, #37, #39
@nurpax
Contributor

nurpax commented Feb 19, 2021

I pushed change 2506395 that improves error reporting. Hopefully custom extension build errors get correctly reported now and root causing these problems will be easier.

@ghost
Author

ghost commented Feb 21, 2021

@nurpax I had three versions of CUDA installed (10, 11.1 and 11.2), as I was using them for other projects. I had to delete the other versions to make the project work. Thanks for helping. I wonder if it has anything to do with this.

@nurpax
Contributor

nurpax commented Feb 21, 2021

Great!

I wonder if it has anything to do with this

I can’t tell without seeing logs with exception info or build.ninja files for failed attempts.

At least in SBenkara’s case, a wrong version of nvcc was chosen. I assume there were multiple CUDA versions in PATH. I don’t know if there are bugs in CUDA tools discovery code in PyTorch.

@avshalomman

@nurpax Jumping on this thread since I think I'm experiencing something related, hope it's ok...

I'm training on Colab, using the following command:
python train.py --outdir=training-runs --data=/dataset --gpus=1 --cfg=paper256 --mirror=1 --resume=ffhq256 --snap=1

The dataset contains only 10 photos, so I'm basically trying transfer learning with small data.

I encountered the "Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!" issue at first, which I resolved by installing ninja.

However, training speed is still super-slow, and the issue seems to be in the "Evaluating metrics" part.
I did as you suggested and edited the init methods of the custom cuda ops, and it didn't help.

These are the evaluation stats:
"metric": "fid50k_full", "total_time": 813.027854681015, "total_time_str": "13m 33s"
This seems awfully slow for a 10-image dataset.

I'm running on Colab, CUDA version 11.2, T4 GPU.

Thanks in advance!

@nurpax
Contributor

nurpax commented Feb 24, 2021

@avshalomman Please file separate bugs for separate issues. You can try with --metrics=none; most likely it's computing metrics that's taking a long time for you.

Closing this bug as both plugin issues seem to have been resolved.

@cunicode

cunicode commented May 9, 2021

Installing gcc on the Linux machine solved the "No module named 'upfirdn2d_plugin'" error for me.

Check if you have gcc: gcc --version
If not, install it with sudo apt install build-essential

BrandoZhang added a commit to BrandoZhang/alis that referenced this issue May 26, 2021
Solve permission issue of `upfirdn2d_plugin` compilation. 
See NVlabs/stylegan2-ada-pytorch#39 and pytorch/pytorch@1301384 .
@lucky7323

Just installing ninja worked for me.
pip3 install ninja

@metaphorz

I am running on a CentOS platform and got the stylegan2-ada-pytorch notebook to work fine, except when it reaches the training stage "python train.py ....". I am getting errors for both bias_act_plugin and upfirdn2d_plugin. I have tried some of the suggestions here but wonder if there is a resolution? Perhaps I am not using the right version of CUDA or PyTorch? My PyTorch is 1.7.1. Here is where the errors and tracebacks begin:

Constructing networks...
starting G epochs: 0.0
starting G epochs: starting G epochs: 0.00.0
starting G epochs: 0.0
Resuming from "./pretrained/wikiart.pkl"
Setting up PyTorch plugin "bias_act_plugin"... Failed!
....deleted path...orch_utils/ops/bias_act.py:50: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation.

@thusinh1969

Remove ~/.cache/torch_extensions/* if you have installed a new version of torch, torchvision or anything else between two runs. Re-running train.py will rebuild those plugins.

Took me a couple of hours!

Steve

@youngjae-git

Remove ~/.cache/torch_extensions/* if you have installed some new version of torch or torch vision or whatever in between 2 run. Re-run train.py will rebuild those plugins.

Took me a couple of hours!

Steve

Thank you Steve!!
That finally solved the problem.

@alirezag

alirezag commented Aug 3, 2021

Simply installing ninja solved this for me. I'm on cuda 11.1.

@stossenbrink

Hope this helps someone: I solved this issue by installing nvidia-cuda-toolkit (via apt), removing ninja from my pipenv and installing it again. After restarting my Jupyter Python kernel, the modules were built.

@lennysunreal

lennysunreal commented Aug 18, 2021

Hope this helps someone: I solved this issue by installing nvidia-cuda-toolkit (via apt), removed ninja from my pipenv and installed it again. After restarting my jupyter python kernel, the modules where built.

Sorry, I'm a complete noob.
How do I uninstall and re-install ninja?
Also, when you say "installing nvidia-cuda-toolkit (via apt)", do you mean just downloading the latest Windows toolkit exe and installing it, or do you mean installing it via the command line in PowerShell?

@alirezag

Are you familiar with pip? pip uninstall ninja should do it.

@darrelfrancis

darrelfrancis commented Aug 22, 2021

Summary of steps I carried out that worked

  1. pip uninstall ninja
  2. pip install ninja
  3. rm -rf ~/.cache/torch_extensions/*

I actually think it is #3 that worked for me. Next time I ran the python code, it reported that it was installing those two extensions, and all went well.
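
For systems without `rm` (or to make the step scriptable), step 3 can be sketched in Python with the standard library. The helper name `clear_extension_cache` is made up for this example, and the caller has to supply the cache directory, since its location differs between platforms.

```python
import os
import shutil


def clear_extension_cache(cache_dir):
    """Delete every entry inside cache_dir and return the removed names."""
    removed = []
    if not os.path.isdir(cache_dir):
        return removed  # nothing to do if the cache does not exist
    for entry in sorted(os.listdir(cache_dir)):
        target = os.path.join(cache_dir, entry)
        if os.path.isdir(target) and not os.path.islink(target):
            shutil.rmtree(target)  # a built plugin's directory
        else:
            os.remove(target)      # a stray file or symlink
        removed.append(entry)
    return removed
```

Calling `clear_extension_cache(os.path.expanduser("~/.cache/torch_extensions"))` would mirror the `rm -rf` line above on Linux. The cache directory itself is kept, and the plugins should be rebuilt on the next run.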

@Feywell

Feywell commented Aug 27, 2021

Remove ~/.cache/torch_extensions/* if you have installed some new version of torch or torch vision or whatever in between 2 run. Re-run train.py will rebuild those plugins.

Took me a couple of hours!

Steve

Thank you! It is the truth.

@colt18

colt18 commented Sep 2, 2021

Summary of steps I carried out that worked

  1. pip uninstall ninja
  2. pip install ninja
  3. rm -rf ~/.cache/torch_extensions/*

I actually think it was step 3 that worked for me. The next time I ran the Python code, it reported that it was installing those two extensions, and all went well.

Can I get the Windows path for "~/.cache/torch_extensions/*"?

@nurpax
Contributor

nurpax commented Sep 2, 2021

Try C:\Users\<username>\AppData\Local\torch_extensions\torch_extensions\Cache.
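
To check which of these folders exists on a given machine, here is a small sketch. It assumes that the `TORCH_EXTENSIONS_DIR` environment variable overrides the default location when set, and that `LOCALAPPDATA` points at the per-user AppData/Local folder on Windows; `candidate_cache_dirs` is a hypothetical helper name.

```python
import os
from pathlib import Path


def candidate_cache_dirs(env=None, home=None):
    """Return places where the compiled plugins may be cached, most specific first.

    Assumptions: TORCH_EXTENSIONS_DIR overrides the default when it is set, and
    LOCALAPPDATA points at the per-user AppData/Local folder on Windows.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    candidates = []
    if env.get("TORCH_EXTENSIONS_DIR"):
        candidates.append(Path(env["TORCH_EXTENSIONS_DIR"]))
    if env.get("LOCALAPPDATA"):
        # Windows location quoted in this thread
        candidates.append(
            Path(env["LOCALAPPDATA"]) / "torch_extensions" / "torch_extensions" / "Cache"
        )
    # Linux location quoted in this thread
    candidates.append(home / ".cache" / "torch_extensions")
    return candidates


for path in candidate_cache_dirs():
    print(path, "(exists)" if path.is_dir() else "(not found)")
```

Whichever folder is reported as existing is the one to empty before re-running.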

@zzningxp

zzningxp commented Jun 10, 2022

My problem: when I train on ONE GPU there are no problems, but when I train on TWO GPUs the errors below appear.
I have tried the methods above, without success.
Ubuntu 18.04
nvcc = 10.1, V10.1.105

Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
/home//stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "/home//stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/home//stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1091, in load
    keep_intermediates=keep_intermediates)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1302, in _jit_compile
    is_standalone=is_standalone)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1378, in _write_ninja_file_and_build_library
    check_compiler_abi_compatibility(compiler)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 282, in check_compiler_abi_compatibility
    if not check_compiler_ok_for_platform(compiler):
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 242, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc()) 

@NickDienemann

NickDienemann commented Sep 20, 2022

Remove ~/.cache/torch_extensions/* if you installed a new version of torch, torchvision, or anything else between two runs.

I have been working with StyleGAN2-ADA for a couple of weeks and everything worked perfectly fine. However, this morning the upfirdn2d_plugin could no longer build the CUDA kernels and got stuck at the "Setting up PyTorch plugin "upfirdn2d_plugin"... " prompt.

Deleting the cache files as Steve @thusinh1969 has proposed fixed the issue, thanks a lot :)

I am still very confused about how this problem can arise in a working setup when nothing has changed. Maybe the build fails with a low probability and, when it fails, it causes subsequent builds to fail as well?
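
One possible explanation, offered as a guess rather than a diagnosis: PyTorch's JIT extension loader guards each build folder with a file named lock, and a run that is killed mid-build can leave that file behind, after which later runs keep waiting on it. The sketch below lists such leftovers; `find_stale_locks` is a hypothetical helper and assumes no build is in progress while it runs.

```python
from pathlib import Path


def find_stale_locks(cache_dir):
    """Return every file named 'lock' below cache_dir.

    Assumption: no training run is building plugins right now, so any lock
    file that is found was left behind by an interrupted build.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.rglob("lock") if p.is_file())


# Usage: list the leftovers, then delete them (or the whole plugin folder).
# for lock in find_stale_locks(Path.home() / ".cache" / "torch_extensions"):
#     print("stale lock:", lock)
```

If this is what happened, deleting the cache removes the lock along with everything else, which would match the fix reported above.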

@MeieiShaw

I do not know whether this will help, but I added my PyTorch version to this function in conv2d_gradfix.py:

def _should_use_custom_op(input):
    assert isinstance(input, torch.Tensor)
    if (not enabled) or (not torch.backends.cudnn.enabled):
        return False
    if input.device.type != 'cuda':
        return False
    # Replace the last entry with the prefix of your own PyTorch version.
    if any(torch.__version__.startswith(x) for x in ['1.7.', '1.8.', '1.9', 'YOUR OWN VERSION HERE']):
        return True
    warnings.warn(f'conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d().')
    return False
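
To see how that prefix test behaves before editing the list, here is a standalone sketch of the same `startswith` logic. `version_is_listed` is a hypothetical name, and the version strings are illustrative only.

```python
def version_is_listed(version, prefixes):
    """Mirror the startswith() test used in _should_use_custom_op."""
    return any(version.startswith(prefix) for prefix in prefixes)


listed = ['1.7.', '1.8.', '1.9']
print(version_is_listed('1.9.1', listed))      # True
print(version_is_listed('1.10.0', listed))     # False, so the warning fires
# A prefix without a trailing dot matches more than intended:
print(version_is_listed('1.10.0', ['1.1']))    # True
# With the dot, only the 1.10 series matches:
print(version_is_listed('1.11.0', ['1.10.']))  # False
```

Note that listing a version only bypasses the check; it does not by itself show that the custom op works correctly on that version.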

@mhgenerate

Hopefully this helps someone; this is how I fixed my error.

First, I deleted the cached plugin folders under torch_extensions/cache: bias_act_plugin and upfirdn2d_plugin.

I had multiple CUDA toolkits in PATH (11.2 and 11.8). I deleted 11.8, ran the code again, and it worked perfectly. It might be different for you, but if you have multiple CUDA paths, that could be the issue.
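
A quick way to check for this situation is to scan PATH for every folder that ships an nvcc binary. This is a sketch under the assumption that each toolkit exposes `nvcc` (or `nvcc.exe` on Windows) in a folder listed in PATH; `find_nvcc_dirs` is a hypothetical helper.

```python
import os
from pathlib import Path


def find_nvcc_dirs(path_value=None):
    """Return every PATH entry that contains an nvcc executable, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        if not entry:
            continue
        folder = Path(entry)
        if any((folder / name).is_file() for name in ("nvcc", "nvcc.exe")):
            if folder not in hits:
                hits.append(folder)
    return hits


found = find_nvcc_dirs()
for folder in found:
    print(folder)
if len(found) > 1:
    print("More than one CUDA toolkit is on PATH; a plain nvcc call resolves to the first one.")
```

If two or more folders are printed, removing the unwanted toolkit from PATH is the change described above.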

OP: #67 (comment)

@philadias

In case it helps someone: in my case I just had to run it twice for things to work.

I ended up here while trying to configure a project that builds upon this repo (https://github.com/voletiv/mcvd-pytorch) and hitting the same old torch_extensions/[...]/upfirdn2d.so: cannot open shared object file: No such file or directory.

In my case, the fix (or maybe more of a workaround?) was to run twice. The first run threw the error, but the .so was actually generated in the folder, so the second run worked fine. Since I'm running on multiple parallel devices (GPUs), my takeaway is that during the first run the lack of synchronization caused some worker to look for the .so file while it was still being generated. From the second run onwards, all workers find it properly.
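
The "run it twice" workaround can be automated with a small retry wrapper. This is only a sketch: `load_with_retry` and `load_plugin` are hypothetical names, and it assumes the failure shows up as an ImportError or OSError raised while the compiled file is still being written by another worker.

```python
import time


def load_with_retry(load_plugin, attempts=2, delay_s=1.0):
    """Call load_plugin(); on ImportError or OSError wait and try again.

    load_plugin is any zero-argument callable that imports the compiled
    extension; both names are hypothetical and not part of the repo.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return load_plugin()
        except (ImportError, OSError) as err:
            last_error = err
            if attempt + 1 < attempts:
                time.sleep(delay_s)  # give the building worker time to finish
    raise last_error
```

A retry only hides the race; building the plugin once in a single process before launching the multi-GPU job avoids it altogether.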

@LiangSylar

Try !pip install ninja==1.10.2 instead of !pip install ninja. I had the same issue before, and pinning the ninja version solved it in my case.
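
To confirm which ninja the current environment actually picks up, a short check like the following may help. `ninja_version` is a hypothetical helper that simply asks the executable found on PATH for its version.

```python
import shutil
import subprocess


def ninja_version():
    """Return the text printed by `ninja --version`, or None if ninja is not on PATH."""
    exe = shutil.which("ninja")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True, check=False)
    return result.stdout.strip() or None


print("ninja:", ninja_version())
```

Run it from the same kernel or shell that launches the training script, since a notebook kernel can see a different PATH than a terminal. PyTorch also offers `torch.utils.cpp_extension.is_ninja_available()` for a yes/no answer.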

@xingyouxin

@nurpax

I have been dealing with the same problem.

when I try to generate it works fine but it is slow and this is my output

Loading networks from "../../Data/ffhq.pkl"...
Generating image for seed 8201 (0/1) ...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

then at the end it does generate an image successfully,

for training and projecting, I get this

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Evaluating metrics...

It then gets stuck at Evaluating metrics... and then the kernel dies.

when I try to project I get this

Loading networks from "..\..\Data\ffhq.pkl"...
Computing W midpoint and stddev using 10000 samples...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metrics/vgg16.pt ... done
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

This continues for a few minutes, then the kernel dies as well.

I hope this helps. I have tried all the solutions proposed in the other open issues and was not able to get this working. I have read on Reddit that other people are having the same problem, and no one is sure what the cause is.

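I ran into the same problem. I am on Linux, and what finally fixed it was:

  1. Completely uninstall everything related to CUDA;
  2. Install the same CUDA version both inside the virtual environment (regular user) and outside it (root user);
  3. Make sure the CUDA version meets the requirements of the GPU; for example, I use an RTX 4090 with CUDA 11.7;
  4. Note: outside the virtual environment, install the NVIDIA driver first (the CUDA version printed by nvidia-smi does not conflict with the CUDA version installed afterwards: one is the driver's CUDA version, the other is the runtime's CUDA version). Any recent driver version that meets the system requirements will do. Then install CUDA, skipping the device component (that is, the NVIDIA driver) during installation. Inside the virtual environment, use the console command from the PyTorch website that matches your CUDA version.

This approach runs successfully for me. As for why "Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!" appears, my understanding is that the CUDA versions inside and outside the virtual environment did not match. Even after CUDA had been cleanly uninstalled outside the environment, the same error was still raised, so code running inside the virtual environment probably still calls into the external CUDA. I therefore took the blunt approach that solves the problem: making sure the external and internal CUDA versions are identical.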
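
The consistency check described above can be sketched in Python. Both function names are hypothetical, the sample string is illustrative rather than a captured log, and the sketch assumes `nvcc --version` prints a "release X.Y" fragment.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from the text printed by `nvcc --version`, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_cuda_version(system_cuda, torch_cuda):
    """True when both version strings are present and agree on major.minor."""
    if not system_cuda or not torch_cuda:
        return False
    return system_cuda.split(".")[:2] == torch_cuda.split(".")[:2]


sample = "Cuda compilation tools, release 11.7, V11.7.64"  # illustrative text
print(parse_nvcc_release(sample))                             # 11.7
print(same_cuda_version(parse_nvcc_release(sample), "11.7"))  # True
print(same_cuda_version(parse_nvcc_release(sample), "12.1"))  # False
```

In practice the first argument would come from running `nvcc --version` and the second from `torch.version.cuda`, which is None on CPU-only builds of PyTorch.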

@yusufbtanriverdi

Speaking from the future: I have this problem with

CUDA 12.1
Windows 10
Python 3.9
PyTorch 2.1.0+cu121

I will change the CUDA version to see if I can make it work.

@meyurtsever

meyurtsever commented Nov 13, 2023

After reverting back to Ubuntu 20.04 LTS, I've managed to make it work without any problem.
Cuda 11.2
Nvidia Driver 460.27.04
torch 1.10.0+cu111
torchvision 0.11.1+cu111
Python 3.7

I also applied the changes in this PR: #197

For installing Nvidia drivers and CUDA, I followed this:
https://yakhyo.medium.com/cuda-11-2-installation-on-ubuntu-20-04-e83f7561ccc1

@aA13142968398

What seems to be happening is that either the extension build somehow fails or the built extension is not able to run somehow. The pytorch code then will try to fallback to a reference implementation that is slower. It looks like this fallback mechanism is not working all too well, as it's trying to build on every invocation. This probably explains why it's so super slow.

I'd prefer if we'd find a real fix for this, of course, but here's one thing you could try. You could force the custom ops to always use the slower reference path. This will be slower but it should work.

I haven't tried this in a while, but I think you can force the reference implementation by editing the below function (and all the other similar _init functions in that folder):

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/torch_utils/ops/bias_act.py#L41

def _init():
    global _inited, _plugin
    if not _inited:
        _inited = True
        sources = ['bias_act.cpp', 'bias_act.cu']
        sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
        try:
            _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
        except:
            warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
    return _plugin is not None

to just:

def _init():
    return False

@SBenkara is your repro on Docker or native installation of PyTorch and CUDA? What about the folks on Reddit?

It works!
Maybe this means "if you can't use it, you close it"

@userd171

warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
F:\CVD-GAN-main\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "F:\CVD-GAN-main\torch_utils\ops\upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "F:\CVD-GAN-main\torch_utils\custom_ops.py", line 111, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "E:\env\anaconda3\envs\cvdgan1\lib\site-packages\torch\utils\cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "E:\env\anaconda3\envs\cvdgan1\lib\site-packages\torch\utils\cpp_extension.py", line 1317, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "E:\env\anaconda3\envs\cvdgan1\lib\site-packages\torch\utils\cpp_extension.py", line 1703, in _import_module_from_library
    return imp.load_module(module_name, file, path, description) # type: ignore
  File "E:\env\anaconda3\envs\cvdgan1\lib\imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "E:\env\anaconda3\envs\cvdgan1\lib\imp.py", line 342, in load_dynamic
    return _load(spec)
  File "", line 702, in _load
  File "", line 657, in _load_unlocked
  File "", line 556, in module_from_spec
  File "", line 1166, in create_module
  File "", line 219, in _call_with_frames_removed
ImportError: DLL load failed while importing upfirdn2d_plugin: The specified module could not be found.

I hope this helps. I have tried all the solutions proposed in the other open issues and was not able to get this working.
@nurpax
