ninja: build stopped: subcommand failed. #2

Closed
Msadat97 opened this issue Oct 11, 2021 · 17 comments

Comments

Msadat97 commented Oct 11, 2021

Dear Authors,

I get the following errors when running the code using the stylegan3-t config:

Setting up PyTorch plugin "filtered_lrelu_plugin"... Failed!
Traceback (most recent call last):
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/utils/cpp_extension.py", line 1666, in _run_ninja_build
    subprocess.run(
  File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/cluster/home/user/development/lib64/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/cluster/home/user/development/lib64/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/cluster/home/user/development/lib64/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/cluster/home/user/development/lib64/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 96, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/cluster/home/user/gan/stylegan3/training/training_loop.py", line 168, in training_loop
    img = misc.print_module_summary(G, [z, c])
  File "/cluster/home/user/gan/stylegan3/torch_utils/misc.py", line 216, in print_module_summary
    outputs = module(*inputs)
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/cluster/home/user/gan/stylegan3/training/networks_stylegan3.py", line 512, in forward
    img = self.synthesis(ws, update_emas=update_emas, **synthesis_kwargs)
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/cluster/home/user/gan/stylegan3/training/networks_stylegan3.py", line 471, in forward
    x = getattr(self, name)(x, w, **layer_kwargs)
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/cluster/home/user/gan/stylegan3/training/networks_stylegan3.py", line 355, in forward
    x = filtered_lrelu.filtered_lrelu(x=x, fu=self.up_filter, fd=self.down_filter, b=self.bias.to(x.dtype),
  File "/cluster/home/user/gan/stylegan3/torch_utils/ops/filtered_lrelu.py", line 114, in filtered_lrelu
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/cluster/home/user/gan/stylegan3/torch_utils/ops/filtered_lrelu.py", line 26, in _init
    _plugin = custom_ops.get_plugin(
  File "/cluster/home/user/gan/stylegan3/torch_utils/custom_ops.py", line 136, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, build_directory=cached_build_dir,
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/utils/cpp_extension.py", line 1080, in load
    return _jit_compile(
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/utils/cpp_extension.py", line 1293, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/utils/cpp_extension.py", line 1405, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/cluster/home/user/development/lib64/python3.8/site-packages/torch/utils/cpp_extension.py", line 1682, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'filtered_lrelu_plugin': [1/5] /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/bin/g++ -MMD -MF filtered_lrelu.o.d -DTORCH_EXTENSION_NAME=filtered_lrelu_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/TH -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/THC -isystem /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/include -isystem /cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp -o filtered_lrelu.o 
FAILED: filtered_lrelu.o 
/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/bin/g++ -MMD -MF filtered_lrelu.o.d -DTORCH_EXTENSION_NAME=filtered_lrelu_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/TH -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/THC -isystem /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/include -isystem /cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp -o filtered_lrelu.o 
In file included from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/ATen/ATen.h:13:0,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
                 from /cluster/home/user/development/lib/python3.8/site-packages/torch/include/torch/extension.h:4,
                 from /cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp:9:
/cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp: In lambda function:
/cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp:149:12: error: expected ‘(’ before ‘constexpr’
         if constexpr (sizeof(scalar_t) <= 4) // Exclude doubles. constexpr prevents template instantiation.
            ^
/cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp: In lambda function:
/cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp:149:12: error: expected ‘(’ before ‘constexpr’
         if constexpr (sizeof(scalar_t) <= 4) // Exclude doubles. constexpr prevents template instantiation.
            ^
/cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp: In lambda function:
/cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu.cpp:149:12: error: expected ‘(’ before ‘constexpr’
         if constexpr (sizeof(scalar_t) <= 4) // Exclude doubles. constexpr prevents template instantiation.
            ^
[2/5] /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/bin/nvcc  -ccbin /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/bin/gcc -DTORCH_EXTENSION_NAME=filtered_lrelu_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/TH -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/THC -isystem /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/include -isystem /cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu_ns.cu -o filtered_lrelu_ns.cuda.o 
[3/5] /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/bin/nvcc  -ccbin /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/bin/gcc -DTORCH_EXTENSION_NAME=filtered_lrelu_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/TH -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/THC -isystem /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/include -isystem /cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu_rd.cu -o filtered_lrelu_rd.cuda.o 
[4/5] /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/bin/nvcc  -ccbin /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/bin/gcc -DTORCH_EXTENSION_NAME=filtered_lrelu_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/TH -isystem /cluster/home/user/development/lib64/python3.8/site-packages/torch/include/THC -isystem /cluster/apps/gcc-6.3.0/cuda-11.1.1-s2fmzfqahrfvezvmg4tslqqedhl3bggv/include -isystem /cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /cluster/home/user/.cache/torch_extensions/filtered_lrelu_plugin/0c8207121140d174807b17c24d32b436-tesla-v100-sxm2-32gb/filtered_lrelu_wr.cu -o filtered_lrelu_wr.cuda.o 
ninja: build stopped: subcommand failed.

The exact command that I'm using is:

python train.py --outdir=./training-runs --data=$WORKDIR/data.zip --cfg=stylegan3-t --gpus=2 --batch=32 --gamma=8.2

System Details:

  • OS: CentOS Linux release 7.9.2009
  • PyTorch version: 1.9.1
  • CUDA toolkit version: 11.1.1
  • NVIDIA driver version: 450.80.02
  • GPU: V100
  • GCC version: 6.3

Thank you for your help in advance.
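The log above is long, but the actionable part is the handful of GCC diagnostics buried in it. As a minimal sketch (not part of the StyleGAN3 code base), the following pulls GCC-style `file:line:col: error:` lines out of a captured build log and de-duplicates them. The function name `extract_compiler_errors`, the regular expression, and the `/tmp/ext/...` path in the sample are assumptions made for illustration; the sample text is only modelled on the log in this report.

```python
import re

# Matches GCC-style diagnostics such as "path/file.cpp:149:12: error: message".
_DIAG = re.compile(
    r"^(?P<file>[^\s:][^:]*):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$"
)

def extract_compiler_errors(build_log):
    """Return unique (file, line, message) tuples, in order of first appearance."""
    seen = []
    for raw in build_log.splitlines():
        match = _DIAG.match(raw.strip())
        if match is None:
            continue
        entry = (match.group("file"), int(match.group("line")), match.group("msg"))
        if entry not in seen:  # the same error is often printed once per instantiation
            seen.append(entry)
    return seen

# Hypothetical, shortened log used only to exercise the function.
sample_log = """\
FAILED: filtered_lrelu.o
/tmp/ext/filtered_lrelu.cpp: In lambda function:
/tmp/ext/filtered_lrelu.cpp:149:12: error: expected '(' before 'constexpr'
/tmp/ext/filtered_lrelu.cpp: In lambda function:
/tmp/ext/filtered_lrelu.cpp:149:12: error: expected '(' before 'constexpr'
ninja: build stopped: subcommand failed.
"""

errors = extract_compiler_errors(sample_log)
for path, line, msg in errors:
    print(f"{path}:{line}: {msg}")
```

In practice the input would be the text of the `RuntimeError` raised by `torch.utils.cpp_extension.load`, which embeds the ninja output as shown in the traceback.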

aicrumb commented Oct 12, 2021

I had the same error and fixed it with:

pip install ninja
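A missing Ninja is only one possible cause of this message, so it can help to rule it out before reinstalling anything. The sketch below checks whether a `ninja` executable is on `PATH` and asks it for its version. The helper name `find_ninja` is an assumption for illustration; it is not something the StyleGAN3 code provides.

```python
import shutil
import subprocess

def find_ninja():
    """Return (path, version) for the ninja found on PATH, or (None, None)."""
    path = shutil.which("ninja")
    if path is None:
        return None, None
    # `ninja --version` prints just the version string and exits.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return path, (result.stdout.strip() or None)

path, version = find_ninja()
if path is None:
    print("ninja not found on PATH; `pip install ninja` is one way to get it")
else:
    print(f"ninja {version} at {path}")
```

If this reports a working Ninja and the build still stops, the cause is more likely in the compiler output further up the log.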

Contributor

nurpax commented Oct 12, 2021

It looks like OP's build goes further than that since it's already running nvcc. Most likely they have Ninja installed.

I don't know if I've hit this exact same problem myself, but my guess is that it's failing to compile due to a too-old GCC version (maybe a lack of adequate support for constexpr?).

This SO post suggests GCC 6.x might have some trouble with constexpr-if expressions. See if it works on GCC 7.x?

This comment from the stylegan2-ada-pytorch issues may be helpful. (Although the comment says the minimum GCC version is 6, which you seem to have.) Cross-referencing here anyway in case it helps.

BTW: in this release we created a troubleshooting doc that we hope to expand as we come across different types of problems with stylegan3. I'm hoping that this will grow into a useful resource for diagnosing and fixing problems with custom op compiles.

@Msadat97
Author

I can confirm that using GCC 8.2 solves the problem. Maybe you should add the minimum required GCC version to the readme as well?

Contributor

nurpax commented Oct 12, 2021

Yes, I will. I just need to first figure out what's the minimum required version. Apparently 6.x is too old. Did you by any chance try any GCC 7.x versions?

@leesky1c

It works with GCC 7.3.0.
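The reports so far (6.3 fails; 7.3.0 and 8.2 work) are consistent with `if constexpr` being a C++17 feature that GCC first implemented in its 7.x series. A minimal pre-flight check, assuming `g++` on `PATH` is the compiler PyTorch will pick up, could look like the sketch below. The names `parse_version`, `supports_if_constexpr`, and `detect_gxx_version` are hypothetical, and passing this check does not guarantee that the extension builds.

```python
import re
import shutil
import subprocess

MIN_GCC_MAJOR = 7  # first GCC major version that accepts `if constexpr`

def parse_version(text):
    """Parse '6.3.0', '7.3' or '11' into a tuple of ints."""
    match = re.match(r"\s*(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)

def supports_if_constexpr(version):
    """Compare on the major component only, since -dumpversion may omit the rest."""
    return version[0] >= MIN_GCC_MAJOR

def detect_gxx_version(compiler="g++"):
    """Return the version tuple reported by `g++ -dumpversion`, or None."""
    path = shutil.which(compiler)
    if path is None:
        return None
    out = subprocess.run(
        [path, "-dumpversion"], capture_output=True, text=True, check=False
    ).stdout
    try:
        return parse_version(out)
    except ValueError:
        return None

found = detect_gxx_version()
if found is None:
    print("g++ not found on PATH")
elif supports_if_constexpr(found):
    print("g++", ".".join(map(str, found)), "should accept `if constexpr`")
else:
    print("g++", ".".join(map(str, found)), "is older than GCC 7")
```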

ckyleda commented Oct 12, 2021

These "on the fly compiled modules" are nothing but trouble and should be replaced/removed, as they prevent easy replication of results.

Getting this working requires a (potentially difficult, or time-consuming) setup of C compilers that should not be necessary. Not everyone is lucky enough to have sysadmin control of clusters of V100 GPUs.

The code should be architecture-independent.

@StoneCypher

@ckyleda - feel free to contribute a dockerfile

@YeHaijia

same

@jannehellsten
Contributor

These "on the fly compiled modules" are nothing but trouble and should be replaced/removed

We are aware of the difficulties that arise from the use of PyTorch custom extensions. Just like everyone else, we don't like the problems they bring, but the performance benefits are too great for us to forego these optimizations.

In the case of StyleGAN3, our CUDA kernels improve end-to-end training speed by roughly 10x and also reduce the memory footprint very considerably. See Appendix D in our paper for additional details.

In our past projects, the custom extension improvements have been less pronounced but with StyleGAN3 the difference in speed and memory footprint is so large that it’s quite impractical to train this model without them.

(FWIW: We have also explored creating prebuilt binary wheels for these extensions but AFAICT the extension API is not stable enough between PyTorch releases to make this work, leading to even harder to diagnose problems.)

duskvirkus pushed a commit to duskvirkus/stylegan3 that referenced this issue Oct 16, 2021
@chinasilva

You can try updating ['ninja', '-v'] to ['ninja', '-V'] or ['ninja', '--version'].

un1tz3r0 referenced this issue in un1tz3r0/stylegan3 Dec 10, 2021
@mahorton

On Windows, I solved this problem by upgrading from Visual Studio 2017 to 2019.

SenhorUnk commented Dec 28, 2022

I am trying to deploy the NVLabs Superpixel Sampling Network (https://github.com/NVlabs/ssn_superpixels) as a Nuclio (serverless) function to the image annotation tool CVAT. Nuclio creates a Docker container with my code inside. Building the container works perfectly fine, but when initializing my model handler class inside the container, the PyTorch "pair_wise_distance" C++ extension fails with the "ninja: build stopped: subcommand failed." error.

Do you have any further ideas? I am kind of stuck and have already gone through many different suggestions. Thanks in advance!

My container is based on a "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime" image.

I am installing the following GCC components into my container:
apt-get install -y gcc-8
apt-get install -y g++

Edit:
I solved the problem by using nvidia/cuda:11.6.0-devel-ubuntu20.04 as the base image of my Nuclio function. Make sure you have the same CUDA version installed on the host. The "devel" version of the image includes the NVIDIA CUDA compiler driver (nvcc), which was missing previously. I didn't have to do any additional GCC installation.
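Since the missing piece here was `nvcc`, a quick way to tell a "runtime" image from a "devel" one is to look for the compiler before attempting the build. The sketch below assumes a Linux layout (`<CUDA root>/bin/nvcc`) and checks the `CUDA_HOME` and `CUDA_PATH` environment variables before falling back to `PATH`. The helper name `find_nvcc` is hypothetical.

```python
import os
import shutil

def find_nvcc(env=None):
    """Look for nvcc under CUDA_HOME / CUDA_PATH first, then on PATH."""
    env = os.environ if env is None else env
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root:
            # Assumed Linux layout: the compiler lives in <root>/bin/nvcc.
            candidate = os.path.join(root, "bin", "nvcc")
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return shutil.which("nvcc", path=env.get("PATH"))

nvcc = find_nvcc()
if nvcc is None:
    print("nvcc not found; CUDA extensions cannot be compiled in this environment")
else:
    print(f"nvcc found at {nvcc}")
```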

octadion commented Apr 3, 2023

I'm getting this error, and my GCC version is 9.4.0. Is there any solution?

Author

Msadat97 commented Apr 3, 2023

Do you have ninja installed?

octadion commented Apr 3, 2023

Yes, I have ninja installed. I now know what's wrong with mine: it looks like my CUDA installation doesn't have runtime_api, which is why the ninja build stopped. I tried reinstalling torch+cuda, but it didn't work.
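Assuming "runtime_api" refers to the CUDA toolkit header `cuda_runtime_api.h` (an interpretation, since the comment does not name the file), the sketch below checks whether that header exists under a given CUDA root. The function name `find_cuda_header`, the fallback root `/usr/local/cuda`, and the two include subdirectories searched are assumptions based on common Linux installs.

```python
import os

# Include directories commonly seen in Linux CUDA toolkit installs (assumed, not exhaustive).
_INCLUDE_SUBDIRS = (
    "include",
    os.path.join("targets", "x86_64-linux", "include"),
)

def find_cuda_header(cuda_home, header="cuda_runtime_api.h"):
    """Return the full path to `header` under `cuda_home`, or None if absent."""
    for subdir in _INCLUDE_SUBDIRS:
        candidate = os.path.join(cuda_home, subdir, header)
        if os.path.isfile(candidate):
            return candidate
    return None

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")  # assumed default location
header_path = find_cuda_header(cuda_home)
if header_path is None:
    print(f"cuda_runtime_api.h not found under {cuda_home}; a full CUDA toolkit may be needed")
else:
    print(f"found {header_path}")
```

Reinstalling the PyTorch package alone would not be expected to add this header, because the pip and conda PyTorch builds ship the CUDA runtime libraries rather than the full toolkit.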

@rohit7044

You can try updating ['ninja', '-v'] to ['ninja', '-V'] or ['ninja', '--version'].

I have been facing this issue for a long time (2 days, tbh). I tried the suggestion above, but it doesn't work, and even if it did, it would not be the right fix.
The problem lies with dependency issues. Since it worked on Linux, I believe Linux is much safer and easier to deal with.
Final verdict: use Linux. It gives you far fewer issues to deal with.
My configuration:
RTX 4090
CUDA 11.8
gcc 12+
Windows 11

@hujiaodigua

Same error with gcc-9.5; gcc-11.4 passes.
