upfirdn2d_plugin Problem #39
Comments
Ok, thanks for filing a separate bug. I'll keep this one open. There are multiple different problems here, and when comments about separate issues get added into the same bug, it gets messy.
Can you give a bit of detail about your project structure? How are you making use of AugmentPipe in your project? Is the unmodified stylegan2-ada-pytorch project working for you? |
Hi @nurpax, disregard the first line; it was written for something else. I updated my issue with more details. |
Just double checking: your version of stylegan2-ada-pytorch is unmodified and it still does not work? If you run it in Docker, does it work then? Most users have no issue when running in Docker, so you should check if that works and report back here. (I understand some people don't like using Docker, but it's good debug info to check whether it works or not.) Clearly one of the key problems with these custom extensions is that when something goes wrong in their build or first use, the error message throws away too much information about what exactly went wrong. |
Yes, correct, I haven't made any changes to it. Just this morning I cleaned my driver and did a fresh install, created a new anaconda env, and downloaded a fresh copy from this repo, but the same problem happens. I don't know why. |
I think you've done this step but I'm adding it here for completeness, even if it may sound like I'm just repeating the same thing over and over. The simplest way to get this error: "Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!" is when ninja is not installed. I'm mentioning this here as installing ninja has fixed it for several people. Also: can you please confirm that it works for you in Docker? |
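(A rough sketch added here, not part of the original comment: a quick way to check whether the ninja build tool is reachable from the Python environment that runs training.)

```python
import subprocess

# Rough check: PyTorch's extension loader needs the 'ninja' tool on PATH to build
# upfirdn2d_plugin and bias_act_plugin.
try:
    out = subprocess.run(['ninja', '--version'], capture_output=True, text=True, check=True)
    print('ninja found, version', out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("ninja not found; try 'pip install ninja' or 'conda install ninja' in this environment")
```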
I actually tried both |
I have been dealing with the same problem. When I try to generate, it works fine but it is slow, and this is my output:
Then at the end it does generate an image successfully. For training and projecting, it gets stuck. When I try to project I get this:
This continues for a few minutes, then the kernel dies as well. I hope this helps. I have tried all the solutions proposed in the other open issues and was not able to get this working. I have read that other people are having the same problem on Reddit and no one is sure what the problem is. |
What seems to be happening is that either the extension build somehow fails or the built extension is not able to run. The PyTorch code then tries to fall back to a reference implementation that is slower. It looks like this fallback mechanism is not working all too well, as it's trying to build on every invocation. This probably explains why it's so super slow. I'd prefer if we found a real fix for this, of course, but here's one thing you could try: force the custom ops to always use the slower reference path. This will be slower but it should work. I haven't tried this in a while, but I think you can force the reference implementation by editing the below function (and all the other similar functions in torch_utils/ops): https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/torch_utils/ops/bias_act.py#L41
to just:
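(The code snippet from the original comment is not preserved here. Below is a minimal sketch of the idea, assuming bias_act() in torch_utils/ops/bias_act.py dispatches between the compiled plugin and a reference path roughly as shown; the change is to short-circuit the dispatch so the reference path is always taken.)

```python
# Sketch only; assumes the upstream structure of torch_utils/ops/bias_act.py.
# Always taking the reference path means the CUDA plugin is never built or used.
def bias_act(x, b=None, dim=1, act='linear', alpha=None, gain=None, clamp=None, impl='cuda'):
    assert isinstance(x, torch.Tensor)
    # Original dispatch (disabled): use the compiled plugin when available.
    # if impl == 'cuda' and x.device.type == 'cuda' and _init():
    #     return _bias_act_cuda(dim=dim, act=act, alpha=alpha, gain=gain, clamp=clamp).apply(x, b)
    # Forced fallback to the slower pure-PyTorch reference implementation:
    return _bias_act_ref(x=x, b=b, dim=dim, act=act, alpha=alpha, gain=gain, clamp=clamp)
```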
@SBenkara is your repro on Docker or a native installation of PyTorch and CUDA? What about the folks on Reddit? |
@nurpax I will try to find the Reddit post and link it, but most people there were using Windows/Linux and I don't remember seeing a Docker-related issue. Any idea why the extension build is failing? Are there any logs I can get that would help? I will make the changes you suggested for now until we fix this issue. |
I have some updates that hopefully can help in pinpointing the problem. I forgot to mention that I was using a Jupyter notebook. I am not sure what difference it makes, but I didn't have any of those issues when I tried using the command line or PyCharm; I just did a pip install and everything started working flawlessly. The problem might be related to either the Jupyter notebook or Anaconda. I created more environments to make sure it was not a problem with my anaconda env, but they all failed. So I made the changes you suggested, and it printed fewer lines of output.
Edit:
|
@SBenkara @DarXT3mpla4 can you try patching your stylegan2-ada-pytorch code as follows:
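(The patch itself did not survive the extraction; here is a sketch of what it might look like, assuming _init() in torch_utils/ops/upfirdn2d.py follows the upstream pattern.)

```python
# Sketch: _init() in torch_utils/ops/upfirdn2d.py (and similarly in bias_act.py) with the
# try/except removed, so a failed build raises with its full traceback instead of being
# swallowed by warnings.warn().
def _init():
    global _inited, _plugin
    if not _inited:
        _inited = True
        sources = ['upfirdn2d.cpp', 'upfirdn2d.cu']
        sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
        _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
    return _plugin is not None
```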
I.e., remove try/excepts from around the custom_ops.get_plugin() call. It looks like some exception info is getting lost with the way try/except is written. For example, if I rename my ninja executable in my anaconda3 dirs and rerun with this change, I get a more informative stacktrace. With some luck, maybe this will reveal some new information about the error you are seeing.
|
This is what I am getting now; also, it just crashes without any output:
|
@SBenkara I guess you left the warnings.warn line there? My patch above had that taken out too. Nevertheless, the error is a little more apparent now (emphasis mine): Error building extension 'upfirdn2d_plugin': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\**v11.0**\bin ... **CUDA 11.0 does not support compiling for compute_86 arch, to build for compute_86, you need CUDA 11.1.** You can see from the above that it's building with the CUDA 11.0 nvcc. Another way to verify which compiler versions and flags are actually used is to check the build.ninja files under your torch_extensions cache directory (e.g. ~/.cache/torch_extensions on Linux). |
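(A small diagnostic sketch added for context: the nvcc that builds the plugin has to support the GPU's compute capability, e.g. compute_86 on RTX 30xx cards needs CUDA 11.1 or newer.)

```python
import torch

# What arch does the installed GPU need, and which CUDA version was this PyTorch wheel built against?
print('PyTorch:', torch.__version__)
print('PyTorch built against CUDA:', torch.version.cuda)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f'GPU compute capability: {major}.{minor} (compute_{major}{minor})')
```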
|
Yes, definitely confirms that 11.0 is being used instead of 11.1. What you will need to do is install the CUDA 11.1 toolkit from NVIDIA and make sure that you set it up so that the 11.1 version comes first in PATH. E.g., try running "nvcc --version" and check that it's the right version. On my computer this reports something like this:
|
my nvcc --version returns
All my environment variables are pointing to cuda_11.1; I don't understand where 11.0 is coming from. I used it before but then switched to 11.1. I deleted the bias_act_plugin/build.ninja and tried again, and indeed it shows 11.0. I will keep you posted. |
@nurpax you were 100% right. Even though my nvcc --version was returning 11.1, somehow 11.0 was being used. I had both versions installed on my computer, but my environment was only pointing to 11.1. Thank you so much! |
I pushed change 2506395 that improves error reporting. Hopefully custom extension build errors get correctly reported now and root causing these problems will be easier. |
Great!
I can’t tell without seeing logs with exception info or build.ninja files for failed attempts. At least in SBenkara’s case, a wrong version of nvcc was chosen. I assume there were multiple CUDA versions in PATH. I don’t know if there are bugs in CUDA tools discovery code in PyTorch. |
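(Another diagnostic sketch added here: checking which toolkit PyTorch's extension builder has discovered. CUDA_HOME is exposed by torch.utils.cpp_extension and reflects the CUDA_HOME/CUDA_PATH environment variables or the nvcc found on PATH.)

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# The custom plugins are compiled with the nvcc under CUDA_HOME as discovered by
# torch.utils.cpp_extension, which can differ from the CUDA version the wheel was built against.
print('PyTorch version    :', torch.__version__)
print('Built against CUDA :', torch.version.cuda)
print('CUDA_HOME          :', CUDA_HOME)
```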
@nurpax Jumping on this thread since I think I'm experiencing something related, hope it's OK... I'm training on Colab, using the following prompt: The dataset contains only 10 photos, so I'm basically trying transfer learning with small data. I encountered the "Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!" issue at first, which I resolved by installing ninja. However, training speed is still super slow, and the issue seems to be in the "Evaluating metrics" part. These are the evaluation stats: I'm running on Colab, CUDA version 11.2, T4 GPU. Thanks in advance! |
@avshalomman Please file separate bugs for separate issues. You can try with --metrics=none; most likely it's computing metrics that's taking a long time for you. Closing this bug as both plugin issues seem to have been resolved. |
Installing gcc on the Linux machine solved the "No module named 'upfirdn2d_plugin'" error for me. Check if you have gcc: |
Solve permission issue of `upfirdn2d_plugin` compilation. See NVlabs/stylegan2-ada-pytorch#39 and pytorch/pytorch@1301384.
It worked with just installing ninja for me. |
I am running on a CentOS platform and got the stylegan2-ada-pytorch notebook to work fine except when it reaches the training stage "python train.py ....". I am getting errors for both bias_act_plugin and upfirdn2d_plugin. I have tried some of the suggestions here but wonder if there is a resolution? Perhaps I am not using the right version of CUDA or PyTorch? My PyTorch is 1.7.1. Here is where the errors and tracebacks begin: Constructing networks... |
Remove ~/.cache/torch_extensions/* if you have installed some new version of torch or torchvision or whatever in between two runs. Re-running train.py will rebuild those plugins. Took me a couple of hours! Steve |
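(A hedged convenience sketch of the same cleanup from Python; TORCH_EXTENSIONS_DIR, if set, overrides the default cache location mentioned above.)

```python
import os
import shutil

# Location of PyTorch's extension build cache; ~/.cache/torch_extensions is the usual default.
cache_dir = os.environ.get('TORCH_EXTENSIONS_DIR') or os.path.expanduser('~/.cache/torch_extensions')
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # stale plugin builds are simply recompiled on the next run
    print('Removed', cache_dir)
else:
    print('No extension cache found at', cache_dir)
```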
Thank you Steve !! |
Simply installing ninja solved this for me. I'm on cuda 11.1. |
Hope this helps someone: I solved this issue by installing nvidia-cuda-toolkit (via apt), removing ninja from my pipenv, and installing it again. After restarting my Jupyter Python kernel, the modules were built. |
Sorry, I'm a complete noob. |
Are you familiar with pip? |
Summary of steps I carried out that worked
I actually think it is #3 that worked for me. The next time I ran the Python code, it reported that it was installing those two extensions, and all went well. |
Thank you! That's exactly it. |
Can I get the Windows path for "~/.cache/torch_extensions/*"? |
Try |
My problem is: when I use ONE GPU to train, there aren't any problems; when I use TWO GPUs to train, these problems appear.
|
I have been working with stylegan2-ada for a couple of weeks and everything worked perfectly fine. However, this morning the upfirdn2d_plugin was not able to build the CUDA kernels anymore and got stuck at the "Setting up PyTorch plugin "upfirdn2d_plugin"... " prompt. Deleting the cache files as Steve @thusinh1969 proposed fixed the issue, thanks a lot :) I am very confused, though, about how this problem arises after having a working implementation and not changing anything. Maybe the build fails with a low probability, and when it fails, it causes subsequent builds to fail as well? |
I do not know if this will help or not: I added my PyTorch version here in the file conv2d_gradfix.py
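(A sketch of what such an edit might look like, assuming conv2d_gradfix.py gates the custom op on a whitelist of PyTorch version prefixes as in the upstream file; the '1.9.' entry below is only an example of an added version.)

```python
# Sketch of torch_utils/ops/conv2d_gradfix.py: extend the version whitelist so the
# custom gradient op is also enabled for your PyTorch version.
def _should_use_custom_op(input):
    assert isinstance(input, torch.Tensor)
    if (not enabled) or (not torch.backends.cudnn.enabled):
        return False
    if input.device.type != 'cuda':
        return False
    if any(torch.__version__.startswith(x) for x in ['1.7.', '1.8.', '1.9.']):  # '1.9.' added as an example
        return True
    warnings.warn(f'conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d().')
    return False
```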
|
Hopefully this can help someone; this is how I fixed my error. First I deleted the cached plugin folders torch_extensions/cache/bias_act_plugin & upfirdn2d_plugin. I had multiple CUDA toolkits in PATH (11.2 and 11.8); I had to delete 11.8, then ran the code again and it worked perfectly. It might be different for you, but if you have multiple paths, that could be the issue. OP: #67 (comment) |
In case it helps someone, in my case I actually just had to run it twice for things to work. I ended up here trying to configure a project that builds upon this repo (https://github.com/voletiv/mcvd-pytorch) and hitting the same old error. In my case, the fix (or maybe more of a workaround?) is that I had to run twice. The first time it would throw the error, but the .so was actually generated in the folder, so when running a second time it actually ran fine. Since I'm running with multiple parallel devices (GPUs), my takeaway is that during the first run the lack of sync led some worker to not find the .so file while it was still being generated. From the second run onwards, all workers are able to find it properly. |
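(If the race-on-first-build theory holds, one hedged workaround sketch is to trigger the plugin builds once in the launcher process before spawning the per-GPU workers; the _init() helpers are assumed to exist as in the upstream ops modules.)

```python
# Hypothetical warm-up: compile/load both plugins once, up front, so no worker races
# another worker for a half-written .so in the extension cache.
from torch_utils.ops import bias_act, upfirdn2d

bias_act._init()    # builds/loads bias_act_plugin if possible
upfirdn2d._init()   # builds/loads upfirdn2d_plugin if possible
```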
Try |
I ran into the same problem. I am on a Linux platform, and the final solution was:
|
Speaking from the future: I have this problem with CUDA 12.1. I will change the CUDA version to see if I can make it work. |
After reverting to Ubuntu 20.04 LTS, I've managed to make it work without any problems. I also applied the changes in this PR: #197. For installing NVIDIA drivers and CUDA, I followed this: |
It works! |
Describe the bug
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Please stop closing people's issues without a confirmed fix for this problem. #2 (comment) does not work, and that issue was closed without a confirmed fix.
Please be serious about it and let's work together on a fix instead of ignoring the problem and referring people to a closed topic that does not offer any solution.
We tried everything proposed; we also tried both CUDA 11.0 and 11.1, with different versions of PyTorch just in case.
We are a team of 5 people, and we all had the same problem on both Windows and Linux machines and even in Google Colab, which tells me that this is more than just a configuration problem.
And no, `%pip install ninja` did not solve the problem on any of the machines we have in our lab. Also, using `verbosity = 'full'` does not seem to include any additional helpful information.
Desktop (please complete the following information):
These are the two machines I used:
Machine 1
Machine 2