Crash in CUDA 11 #698
Comments
If I change Line 660 in 08a2bfe
to print('src shape', src_aux_labels.shape)
minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)
print('minus one shape', minus_one_index.shape)
It prints
You can see that the shape size |
My suspicion is that this is a build-system issue. You tend to get these kinds of mysterious errors when different compilation units have different ideas about the layouts of C++ objects. We have a number of .cu files that include, directly or indirectly, torch.h, and from this Also there may be compilation flags that are mismatched, e.g. we use |
This flag is copied from PyTorch. See Lines 13 to 15 in 08a2bfe
If k2 uses a different flag from what PyTorch is using, then it will have trouble at the link time. The linking stage will fail. |
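One way to look for the kind of flag mismatch discussed above is to diff the two compiler command lines directly. The following is only a rough sketch, not something k2 or PyTorch ships: the helper names `extract_flags` and `diff_flags` are made up here, and the command lines in the usage note are invented for illustration.

```python
import shlex


def extract_flags(command, prefixes=("-D", "-std=")):
    """Return the set of tokens in a compiler command line that start
    with one of the given prefixes (preprocessor defines and the
    language standard by default)."""
    return {tok for tok in shlex.split(command) if tok.startswith(prefixes)}


def diff_flags(command_a, command_b):
    """Return two sorted lists: flags only in command_a and flags only
    in command_b. Flags present in both are ignored."""
    flags_a = extract_flags(command_a)
    flags_b = extract_flags(command_b)
    return sorted(flags_a - flags_b), sorted(flags_b - flags_a)
```

For example, `diff_flags("nvcc -DEXAMPLE_ABI=0 -std=c++14 -c a.cu", "c++ -DEXAMPLE_ABI=1 -std=c++14 -c b.cpp")` reports the two conflicting `-DEXAMPLE_ABI` values and leaves out the shared `-std=c++14`. `EXAMPLE_ABI` is a hypothetical macro, not a flag used by either project.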
Here is what
#pragma once
#include <torch/all.h>
#ifdef TORCH_API_INCLUDE_EXTENSION_H
#include <torch/extension.h>
#endif // defined(TORCH_API_INCLUDE_EXTENSION_H)
And
#pragma once
// All pure C++ headers for the C++ frontend.
#include <torch/all.h>
// Python bindings for the C++ frontend (includes Python.h).
#include <torch/python.h>
k2/k2/python/csrc/CMakeLists.txt Line 9 in 08a2bfe
Currently, that macro is defined only in I am moving it. |
Incidentally, when I use pytorch's setup.py things to compile an example torch extension with C++, the following is what the compilation flags look like. This is just FYI; I don't see anything in there that looks important.
|
This is what I got about half a year ago. Seems that it changes only the build options for Pybind11.
|
This issue in PyTorch has been active recently. In NVIDIA/thrust#1401 it's mentioned that, in certain circumstances, people use a prefix to separate the symbols of cub when two different libraries loaded at the same time both use cub. I checked the symbols from us and torch, and they do declare some identical symbols, e.g. these:
but it's not absolutely clear to me that this is a problem. |
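A check like the one described above can be scripted. The sketch below is an assumption about how one might do it, not the commands actually used in this thread: it parses the text output of GNU `nm -D --defined-only` for two shared libraries and lists the defined symbols that appear in both and contain `cub`. The helper names and any library paths passed to `run_nm` are hypothetical.

```python
import subprocess


def defined_symbols(nm_output):
    """Parse `nm` text output and return the set of defined symbol names.

    Defined symbols appear as three whitespace-separated fields
    (address, type letter, name); undefined ones have no address and
    are skipped. Names are assumed to be mangled (no `-C`), so they
    contain no spaces."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3:
            symbols.add(fields[2])
    return symbols


def shared_symbols(nm_a, nm_b, needle="cub"):
    """Return the sorted symbols defined in both outputs whose name
    contains `needle`."""
    common = defined_symbols(nm_a) & defined_symbols(nm_b)
    return sorted(name for name in common if needle in name)


def run_nm(library_path):
    """Run GNU nm on a shared library and return its text output."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", library_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `shared_symbols(run_nm(path_to_k2_lib), run_nm(path_to_torch_lib))` with the two library paths filled in gives the overlapping names. A non-empty result only shows that both libraries export the same symbol; it does not by itself prove that the duplicate causes the crash.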
Yes, that is the key. Fixed in #699 |
The following minimal demo will crash for CUDA 11 + Python 3.9 + PyTorch 1.7.1 + latest k2 (for both Debug & Release build)
When it runs with cuda-memcheck, the process just hangs and seems never to terminate.
When running with cuda-memcheck, it prints no errors after the crash.
The reason for the crash is similar to the one mentioned in #696 (comment)
Note that k2.closure uses torch.nonzero internally.
k2/k2/python/k2/fsa_algo.py
Line 660 in 08a2bfe
The crash in snowfall also uses torch.nonzero
https://github.com/k2-fsa/snowfall/blob/4a909a3a609d5a3444b14fc40d779f217e1263c1/egs/librispeech/asr/simple_v1/mmi_bigram_train.py#L62
It will not crash for the same code built with CUDA 10.1
Crash log
Demo