Crash in CUDA 11 #698

Closed
csukuangfj opened this issue Mar 26, 2021 · 8 comments · Fixed by #699

Comments

@csukuangfj
Collaborator

csukuangfj commented Mar 26, 2021

The following minimal demo crashes with CUDA 11 + Python 3.9 + PyTorch 1.7.1 + the latest k2 (in both Debug and Release builds).

When it runs with cuda-memcheck, the process just hangs and seems never to terminate.

When running with cuda-memcheck, it prints no errors after the crash.

The reason for the crash is similar to the one mentioned in #696 (comment)

Note that k2.closure uses torch.nonzero internally.

minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)

The crashing code in snowfall also uses torch.nonzero: https://github.com/k2-fsa/snowfall/blob/4a909a3a609d5a3444b14fc40d779f217e1263c1/egs/librispeech/asr/simple_v1/mmi_bigram_train.py#L62

finite_indexes = torch.nonzero(mask).squeeze(1)

The same code does not crash when built with CUDA 10.1.

Crash log

Traceback (most recent call last):
  File "/ceph-fj/open-source/k2/build_debug_11/./ab.py", line 23, in <module>
    main()
  File "/ceph-fj/open-source/k2/build_debug_11/./ab.py", line 19, in main
    ans = k2.closure(fsa)
  File "/root/fangjun/open-source/k2/k2/python/k2/fsa_algo.py", line 677, in closure
    new_value = fix_aux_labels(value, fsa.arcs.row_splits(1), arc_map)
  File "/root/fangjun/open-source/k2/k2/python/k2/fsa_algo.py", line 663, in fix_aux_labels
    minus_one_index[minus_one_index > src_start_state_last_arc_index] += 1
RuntimeError: invalid shape dimension -16711680

Demo

#!/usr/bin/env python3

import torch
import k2


def main():
    device = torch.device('cuda', 0)

    s = '''
        0 1 1 0.1
        1 2 2 0.2
        2 3 -1 0.3
        3
    '''
    fsa = k2.Fsa.from_str(s).to(device).requires_grad_(True)

    fsa.aux_labels = torch.tensor([10, 20, -1], dtype=torch.int32).to(device)
    ans = k2.closure(fsa)


if __name__ == '__main__':
    main()
@csukuangfj
Collaborator Author

If I change

minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)

to

        print('src shape', src_aux_labels.shape)
        minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)
        print('minus one shape', minus_one_index.shape)

It prints

src shape torch.Size([3])
minus one shape torch.Size([16843009, 1])

Note that the reported size, 16843009 rows, is far too large for a 3-element input.
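As a side check on the two numbers above, they can be viewed as raw 32-bit patterns. This is only an illustrative sketch: the helper name `as_u32_hex` is hypothetical, and treating the values as 32-bit integers is an assumption, not something established in this thread.

```python
def as_u32_hex(value: int) -> str:
    """Format an integer as its 32-bit two's-complement hex pattern."""
    return f"0x{value & 0xFFFFFFFF:08X}"


# Row count printed for torch.nonzero on the 3-element input.
bogus_rows = 16843009
# Value from "RuntimeError: invalid shape dimension -16711680".
bogus_dim = -16711680

print(as_u32_hex(bogus_rows))  # 0x01010101
print(as_u32_hex(bogus_dim))   # 0xFF010000
```

Both values look like byte patterns rather than plausible element counts, which would be consistent with the count being read from memory that does not hold what the caller expects. That reading is a guess, though.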

@danpovey
Collaborator

My suspicion is that this is a build-system issue. You tend to get these kinds of mysterious errors when different compilation units have different ideas about the layouts of C++ objects.

We have a number of .cu files that include, directly or indirectly, torch.h, and from this
https://pytorch.org/tutorials/advanced/cpp_extension.html
it's not clear to me whether you are supposed to include torch.h in CUDA code
(search for torch.h in that page).

Also there may be compilation flags that are mismatched, e.g. we use
-D_GLIBCXX_USE_CXX11_ABI=0 and I'm not sure how Torch was compiled.

@csukuangfj
Collaborator Author

csukuangfj commented Mar 26, 2021

> Also there may be compilation flags that are mismatched, e.g. we use
> -D_GLIBCXX_USE_CXX11_ABI=0 and I'm not sure how Torch was compiled.

This flag is copied from PyTorch. See

k2/cmake/torch.cmake

Lines 13 to 15 in 08a2bfe

# set the global CMAKE_CXX_FLAGS so that
# k2 uses the same abi flag as PyTorch
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

If k2 used a different flag from the one PyTorch uses, the mismatch would show up at link time: the linking stage would fail.
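One way to double-check that two builds agree on this flag is to pull it out of their compiler command lines. A minimal sketch, assuming the flags are available as plain strings; `cxx11_abi_flag` is a hypothetical helper (not part of k2 or PyTorch) and the two flag strings are made up for illustration.

```python
import re
import shlex
from typing import Optional


def cxx11_abi_flag(command_line: str) -> Optional[str]:
    """Return the value of -D_GLIBCXX_USE_CXX11_ABI, or None if it is absent."""
    value = None
    for token in shlex.split(command_line):
        match = re.fullmatch(r"-D_GLIBCXX_USE_CXX11_ABI=(\d+)", token)
        if match:
            value = match.group(1)  # keep the last occurrence on the line
    return value


# Hypothetical flag strings, for illustration only.
k2_flags = "-O2 -fPIC -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14"
torch_flags = "-D_GLIBCXX_USE_CXX11_ABI=0"

print(cxx11_abi_flag(k2_flags) == cxx11_abi_flag(torch_flags))
```

In practice the two strings would come from the real build logs, or from the contents of `TORCH_CXX_FLAGS` as used in the CMake snippet above.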

@csukuangfj
Collaborator Author

> We have a number of .cu files that include, directly or indirectly, torch.h, and from this
> https://pytorch.org/tutorials/advanced/cpp_extension.html
> it's not clear to me whether you are supposed to include torch.h in CUDA code
> (search for torch.h in that page).

Here is what torch.h looks like:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/torch.h

  
#pragma once

#include <torch/all.h>

#ifdef TORCH_API_INCLUDE_EXTENSION_H
#include <torch/extension.h>

#endif // defined(TORCH_API_INCLUDE_EXTENSION_H)

And extension.h:
https://github.com/pytorch/pytorch/blob/master/torch/extension.h

#pragma once

// All pure C++ headers for the C++ frontend.
#include <torch/all.h>
// Python bindings for the C++ frontend (includes Python.h).
#include <torch/python.h>

In k2, TORCH_API_INCLUDE_EXTENSION_H is defined here:

add_definitions(-DTORCH_API_INCLUDE_EXTENSION_H)

Currently, that macro is defined only in k2/python/csrc. It should have been defined in the top-level CMakeLists.txt; I am moving it there.

@danpovey
Collaborator

Incidentally, when I use PyTorch's setup.py machinery to compile an example Torch C++ extension, the compilation flags look like the following. This is just FYI; I don't see anything in there that looks important.

/usr/local/cuda/bin/nvcc \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include/torch/csrc/api/include \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include/TH \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include/THC \
  -I/usr/local/cuda/include \
  -I/root/fangjun/open-source/pyenv/versions/3.9.0/include/python3.9 \
  -c lltm_cuda_kernel.cu \
  -o build/temp.linux-x86_64-3.9/lltm_cuda_kernel.o \
  -D__CUDA_NO_HALF_OPERATORS__ \
  -D__CUDA_NO_HALF_CONVERSIONS__ \
  -D__CUDA_NO_HALF2_OPERATORS__ \
  --expt-relaxed-constexpr \
  --compiler-options '-fPIC' \
  -DTORCH_API_INCLUDE_EXTENSION_H \
  -DPYBIND11_COMPILER_TYPE="_gcc" \
  -DPYBIND11_STDLIB="_libstdcpp" \
  -DPYBIND11_BUILD_ABI="_cxxabi1011" \
  -DTORCH_EXTENSION_NAME=lltm_cuda \
  -D_GLIBCXX_USE_CXX11_ABI=0 \
  -gencode=arch=compute_70,code=sm_70 \
  -std=c++14

@csukuangfj
Collaborator Author

This is what I got about half a year ago. Compared with it, only the pybind11 build options seem to have changed.

hello world (cpu)
-----------------

.. code-block:: cpp
  :caption: hello.cc

  #include <torch/extension.h>

  torch::Tensor sigmoid(torch::Tensor z) {
    auto s = torch::sigmoid(z);
    return (1 - s) * s;
  }

  PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("sigmoid", &sigmoid, "sigmoid test");
  }

.. code-block:: python
  :caption: setup.py

  from setuptools import setup, Extension
  from torch.utils import cpp_extension

  setup(name='hello',
        ext_modules=[cpp_extension.CppExtension('hello', ['hello.cc'])],
        cmdclass={'build_ext': cpp_extension.BuildExtension.with_options(use_ninja=False)})

The output of ``python setup.py build``::

    running build
    running build_ext
    building 'hello' extension
    creating build
    creating build/temp.linux-x86_64-3.7
    gcc -pthread -Wno-unused-result -Wsign-compare \
      -DNDEBUG -g -fwrapv -O3 -Wall -fPIC \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include/TH \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include/THC \
      -I/xxx/include/python3.7m \
      -c hello.cc \
      -o build/temp.linux-x86_64-3.7/hello.o \
      -DTORCH_API_INCLUDE_EXTENSION_H \
      -DTORCH_EXTENSION_NAME=hello \
      -D_GLIBCXX_USE_CXX11_ABI=0 \
      -std=c++14

    g++ \
      -pthread \
      -shared \
      -L/xxx/lib \
      build/temp.linux-x86_64-3.7/hello.o \
      -L/xxx/py37/lib/python3.7/site-packages/torch/lib \
      -lc10 \
      -ltorch \
      -ltorch_cpu \
      -ltorch_python \
      -o build/lib.linux-x86_64-3.7/hello.cpython-37m-x86_64-linux-gnu.so

hello world (cuda)
------------------

Change ``setup.py``. Replace ``cpp_extension.CppExtension`` with ``cpp_extension.CUDAExtension``.

The output of ``python setup.py build`` additionally contains::

    -I/usr/local/cuda/include


    -L/usr/local/cuda/lib64 \
    -lcudart \
    -lc10_cuda \
    -ltorch_cuda

@danpovey
Collaborator

This PyTorch issue has been active recently:
pytorch/pytorch#54245
It notes certain problems with thrust and cub in CUDA 11. The implementation of torch.nonzero
https://github.com/pytorch/pytorch/blob/f6634be4c2b72e0d8da46d5992facb59b55a90bc/aten/src/ATen/native/cuda/Indexing.cu#L861
does use cub (and also thrust, but only when the result has ndim > 1, which is not the case in our
minimal example).

NVIDIA/thrust#1401 mentions that, in certain circumstances, people use a prefix to keep cub's symbols separate when two libraries that both use cub are loaded at the same time. I checked the symbols from k2 and torch, and they do declare some identical symbols, e.g. these:

000000000370e618 u _ZGVZN3cub22DeviceCountCachedValueEvE5cache
000000000370ec20 u _ZGVZN3cub26GetPerDeviceAttributeCacheINS_18PtxVersionCacheTagEEERNS_23PerDeviceAttributeCacheEvE5cache

but it's not absolutely clear to me that this is a problem.
It's possible that the issue is caused by both torch and k2 using cub...
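The symbol comparison described above can be sketched as a small script over `nm`-style output. This is an illustration under assumptions: `cub_symbols` is a hypothetical helper, matching on the substring `3cub` (the Itanium mangling of the `cub` namespace) is a heuristic, and the second listing with its addresses is made up.

```python
from typing import Set


def cub_symbols(nm_output: str) -> Set[str]:
    """Collect mangled names from nm-style output that mention the cub namespace."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[-1]  # nm prints the symbol name in the last column
        if "3cub" in name:  # heuristic match on the mangled namespace `cub`
            names.add(name)
    return names


# The two symbols quoted above.
k2_nm = """
000000000370e618 u _ZGVZN3cub22DeviceCountCachedValueEvE5cache
000000000370ec20 u _ZGVZN3cub26GetPerDeviceAttributeCacheINS_18PtxVersionCacheTagEEERNS_23PerDeviceAttributeCacheEvE5cache
"""

# Hypothetical listing for a second library; addresses are invented.
other_nm = """
0000000000001000 u _ZGVZN3cub22DeviceCountCachedValueEvE5cache
0000000000002000 T some_unrelated_function
"""

shared = cub_symbols(k2_nm) & cub_symbols(other_nm)
print(sorted(shared))
```

The `u` in the second column is nm's marker for a GNU unique global symbol, which the dynamic linker resolves to a single definition per process. That may be why two libraries carrying their own copies of cub can interfere with each other, although this thread does not confirm the exact mechanism.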

csukuangfj added a commit to csukuangfj/k2 that referenced this issue Mar 27, 2021
@csukuangfj
Collaborator Author

> but it's not absolutely clear to me that this is a problem.

Yes, that is the key.

Fixed in #699

danpovey pushed a commit that referenced this issue Mar 31, 2021
… is set (#699)

* Free CUDA memory in a correct way when PYTORCH_NO_CUDA_MEMORY_CACHING is set

* fix a typo.

* add more comments.

* Fix after review.

* Fix typos.

* fix typos.

* Fix crashes in CUDA11 due to CUB.

See #698

* Fix typos.