Crash "invalid device function" when running snowfall recipe #696

Closed
RuABraun opened this issue Mar 24, 2021 · 30 comments

Comments

@RuABraun

Hi,
I want to run snowfall's librispeech recipe. I'm getting a crash when training starts that seems to have to do with k2:

# python -m torch.distributed.launch --nproc_per_node=1 ./mmi_bigram_train.py --world_size 1
World size: 1 Rank: 0                                                                                                                                                                                     
[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<k2::RaggedShape (*)(const k2::RaggedShape&, int), k2::Unsqueeze, 1>, int*, int>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (98 vs. 0)  Error: invalid device function.
                                                        
                                     
[ Stack-Trace: ]                  
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7fdc7e038054]                                                                                                    
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x2a) [0x7fdc7e1ab3ca]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<k2::RaggedShape (*)(k2::RaggedShape const&, int), &k2::Unsqueeze, 1u>, int*, int> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<k2::RaggedShape (*)(k2::RaggedShape const&, int), &k2::Unsqueeze, 1u>, int*, int>&)+0x28a) [0x7fdc7e325a3a]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::Unsqueeze(k2::RaggedShape const&, int)+0x74e) [0x7fdc7e303c0e]                                                                                   
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::FsaToFsaVec(k2::Ragged<k2::Arc> const&)+0x1fc) [0x7fdc7e1eceec]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x50) [0x7fdc7e1efdd0]                                                                         
/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x46217) [0x7fdc8589d217]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x1c3d6) [0x7fdc858733d6]                                                               
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(PyCFunction_Call+0x56) [0x56134dff8f76]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyObject_MakeTpCall+0x22f) [0x56134dfb685f]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyEval_EvalFrameDefault+0x4596) [0x56134e03df56]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x18c11a) [0x56134e00511a]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyObject_GenericGetAttrWithDict+0x135) [0x56134dfb7f65]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x17e733) [0x56134dff7733]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyEval_EvalFrameDefault+0x96c) [0x56134e03a32c]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x18bc0b) [0x56134e004c0b]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x10077f) [0x56134df7977f]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyEval_EvalCodeWithName+0x2d2) [0x56134e003a92]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyFunction_Vectorcall+0x1e3) [0x56134e004943]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyObject_FastCallDict+0x24b) [0x56134e0054cb]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyObject_Call_Prepend+0x63) [0x56134e005733]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x18c8ca) [0x56134e0058ca]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyObject_MakeTpCall+0x1a4) [0x56134dfb67d4]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyEval_EvalFrameDefault+0x11d0) [0x56134e03ab90]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyFunction_Vectorcall+0x10b) [0x56134e00486b]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0xfeb84) [0x56134df77b84]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(_PyEval_EvalCodeWithName+0x2d2) [0x56134e003a92]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(PyEval_EvalCodeEx+0x44) [0x56134e004754]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(PyEval_EvalCode+0x1c) [0x56134e092edc]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x219f84) [0x56134e092f84]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x24c1f4) [0x56134e0c51f4]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(PyRun_FileExFlags+0xa1) [0x56134df8d6e1]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(PyRun_SimpleFileExFlags+0x3b4) [0x56134df8dac6]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x11598b) [0x56134df8e98b]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(Py_BytesMain+0x39) [0x56134e0c7d19]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fde4986d09b]
/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python(+0x1dee93) [0x56134e057e93]

Traceback (most recent call last):
  File "./mmi_bigram_train.py", line 558, in <module>
    main()
  File "./mmi_bigram_train.py", line 338, in main
    graph_compiler = MmiTrainingGraphCompiler(
  File "/remote/idiap.svm/temp.speech01/rbraun/code/snowfall/snowfall/training/mmi_graph.py", line 84, in __init__
    self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
    self.properties
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
    properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
Traceback (most recent call last):
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/idiap/temp/rbraun/programs/anaconda3/envs/speech/bin/python', '-u', './mmi_bigram_train.py', '--local_rank=0', '--world_size', '1']' returned non-zero exit status 1.

This is the output of pytorch's collect_env:

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: 7.0.1-8+deb10u2 (tags/RELEASE_701/final)
CMake version: version 3.13.4

Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: 11.1.74
GPU models and configuration: GPU 0: GeForce GT 1030
Nvidia driver version: 455.32.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.1.1               h6406543_8    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py38he904b0f_0
[conda] mkl_fft                   1.3.0            py38h54f3939_0
[conda] mkl_random                1.1.1            py38h0573a6f_0
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch                   1.8.0           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchaudio                0.8.0                      py38    pytorch
[conda] torchvision               0.9.0                py38_cu111    pytorch

The GPU is an RTX 3090. I built k2 in release mode as described in the documentation; the version information is:

(speech) rbraun@italix29:/idiap/temp/rbraun/work$ python -m pip freeze | grep k2
k2 @ file:///remote/idiap.svm/temp.speech01/rbraun/code/k2/dist/k2-0.3.3%2Bcu111.dev20210323-cp38-cp38-linux_x86_64.whl
-e git+https://github.com/k2-fsa/snowfall.git@4a909a3a609d5a3444b14fc40d779f217e1263c1#egg=snowfall

Maybe I should restrict myself to CUDA 10.2?

@csukuangfj
Collaborator

Could you post the output of

python3 -m k2.version

@RuABraun
Author

(speech) rbraun@italix29:/idiap/temp/rbraun/code/snowfall/egs/librispeech/asr/simple_v1$ python -m k2.version
Collecting environment information...

k2 version: 0.3.3
Build type: Release
Git SHA1: 1efe57fc9ef6b88ce435f2e2123624b18fd7ffef
Git date: Tue Mar 23 09:49:54 2021
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.0
Python version used to build k2: 3.8
OS used to build k2: Debian GNU/Linux 10 (buster)
CMake version: 3.13.4
GCC version: 8.3.0
CMAKE_CUDA_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0
PyTorch version used to build k2: 1.8.0
PyTorch is using Cuda: 11.1
NVTX enabled: True
Disable debug: True
Sync kernels : False
Disable checks: False

@csukuangfj
Collaborator

csukuangfj commented Mar 24, 2021


Does appending 86 after 75 at line 113 in CMakeLists.txt and recompiling k2 solve the problem?
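
For context on why this suggestion fits: the CMAKE_CUDA_FLAGS above contain -gencode entries only up to sm_75, while the RTX 3090 mentioned earlier has compute capability 8.6. Each entry has the form arch=compute_XX,code=sm_XX, which embeds machine code for that one architecture and no PTX that the driver could JIT-compile, so a newer GPU finds no kernel image to run and CUDA reports error 98, "invalid device function". A minimal sketch of that check follows; the helper names are hypothetical and not part of k2, and it only looks for an exact architecture match.

```python
import re
from typing import Set, Tuple

# Abbreviated from the CMAKE_CUDA_FLAGS line printed by `python -m k2.version` above.
CMAKE_CUDA_FLAGS = (
    "-gencode arch=compute_35,code=sm_35 "
    "-gencode arch=compute_50,code=sm_50 "
    "-gencode arch=compute_60,code=sm_60 "
    "-gencode arch=compute_61,code=sm_61 "
    "-gencode arch=compute_70,code=sm_70 "
    "-gencode arch=compute_75,code=sm_75"
)


def compiled_sm_archs(cuda_flags: str) -> Set[int]:
    """Return the sm_XX numbers that the build produced machine code for."""
    return {int(sm) for sm in re.findall(r"code=sm_(\d+)", cuda_flags)}


def has_native_kernels(cuda_flags: str, capability: Tuple[int, int]) -> bool:
    """Hypothetical helper: True if the build targets exactly this capability."""
    major, minor = capability
    return major * 10 + minor in compiled_sm_archs(cuda_flags)


if __name__ == "__main__":
    print(sorted(compiled_sm_archs(CMAKE_CUDA_FLAGS)))
    print(has_native_kernels(CMAKE_CUDA_FLAGS, (8, 6)))  # RTX 3090
```

On a machine where the GPU is visible to PyTorch, the capability tuple can be taken from torch.cuda.get_device_capability() instead of being typed in by hand.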

@RuABraun
Author

Ah thanks, I should have thought of that. That seems to have been it; I'm getting a different error now.

@csukuangfj
Collaborator

> Ah thanks, I should have thought of that. That seems to have been it; I'm getting a different error now.

What's the new error?

@RuABraun
Author

Got several different exceptions but I think I can figure them out (one was about file in use, another GPU OOM). I'll make a comment if I get stuck. :)

@RuABraun
Author

RuABraun commented Mar 24, 2021

Okay I don't know what is happening 😆

Traceback (most recent call last):
Traceback (most recent call last):
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 672, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 670, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 672, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 670, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs000000000410c84600007123'
OSError: [Errno 16] Device or resource busy: '.nfs000000000410c84400007122'
Traceback (most recent call last):
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 672, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/shutil.py", line 670, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs000000000410c84800007124'
Traceback (most recent call last):
  File "./mmi_bigram_train.py", line 558, in <module>
    main()
  File "./mmi_bigram_train.py", line 486, in main
    objf, valid_objf, global_batch_idx_train = train_one_epoch(
  File "./mmi_bigram_train.py", line 255, in train_one_epoch
    total_objf / total_frames, total_frames,
ZeroDivisionError: float division by zero

The files in exp/data are nonzero size so I can't see an obvious reason why there is a ZeroDivisionError. No idea about the file in use.

@pzelasko
Contributor

I think that if total_frames is zero, something went wrong during the loss computation (ok_frames was zero), and the .nfs errors seem to have happened after this error. You might want to double-check that the batch looks reasonable (in a Jupyter notebook you can run from lhotse.dataset.vis import plot_batch and then plot_batch(batch)); if the data is OK, then I guess @danpovey and @csukuangfj can better help you with the loss part.
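
The ZeroDivisionError in the traceback above is raised at the very end of train_one_epoch, which hides the earlier problem that total_frames never grew above zero. A small sketch of a guard that fails with a more descriptive message; safe_average_objf is a hypothetical helper for illustration, not a function from snowfall.

```python
def safe_average_objf(total_objf: float, total_frames: float) -> float:
    """Average the objective per frame, failing loudly if no frames were counted."""
    if total_frames == 0:
        # Every batch contributed zero usable frames, so the data or the
        # loss computation needs inspecting rather than the division itself.
        raise RuntimeError(
            "total_frames is 0: no sequence contributed frames to the objective; "
            "inspect the batches and the loss computation"
        )
    return total_objf / total_frames


if __name__ == "__main__":
    print(safe_average_objf(-10.0, 4))
```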

@RuABraun
Author

So previously I was running via SGE. I tried without it and got an OOM error. I want to reduce the batch size (and confirm that it runs fine without SGE), but I can't figure out how to do that, @pzelasko. I tried changing max_frames in train_sampler = BucketingSampler(.., max_frames=) to 20000 (halving it), but that results in an error:

Traceback (most recent call last):
  File "./mmi_bigram_train.py", line 559, in <module>
    main()
  File "./mmi_bigram_train.py", line 487, in main
    objf, valid_objf, global_batch_idx_train = train_one_epoch(
  File "./mmi_bigram_train.py", line 232, in train_one_epoch
    curr_batch_objf, curr_batch_frames, curr_batch_all_frames = get_objf(
  File "./mmi_bigram_train.py", line 146, in get_objf
    all_frames) = get_tot_objf_and_num_frames(tot_scores,
  File "./mmi_bigram_train.py", line 71, in get_tot_objf_and_num_frames
    ok_frames = frames_per_seq[finite_indexes].sum()
IndexError: index 2203318223360 is out of bounds for dimension 0 with size 25

Don't know what else I could do to reduce it, since the batch_size argument of the DataLoader is currently None...

@pzelasko
Contributor

You figured it out just fine :) The batch_size is dynamic and controlled directly via max_frames/max_duration in the samplers (they are actually "batch samplers"). I think the issue lies somewhere in the loss computation. Is this CUDA 11? I had issues with that once; I remember that using CUDA 10.2 instead fixed them for me.
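
To illustrate what "dynamic batch size" means here, the sketch below groups utterances until a frame budget is reached, so lowering max_frames yields smaller batches. This is an illustrative stand-in under simplified assumptions, not lhotse's actual sampler code; the function name and the example frame counts are made up.

```python
from typing import Iterable, Iterator, List


def batches_by_max_frames(
    num_frames: Iterable[int], max_frames: int
) -> Iterator[List[int]]:
    """Yield lists of utterance indices whose frame counts sum to at most max_frames.

    An utterance longer than max_frames still gets a batch of its own.
    """
    batch: List[int] = []
    total = 0
    for idx, n in enumerate(num_frames):
        # Close the current batch if adding this utterance would exceed the budget.
        if batch and total + n > max_frames:
            yield batch
            batch, total = [], 0
        batch.append(idx)
        total += n
    if batch:
        yield batch


if __name__ == "__main__":
    lengths = [1000, 1500, 800, 1200]
    print(list(batches_by_max_frames(lengths, 2500)))
    print(list(batches_by_max_frames(lengths, 1250)))  # halved budget, smaller batches
```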

@pzelasko
Contributor

Actually I might have run into the same issue as you -- not 100% sure tho, it was some time ago.

@RuABraun
Author

Yes it's CUDA 11. Thanks for the tip I'll change to 10.2 then!

@danpovey
Collaborator

danpovey commented Mar 25, 2021 via email

@RuABraun
Author

RuABraun commented Mar 25, 2021

I tried using 10.2, here's the k2 version info:

(ten) rbraun@italix29:/idiap/temp/rbraun/code$ python -m k2.version
Collecting environment information...

k2 version: 0.3.3
Build type: Release
Git SHA1: 1efe57fc9ef6b88ce435f2e2123624b18fd7ffef
Git date: Tue Mar 23 09:49:54 2021
Cuda used to build k2: 10.2
cuDNN used to build k2: 8.0.0
Python version used to build k2: 3.8
OS used to build k2: Debian GNU/Linux 10 (buster)
CMake version: 3.13.4
GCC version: 8.3.0
CMAKE_CUDA_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas                 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0
PyTorch version used to build k2: 1.8.0
PyTorch is using Cuda: 10.2
NVTX enabled: True
Disable debug: True
Sync kernels : False
Disable checks: False

But I got a different error:

[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1>, char*, const char>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0)  Error: an illegal memory access was encountered.


[ Stack-Trace: ]
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7f4ab354d084]                                                          
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x2a) [0x7f4ab36be78a]                                                     
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1u>, char*, char const> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1u>, char*, char const>&)+0x16c) [0x7f4ab370e65c]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x1138) [0x7f4ab3700728]        
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x85) [0x7f4ab3700de5]                               
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x4eba7) [0x7f4aba81eba7]                        
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x23cd5) [0x7f4aba7f3cd5]                        
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(PyCFunction_Call+0x58) [0x55db6b8592d8]                                                               
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_MakeTpCall+0x23c) [0x55db6b848edc]                                                          
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalFrameDefault+0x45a9) [0x55db6b8d4879]                                                     
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x19ad6a) [0x55db6b89fd6a]                                                                           
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_GenericGetAttrWithDict+0xfb) [0x55db6b785beb]                                               
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x15c09d) [0x55db6b86109d]                                                                           
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalFrameDefault+0x9be) [0x55db6b8d0c8e]                                                      
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x19a85b) [0x55db6b89f85b]                                                                           
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x1039bd) [0x55db6b8089bd]                                                                           
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalCodeWithName+0x300) [0x55db6b89e760]                                                      
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyFunction_Vectorcall+0x1e3) [0x55db6b89f593]                                                        
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_Call_Prepend+0x291) [0x55db6b8a0161]                                                        
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x19b48a) [0x55db6b8a048a]                                                                           
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_MakeTpCall+0x1ae) [0x55db6b848e4e]                                                          
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalFrameDefault+0x11dd) [0x55db6b8d14ad]                                                     
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyFunction_Vectorcall+0x10b) [0x55db6b89f4bb]                                                        
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x10425f) [0x55db6b80925f]                                                                           
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalCodeWithName+0x300) [0x55db6b89e760]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(PyEval_EvalCode+0x23) [0x55db6b9334e3]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x22e584) [0x55db6b933584]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x2547c4) [0x55db6b9597c4]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x115620) [0x55db6b81a620]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(PyRun_SimpleFileExFlags+0x384) [0x55db6b81d362]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x118e80) [0x55db6b81de80]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(Py_BytesMain+0x39) [0x55db6b95c979]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f4b3d62c09b]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x1e7185) [0x55db6b8ec185]

Traceback (most recent call last):
  File "./mmi_bigram_train.py", line 559, in <module>
    main()
  File "./mmi_bigram_train.py", line 339, in main
    graph_compiler = MmiTrainingGraphCompiler(
  File "/remote/idiap.svm/temp.speech01/rbraun/code/snowfall/snowfall/training/mmi_graph.py", line 84, in __init__
    self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
    self.properties
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
    properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
Traceback (most recent call last):
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python', '-u', './mmi_bigram_train.py', '--local_rank=0', '--world_size', '1']' returned non-zero exit status 1.
Killing subprocess 4431
# Accounting: time=28 threads=1
# Finished at Wed Mar 24 21:04:14 CET 2021 with status 1

Running on a 1080ti
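
A note on reading this log: CUDA kernel launches are asynchronous, so error 700 can be reported by a later call than the kernel that actually faulted, and the frame named in the message above is not necessarily the culprit. One common way to narrow this down is the standard CUDA_LAUNCH_BLOCKING environment variable, which makes launches synchronous. A minimal sketch, assuming it runs at the top of the training script before anything initializes CUDA; the helper name is hypothetical.

```python
import os


def enable_blocking_launches() -> None:
    """Ask the CUDA runtime to run kernel launches synchronously.

    Must be called before the process creates a CUDA context, i.e. before
    the first CUDA call made through torch or k2.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# Import torch / k2 and start training only after this point.
```

The same effect can be had by exporting the variable in the shell that launches the script; training will run slower while it is set.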

@danpovey
Collaborator

danpovey commented Mar 25, 2021 via email

@RuABraun
Author

Yeah that was with, just tried without:

# python ./mmi_bigram_train.py --world_size 1 
[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1>, char*, const char>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0)  Error: an illegal memory access was encountered.


[ Stack-Trace: ]
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7efb7920a084]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x2a) [0x7efb7937b78a]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1u>, char*, char const> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1u>, char*, char const>&)+0x16c) [0x7efb793cb65c]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x1138) [0x7efb793bd728]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x85) [0x7efb793bdde5]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x4eba7) [0x7efb804dbba7]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x23cd5) [0x7efb804b0cd5]
python(PyCFunction_Call+0x58) [0x562aae5382d8]
python(_PyObject_MakeTpCall+0x23c) [0x562aae527edc]
python(_PyEval_EvalFrameDefault+0x45a9) [0x562aae5b3879]
python(+0x19ad6a) [0x562aae57ed6a]
python(_PyObject_GenericGetAttrWithDict+0xfb) [0x562aae464beb]
python(+0x15c09d) [0x562aae54009d]
python(_PyEval_EvalFrameDefault+0x9be) [0x562aae5afc8e]
python(+0x19a85b) [0x562aae57e85b]
python(+0x1039bd) [0x562aae4e79bd]
python(_PyEval_EvalCodeWithName+0x300) [0x562aae57d760]
python(_PyFunction_Vectorcall+0x1e3) [0x562aae57e593]
python(_PyObject_Call_Prepend+0x291) [0x562aae57f161]
python(+0x19b48a) [0x562aae57f48a]
python(_PyObject_MakeTpCall+0x1ae) [0x562aae527e4e]
python(_PyEval_EvalFrameDefault+0x11dd) [0x562aae5b04ad]
python(_PyFunction_Vectorcall+0x10b) [0x562aae57e4bb]
python(+0x10425f) [0x562aae4e825f]
python(_PyEval_EvalCodeWithName+0x300) [0x562aae57d760]
python(PyEval_EvalCode+0x23) [0x562aae6124e3]
python(+0x22e584) [0x562aae612584]
python(+0x2547c4) [0x562aae6387c4]
python(+0x115620) [0x562aae4f9620]
python(PyRun_SimpleFileExFlags+0x384) [0x562aae4fc362]
python(+0x118e80) [0x562aae4fce80]
python(Py_BytesMain+0x39) [0x562aae63b979]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7efc032e909b]
python(+0x1e7185) [0x562aae5cb185]

World size: 1 Rank: 0
Traceback (most recent call last):
  File "./mmi_bigram_train.py", line 559, in <module>
    main()
  File "./mmi_bigram_train.py", line 339, in main
    graph_compiler = MmiTrainingGraphCompiler(
  File "/remote/idiap.svm/temp.speech01/rbraun/code/snowfall/snowfall/training/mmi_graph.py", line 84, in __init__
    self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
    self.properties
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
    properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
# Accounting: time=23 threads=1
# Finished at Thu Mar 25 09:32:09 CET 2021 with status 1

@danpovey
Collaborator

danpovey commented Mar 25, 2021 via email

@danpovey
Collaborator

danpovey commented Mar 25, 2021 via email

@danpovey
Collaborator

danpovey commented Mar 25, 2021 via email

@RuABraun
Author

Some extra info

I compiled without release mode, ran the tests, and got a few timeout failures. I don't have a GPU available locally (unfortunately, the GPU nodes where I am do not have cmake). Could that be the issue, since the tests time out?

97% tests passed, 3 tests failed out of 88

Total Test time (real) = 1511.02 sec

The following tests FAILED:
         53 - get_forward_scores_test_py (Timeout)
         54 - get_tot_scores_test_py (Timeout)
         62 - intersect_test_py (Timeout)
Errors while running CTest

This is what I get while running with CUDA_VISIBLE_DEVICES=0 and no release mode:

(ten) rbraun@vgne038:/idiap/temp/rbraun/code/snowfall/egs/librispeech/asr/simple_v1$ CUDA_VISIBLE_DEVICES=0 python mmi_bigram_train.py --world_size 1   [62/2813]
World size: 1 Rank: 0                                  
[F] /idiap/temp/rbraun/code/k2/k2/csrc/array.h:275:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int] Check failed: ret == cudaSuccess (700
 vs. 0)  Error: an illegal memory access was encountered.
                                  
                                                      
[ Stack-Trace: ]                                        
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x7fb5f9193edd]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x35) [0x7fb5f94d29f3]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::Array1<int>::operator[](int) const+0x56c) [0x7fb5f94d7166]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::Array1<int>::Back() const+0x130) [0x7fb5f94d4a10]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RaggedShape::Validate(bool) const+0x244) [0x7fb5f96392b0]                                               
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RaggedShape::Check()+0x1e) [0x7fb5f957e1ce]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RaggedShape::RaggedShape(std::vector<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> > const&
, bool)+0x57) [0x7fb5f957e18b]                        
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RemoveAxis(k2::RaggedShape&, int)+0x4e1) [0x7fb5f964ad5e]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x40b) [0x7fb5f95160c2]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x7f) [0x7fb5f95168f3]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb4409) [0x7fb60090a409]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb054d) [0x7fb60090654d]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xabf80) [0x7fb600901f80]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xac08f) [0x7fb60090208f]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x514e2) [0x7fb6008a74e2]                         
python(PyCFunction_Call+0x58) [0x560b83d5c2d8]                                                                                                                  
python(_PyObject_MakeTpCall+0x23c) [0x560b83d4bedc]
python(_PyEval_EvalFrameDefault+0x45a9) [0x560b83dd7879]
python(+0x19ad6a) [0x560b83da2d6a]
python(_PyObject_GenericGetAttrWithDict+0xfb) [0x560b83c88beb]
python(+0x15c09d) [0x560b83d6409d]
python(_PyEval_EvalFrameDefault+0x9be) [0x560b83dd3c8e]
python(+0x19a85b) [0x560b83da285b]
python(+0x1039bd) [0x560b83d0b9bd]
python(_PyEval_EvalCodeWithName+0x300) [0x560b83da1760]
python(_PyFunction_Vectorcall+0x1e3) [0x560b83da2593]
python(_PyObject_Call_Prepend+0x291) [0x560b83da3161]
python(+0x19b48a) [0x560b83da348a]
python(_PyObject_MakeTpCall+0x1ae) [0x560b83d4be4e]
python(_PyEval_EvalFrameDefault+0x11dd) [0x560b83dd44ad]
python(_PyFunction_Vectorcall+0x10b) [0x560b83da24bb]
python(+0x10425f) [0x560b83d0c25f]
python(_PyEval_EvalCodeWithName+0x300) [0x560b83da1760]
python(PyEval_EvalCode+0x23) [0x560b83e364e3]
python(+0x22e584) [0x560b83e36584]                                                                                                                               
python(+0x2547c4) [0x560b83e5c7c4]                       
python(+0x115620) [0x560b83d1d620]
python(PyRun_SimpleFileExFlags+0x384) [0x560b83d20362]
python(+0x118e80) [0x560b83d20e80]                      
python(Py_BytesMain+0x39) [0x560b83e5f979]                                                            
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fb6837d309b]                                   
python(+0x1e7185) [0x560b83def185]                                                                                 
                                                                                                          
[F] /idiap/temp/rbraun/code/k2/k2/csrc/context.cu:91:void k2::ParallelRunnerActive::Finish() Check failed: ret == cudaSuccess (700 vs. 0)  Error: an illegal memo
ry access was encountered.                                                                          
                                                                                                                                                                 
                                                      
[ Stack-Trace: ]                                                                                                  
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x7fb5f9193edd]                                                 
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x35) [0x7fb5f94d29f3]                      
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::ParallelRunnerActive::Finish()+0x9ac) [0x7fb5f94fb290]                         
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::ParallelRunnerActive::~ParallelRunnerActive()+0x18) [0x7fb5f951c76a]           
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RaggedShape::Validate(bool) const+0x1187) [0x7fb5f963a1f3]                     
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RaggedShape::Check()+0x1e) [0x7fb5f957e1ce]                                    
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RaggedShape::RaggedShape(std::vector<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> > const&
, bool)+0x57) [0x7fb5f957e18b]                                                                                                                                  
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::RemoveAxis(k2::RaggedShape&, int)+0x4e1) [0x7fb5f964ad5e]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x40b) [0x7fb5f95160c2]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x7f) [0x7fb5f95168f3]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb4409) [0x7fb60090a409]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb054d) [0x7fb60090654d]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xabf80) [0x7fb600901f80]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xac08f) [0x7fb60090208f]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x514e2) [0x7fb6008a74e2]
python(PyCFunction_Call+0x58) [0x560b83d5c2d8]         
python(_PyObject_MakeTpCall+0x23c) [0x560b83d4bedc]  
python(_PyEval_EvalFrameDefault+0x45a9) [0x560b83dd7879]
python(+0x19ad6a) [0x560b83da2d6a]
python(_PyObject_GenericGetAttrWithDict+0xfb) [0x560b83c88beb]
python(+0x15c09d) [0x560b83d6409d]                      
python(_PyEval_EvalFrameDefault+0x9be) [0x560b83dd3c8e]
python(+0x19a85b) [0x560b83da285b]
python(+0x1039bd) [0x560b83d0b9bd]                     
python(_PyEval_EvalCodeWithName+0x300) [0x560b83da1760]
python(_PyFunction_Vectorcall+0x1e3) [0x560b83da2593]
python(_PyObject_Call_Prepend+0x291) [0x560b83da3161]
python(+0x19b48a) [0x560b83da348a]
python(_PyObject_MakeTpCall+0x1ae) [0x560b83d4be4e]   
python(_PyEval_EvalFrameDefault+0x11dd) [0x560b83dd44ad]
python(_PyFunction_Vectorcall+0x10b) [0x560b83da24bb]
python(+0x10425f) [0x560b83d0c25f]                                      
python(_PyEval_EvalCodeWithName+0x300) [0x560b83da1760]
python(PyEval_EvalCode+0x23) [0x560b83e364e3]
python(+0x22e584) [0x560b83e36584]                                                                                                                               
python(+0x2547c4) [0x560b83e5c7c4]
python(+0x115620) [0x560b83d1d620]
python(PyRun_SimpleFileExFlags+0x384) [0x560b83d20362]
python(+0x118e80) [0x560b83d20e80]
python(Py_BytesMain+0x39) [0x560b83e5f979]                                                            
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fb6837d309b]                                   
python(+0x1e7185) [0x560b83def185]                                                                             
                                                                                                                             
terminate called after throwing an instance of 'std::runtime_error'                                                
  what():  Some bad things happed.                                                                  
Aborted 
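
Since the build happened on a machine without a GPU, one thing worth ruling out is a mismatch between the CUDA toolkit PyTorch was built against and the GPU the job actually runs on. The CUDA runtime reports "invalid device function" when a kernel is not available for the current device, and a build/run architecture mismatch is a common cause. The following is a minimal sketch, assuming PyTorch is importable in the same environment; the helper name `collect_cuda_info` is hypothetical, and it does not inspect the k2 build itself.

```python
def collect_cuda_info():
    """Gather basic CUDA facts as seen by PyTorch (hypothetical helper)."""
    info = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        # Nothing more to report without PyTorch.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA toolkit version PyTorch was built with; None for CPU-only builds.
    info["torch_cuda_version"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        major, minor = torch.cuda.get_device_capability(0)
        info["compute_capability"] = f"{major}.{minor}"
    return info


for key, value in collect_cuda_info().items():
    print(f"{key}: {value}")
```

Running this on the GPU node (not the build host) and comparing the reported CUDA version and compute capability with the settings used when building k2 may help narrow things down.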

@RuABraun
Copy link
Author

When running with cuda-memcheck I get a bunch of "Invalid __shared__ read of size 4" errors. I'm not sure whether I should paste the whole log; it is rather massive.

@danpovey
Copy link
Collaborator

danpovey commented Mar 25, 2021 via email

@RuABraun
Copy link
Author

RuABraun commented Mar 25, 2021

sample

========= Invalid __shared__ read of size 4                                                                                                                      
=========     at 0x00000db8 in _ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi
128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEEZNS_13transform_lbsIS5_ZNS_13transform_lbsIS5_ZNS_19load_balance_searchIS5_PKiPiEEviT0_iT1_RNS_9context_tEEUliiiE_S
H_JEEEvSJ_iSK_iSM_DpT2_EUliiiNS_5tupleIJEEEE_SH_SR_JEEEvSJ_iSK_iSO_SM_DpT3_EUliiE_JEEEvSJ_iDpSK_                                                                 
=========     by thread (64,0,0) in block (3,0,0)                                                                                                                
=========     Address 0x00000e08 is out of bounds                                                                                                                
=========     Saved host backtrace up to driver entry point at kernel launch time                                                                                
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5b88]                                                                  
=========     Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 [0x101cb]                                                                                       
=========     Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 (cudaLaunchKernel + 0x1b5) [0x53765]                                                            
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40b669]                                                                        
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40adae]                                                                         
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40ade2]                                                                         
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128E
Li11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiR
NS_9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiS
G_iNS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iDpT1_ + 0x1b) [0x40bf7d]                      
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu10cta_launchINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11EL
i9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_9co
ntext_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_iNS_
5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iSJ_DpT1_ + 0x14c) [0x40cd60]                       
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu13cta_transformINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi1
1ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_
9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_i
NS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iSJ_DpT1_ + 0x83) [0x40c96f]                      
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu13transform_lbsINS_7empty_tE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvS2_IS3_IP
FviPKiiPiRNS_9context_tEEXadL_ZNS_19load_balance_searchIS1_S5_S6_EEviT0_iT1_S8_EELj1EEJS6_EEiS5_iS8_EXadL_ZNS_13transform_lbsIS1_SF_S5_JEEEvSC_iSD_iS8_DpT2_EELj1
EEJSF_EES5_NS_5tupleIJEEEJEEEvSC_iSD_iT2_S8_DpT3_ + 0xf8) [0x40bc09]

sample 2 (they all look the same)

========= Invalid __shared__ read of size 4                                                                                                                     
=========     at 0x00000db8 in _ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi
128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEEZNS_13transform_lbsIS5_ZNS_13transform_lbsIS5_ZNS_19load_balance_searchIS5_PKiPiEEviT0_iT1_RNS_9context_tEEUliiiE_S
H_JEEEvSJ_iSK_iSM_DpT2_EUliiiNS_5tupleIJEEEE_SH_SR_JEEEvSJ_iSK_iSO_SM_DpT3_EUliiE_JEEEvSJ_iDpSK_                                                                 
=========     by thread (37,0,0) in block (2,0,0)                                                                                                                
=========     Address 0x00000e08 is out of bounds                                                                                                                
=========     Saved host backtrace up to driver entry point at kernel launch time                                                                                
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5b88]                                                                  
=========     Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 [0x101cb]                                                                                       
=========     Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 (cudaLaunchKernel + 0x1b5) [0x53765]                                                            
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40b669]                                                                        
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40adae]                                                                         
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40ade2]                                                                         
=========     Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128E
Li11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiR
NS_9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiS
G_iNS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iDpT1_ + 0x1b) [0x40bf7d]

final output

[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2:
:Ragged<k2::Arc>&, k2::Array1<int>*, int*), k2::GetFsaVecBasicProperties, 1>, k2::Arc*, const int*, const int*, const int*, const int*, char*, int, int*>; cudaSt
ream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (719 vs. 0)  Error: unspecified launch failure.                                             
                                                                                                                                                                 
                                                                                                                                                                 
[ Stack-Trace: ]                                                                                                                                                 
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x7f0348134edd]                                                           
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x35) [0x7f03484739f3]                                                      
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k
2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int, int*> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<
void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int,
 int*>&)+0x354) [0x7f03484c44d7]                                                                                                                                 
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<std::shared_ptr<k2::Context>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>
&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int, int*> >(std::shared_ptr<k2:
:Context>, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, i
nt const*, int const*, int const*, char*, int, int*>&)+0x42) [0x7f03484bfe64]                                                                                    
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x3de) [0x7f03484b7095]          
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x7f) [0x7f03484b78f3]                                
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb4409) [0x7f034f8ab409]                         
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb054d) [0x7f034f8a754d]                         
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xabf80) [0x7f034f8a2f80]                         
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xac08f) [0x7f034f8a308f]                         
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x514e2) [0x7f034f8484e2]                         
python(PyCFunction_Call+0x58) [0x55a6e2a402d8]                                                                                                                   
python(_PyObject_MakeTpCall+0x23c) [0x55a6e2a2fedc]                                                                                                              
python(_PyEval_EvalFrameDefault+0x45a9) [0x55a6e2abb879]                                                                                                         
python(+0x19ad6a) [0x55a6e2a86d6a]
python(_PyObject_GenericGetAttrWithDict+0xfb) [0x55a6e296cbeb]
python(+0x15c09d) [0x55a6e2a4809d]
python(_PyEval_EvalFrameDefault+0x9be) [0x55a6e2ab7c8e]
python(_PyEval_EvalCodeWithName+0x300) [0x55a6e2a85760]
python(_PyFunction_Vectorcall+0x1e3) [0x55a6e2a86593]
python(_PyObject_Call_Prepend+0x291) [0x55a6e2a87161]
python(+0x19b48a) [0x55a6e2a8748a]
python(_PyObject_MakeTpCall+0x1ae) [0x55a6e2a2fe4e]
python(_PyEval_EvalFrameDefault+0x11dd) [0x55a6e2ab84ad]
python(_PyFunction_Vectorcall+0x10b) [0x55a6e2a864bb]
python(+0x10425f) [0x55a6e29f025f]
python(_PyEval_EvalCodeWithName+0x300) [0x55a6e2a85760]
python(PyEval_EvalCode+0x23) [0x55a6e2b1a4e3]
python(+0x22e584) [0x55a6e2b1a584]
python(+0x2547c4) [0x55a6e2b407c4]
python(+0x115620) [0x55a6e2a01620]
python(PyRun_SimpleFileExFlags+0x384) [0x55a6e2a04362]
python(+0x118e80) [0x55a6e2a04e80]
python(Py_BytesMain+0x39) [0x55a6e2b43979]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f040dbd609b]
python(+0x1e7185) [0x55a6e2ad3185]

Traceback (most recent call last):
  File "mmi_bigram_train.py", line 559, in <module>
    main()
  File "mmi_bigram_train.py", line 339, in main
    graph_compiler = MmiTrainingGraphCompiler(
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/snowfall/training/mmi_graph.py", line 84, in __init__
    self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
    self.properties
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
    properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
========= ERROR SUMMARY: 192 errors
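
Because the full cuda-memcheck log is massive and the entries all look alike, it can be easier to share a tally than the raw output. The following is a rough sketch, assuming the log has the line layout shown in the samples above; the function name `summarize_memcheck` is hypothetical, and the regular expressions only cover the "Invalid ... read/write of size N" entries.

```python
import re
from collections import Counter

# Header lines such as "========= Invalid __shared__ read of size 4".
ERROR_RE = re.compile(r"^=+\s+(Invalid \S+ (?:read|write) of size \d+)\s*$")
# Lines such as "=========     by thread (64,0,0) in block (3,0,0)".
THREAD_RE = re.compile(
    r"by thread \((\d+),(\d+),(\d+)\) in block \((\d+),(\d+),(\d+)\)"
)


def summarize_memcheck(log_text):
    """Count error kinds and the CUDA blocks they were reported in."""
    kinds = Counter()
    blocks = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.match(line)
        if match:
            kinds[match.group(1)] += 1
            continue
        match = THREAD_RE.search(line)
        if match:
            # Groups 4..6 are the block indices.
            blocks[tuple(int(g) for g in match.groups()[3:])] += 1
    return kinds, blocks


# Shortened excerpt of the two samples above.
sample = """\
========= Invalid __shared__ read of size 4
=========     by thread (64,0,0) in block (3,0,0)
=========     Address 0x00000e08 is out of bounds
========= Invalid __shared__ read of size 4
=========     by thread (37,0,0) in block (2,0,0)
=========     Address 0x00000e08 is out of bounds
"""

kinds, blocks = summarize_memcheck(sample)
print(kinds)
print(blocks)
```

For a real run, the saved log file contents would be passed in place of `sample`.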

@danpovey
Copy link
Collaborator

danpovey commented Mar 25, 2021 via email

@danpovey
Copy link
Collaborator

From
https://stackoverflow.com/questions/31122170/trace-for-function-name-from-the-output-of-cuda-memcheck
it looks like cuda-gdb has a --show-backtrace option that may help.

@RuABraun
Copy link
Author

Okay, I will try; it may take a bit.

@csukuangfj
Copy link
Collaborator

https://github.com/k2-fsa/k2/issues/696#issuecomment-806060787#695

@RuABraun
Could you try #699? I think you can use CUDA 11 now with that pull request.

@danpovey
Copy link
Collaborator

Thanks, Fangjun!
And thanks, @RuABraun, for helping us find the issue!

@RuABraun
Copy link
Author

Seems to work! I just needed to add logging.getLogger().setLevel(logging.INFO) at the top to get any logging output.
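
For anyone else missing log output: Python's root logger defaults to the WARNING level, so INFO messages are dropped until the level is lowered. A minimal sketch of the effect follows; the logger name "recipe" and the messages are made up for illustration, and the output is captured in a string buffer only so it can be inspected.

```python
import io
import logging


def emit_info(message):
    """Hypothetical stand-in for the training script's logging calls."""
    logging.getLogger("recipe").info(message)


root = logging.getLogger()
stream = io.StringIO()
root.addHandler(logging.StreamHandler(stream))

# Python's default root level, set explicitly here so the demo is
# deterministic: INFO records are filtered out.
root.setLevel(logging.WARNING)
emit_info("dropped")

# The line from the comment above: lower the root level to INFO.
logging.getLogger().setLevel(logging.INFO)
emit_info("shown")

captured = stream.getvalue()
print(captured, end="")
```

Only the second message reaches the handler, which matches the behavior described above.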
