Crash "invalid device function" when running snowfall recipe #696
Comments
Could you post the output of
|
|
Does appending 86 after 75 at line 113 in CMakeLists.txt and recompiling k2 solve the problem? |
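For context, an "invalid device function" error usually indicates that the compiled library contains no kernel image for the GPU's compute capability, which is why extending the architecture list can help. A minimal sketch of that check follows, assuming the build flags are available as a string in the `-gencode arch=...,code=sm_XY` style. The helper names `compiled_archs` and `is_covered` are hypothetical, the flags string is illustrative, and the check is simplified: it only looks for an exact `sm_XY` match and ignores any PTX fallback.

```python
import re


def compiled_archs(cuda_flags):
    """Return the set of sm_XY suffixes named in -gencode flags."""
    return set(re.findall(r"code=sm_(\d+)", cuda_flags))


def is_covered(capability, cuda_flags):
    """True if the (major, minor) capability has an exact sm_XY match."""
    major, minor = capability
    return f"{major}{minor}" in compiled_archs(cuda_flags)


# Illustrative flags only; not the actual flags from the failing build.
flags = (
    "-gencode arch=compute_70,code=sm_70 "
    "-gencode arch=compute_75,code=sm_75"
)

print(is_covered((7, 5), flags))  # capability 7.5 is in the list
print(is_covered((8, 6), flags))  # capability 8.6 is not
```

On a machine with PyTorch and a GPU, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` tuple that could be passed as `capability`.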
Ah thanks, I should have thought of that. That seems to have been it; I'm getting a different error now. |
What's the new error? |
Got several different exceptions, but I think I can figure them out (one was about a file in use, another a GPU OOM). I'll make a comment if I get stuck. :) |
Okay I don't know what is happening 😆
The files in |
I think that if |
So previously I was running via SGE. I tried without it and got an OOM error. I want to reduce the batch size (and confirm that it runs fine without SGE), but I can't figure out how to do that, @pzelasko. I tried changing
Don't know what else I could do to reduce it since the argument to |
You figured it out just fine :) the |
Actually I might have run into the same issue as you -- not 100% sure tho, it was some time ago. |
Yes, it's CUDA 11. Thanks for the tip, I'll change to 10.2 then!
Mm. If we are not compatible with CUDA 11 we should try to detect that and
throw an error.
I wonder if it could be incompatibility with the version of CUDA that
PyTorch was compiled with? Do you remember any details, Piotr?
That error is from PyTorch, and the indexes were computed by PyTorch, which
is odd.
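One way such a detection step could look is sketched below. It is not what k2 does; it only compares the two CUDA fields that appear in the `python -m k2.version` report posted in this thread (`Cuda used to build k2` and `PyTorch is using Cuda`). The function names `cuda_versions` and `check_cuda_match` are hypothetical, and the report is supplied as plain text.

```python
import re


def cuda_versions(report):
    """Extract the two CUDA version fields from a k2 version report."""
    patterns = {
        "k2": r"^Cuda used to build k2:\s*(\S+)",
        "torch": r"^PyTorch is using Cuda:\s*(\S+)",
    }
    found = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, report, flags=re.MULTILINE)
        if match:
            found[key] = match.group(1)
    return found


def check_cuda_match(report):
    """Raise if the two CUDA versions in the report differ."""
    versions = cuda_versions(report)
    if len(versions) < 2:
        raise ValueError("could not find both CUDA version fields")
    if versions["k2"] != versions["torch"]:
        raise RuntimeError(
            f"k2 was built with CUDA {versions['k2']} "
            f"but PyTorch uses CUDA {versions['torch']}"
        )


# Shortened report using field names from the output in this thread.
report = """k2 version: 0.3.3
Cuda used to build k2: 10.2
PyTorch version used to build k2: 1.8.0
PyTorch is using Cuda: 10.2
"""
check_cuda_match(report)  # passes silently when the versions agree
```

A matching version pair does not rule out other incompatibilities; this only catches the most obvious mismatch.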
|
I tried using 10.2, here's the k2 version info:
But I got a different error:
Running on a 1080ti |
Does this happen when you are not using torch.distributed, i.e. when you
are running training as normal?
I believe we had some issues with distributed training, i.e. bugs we had
not resolved; Piotr may remember.
We had trouble debugging the issue, and I preferred that we make progress
on other things because such things are
going to be a lot of work to get to the bottom of.
…On Thu, Mar 25, 2021 at 4:20 PM Rudolf A. Braun ***@***.***> wrote:
I tried using 10.2, here's the k2 version info:
(ten) ***@***.***:/idiap/temp/rbraun/code$ python -m k2.version
Collecting environment information...
k2 version: 0.3.3
Build type: Release
Git SHA1: 1efe57f
Git date: Tue Mar 23 09:49:54 2021
Cuda used to build k2: 10.2
cuDNN used to build k2: 8.0.0
Python version used to build k2: 3.8
OS used to build k2: Debian GNU/Linux 10 (buster)
CMake version: 3.13.4
GCC version: 8.3.0
CMAKE_CUDA_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0
PyTorch version used to build k2: 1.8.0
PyTorch is using Cuda: 10.2
NVTX enabled: True
Disable debug: True
Sync kernels : False
Disable checks: False
But I got a different error:
[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<voi[30/2460]
ray1<char>::*)(char), &k2::Array1<char>::operator=, 1>, char*, const char>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0
) Error: an illegal memory access was encountered.
[ Stack-Trace: ]
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7f4ab354d084]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x2a) [0x7f4ab36be78a]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::opera
tor=, 1u>, char*, char const> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1u>, char*, char
const>&)+0x16c) [0x7f4ab370e65c]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x1138) [0x7f4ab3700728]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x85) [0x7f4ab3700de5]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x4eba7) [0x7f4aba81eba7]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x23cd5) [0x7f4aba7f3cd5]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(PyCFunction_Call+0x58) [0x55db6b8592d8]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_MakeTpCall+0x23c) [0x55db6b848edc]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalFrameDefault+0x45a9) [0x55db6b8d4879]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x19ad6a) [0x55db6b89fd6a]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_GenericGetAttrWithDict+0xfb) [0x55db6b785beb]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x15c09d) [0x55db6b86109d]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalFrameDefault+0x9be) [0x55db6b8d0c8e]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x19a85b) [0x55db6b89f85b]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x1039bd) [0x55db6b8089bd]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalCodeWithName+0x300) [0x55db6b89e760]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyFunction_Vectorcall+0x1e3) [0x55db6b89f593]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_Call_Prepend+0x291) [0x55db6b8a0161]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x19b48a) [0x55db6b8a048a]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyObject_MakeTpCall+0x1ae) [0x55db6b848e4e]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalFrameDefault+0x11dd) [0x55db6b8d14ad]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyFunction_Vectorcall+0x10b) [0x55db6b89f4bb]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x10425f) [0x55db6b80925f]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(_PyEval_EvalCodeWithName+0x300) [0x55db6b89e760]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(PyEval_EvalCode+0x23) [0x55db6b9334e3]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x22e584) [0x55db6b933584]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x2547c4) [0x55db6b9597c4]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x115620) [0x55db6b81a620]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(PyRun_SimpleFileExFlags+0x384) [0x55db6b81d362]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x118e80) [0x55db6b81de80]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(Py_BytesMain+0x39) [0x55db6b95c979]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f4b3d62c09b]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python(+0x1e7185) [0x55db6b8ec185]
Traceback (most recent call last):
File "./mmi_bigram_train.py", line 559, in <module>
main()
File "./mmi_bigram_train.py", line 339, in main
graph_compiler = MmiTrainingGraphCompiler(
File "/remote/idiap.svm/temp.speech01/rbraun/code/snowfall/snowfall/training/mmi_graph.py", line 84, in __init__
self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
self.properties
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
Traceback (most recent call last):
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/idiap/temp/rbraun/programs/anaconda3/envs/ten/bin/python', '-u', './mmi_bigram_train.py', '--local_rank=0', '--world_s
ize', '1']' returned non-zero exit status 1.
Killing subprocess 4431
# Accounting: time=28 threads=1
# Finished at Wed Mar 24 21:04:14 CET 2021 with status 1
|
Yeah that was with, just tried without:
|
I assume the tests succeeded (cd build; ctest)
…On Thu, Mar 25, 2021 at 4:33 PM Rudolf A. Braun ***@***.***> wrote:
Yeah that was with, just tried without:
# python ./mmi_bigram_train.py --world_size 1
[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Ar
ray1<char>::*)(char), &k2::Array1<char>::operator=, 1>, char*, const char>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0
) Error: an illegal memory access was encountered.
[ Stack-Trace: ]
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7efb7920a084]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x2a) [0x7efb7937b78a]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::opera
tor=, 1u>, char*, char const> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<void (k2::Array1<char>::*)(char), &k2::Array1<char>::operator=, 1u>, char*, char
const>&)+0x16c) [0x7efb793cb65c]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x1138) [0x7efb793bd728]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x85) [0x7efb793bdde5]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x4eba7) [0x7efb804dbba7]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x23cd5) [0x7efb804b0cd5]
python(PyCFunction_Call+0x58) [0x562aae5382d8]
python(_PyObject_MakeTpCall+0x23c) [0x562aae527edc]
python(_PyEval_EvalFrameDefault+0x45a9) [0x562aae5b3879]
python(+0x19ad6a) [0x562aae57ed6a]
python(_PyObject_GenericGetAttrWithDict+0xfb) [0x562aae464beb]
python(+0x15c09d) [0x562aae54009d]
python(_PyEval_EvalFrameDefault+0x9be) [0x562aae5afc8e]
python(+0x19a85b) [0x562aae57e85b]
python(+0x1039bd) [0x562aae4e79bd]
python(_PyEval_EvalCodeWithName+0x300) [0x562aae57d760]
python(_PyFunction_Vectorcall+0x1e3) [0x562aae57e593]
python(_PyObject_Call_Prepend+0x291) [0x562aae57f161]
python(+0x19b48a) [0x562aae57f48a]
python(_PyObject_MakeTpCall+0x1ae) [0x562aae527e4e]
python(_PyEval_EvalFrameDefault+0x11dd) [0x562aae5b04ad]
python(_PyFunction_Vectorcall+0x10b) [0x562aae57e4bb]
python(+0x10425f) [0x562aae4e825f]
python(_PyEval_EvalCodeWithName+0x300) [0x562aae57d760]
python(PyEval_EvalCode+0x23) [0x562aae6124e3]
python(+0x22e584) [0x562aae612584]
python(+0x2547c4) [0x562aae6387c4]
python(+0x115620) [0x562aae4f9620]
python(PyRun_SimpleFileExFlags+0x384) [0x562aae4fc362]
python(+0x118e80) [0x562aae4fce80]
python(Py_BytesMain+0x39) [0x562aae63b979]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7efc032e909b]
python(+0x1e7185) [0x562aae5cb185]
World size: 1 Rank: 0
Traceback (most recent call last):
File "./mmi_bigram_train.py", line 559, in <module>
main()
File "./mmi_bigram_train.py", line 339, in main
graph_compiler = MmiTrainingGraphCompiler(
File "/remote/idiap.svm/temp.speech01/rbraun/code/snowfall/snowfall/training/mmi_graph.py", line 84, in __init__
self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
self.properties
File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
# Accounting: time=23 threads=1
# Finished at Thu Mar 25 09:32:09 CET 2021 with status 1
|
Can you do
export CUDA_VISIBLE_DEVICES=0
or something like that, to make sure it's only trying to use one GPU?
I'm sure it wouldn't use more than one, but I just want to be more certain.
Possibly running it in cuda-gdb might help to do things like print out the
address of the pointer it's failing on, or figure out why it doesn't
seem to be a valid memory pointer in whatever context it's running in.
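The same restriction can be applied from inside a Python script instead of the shell. The sketch below is only an illustration; `restrict_to_gpu` is a hypothetical helper name. The key assumption is ordering: the variable has to be set before anything in the process initializes CUDA.

```python
import os


def restrict_to_gpu(index):
    """Expose only one physical GPU to CUDA in this process."""
    # This must run before CUDA is initialized (for example, before the
    # first torch.cuda call); changing it afterwards has no effect on an
    # already-initialized CUDA runtime.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


restrict_to_gpu(0)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

With this in place, the selected GPU appears to the process as device 0 regardless of its physical index.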
|
Mm, it seems to be failing right at the start of
GetFsaVecBasicProperties(), and since your k2 is compiled in release mode
it's not syncing kernels,
so it's possibly a previous error, i.e. from whatever was called prior to
that.
It may be possible to run that command prefixed by "nsys profile" to get a
profile file, which might show what was run before that, if anything, within
that process, that might have caused the problem. (You have to then access
the .qdrep file on a desktop, e.g. a Mac or Windows machine,
using NVIDIA Nsight Systems).
I'm afraid this might not be easy to debug.
You could see if it also happens when run interactively from the command
line.. it could be something to do with the scheduling system.
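A small sketch of how such a debugging run could be assembled from Python is shown below. `debug_launch` is a hypothetical helper, not part of k2 or snowfall. It prefixes the command with `nsys profile` as suggested above and, as an addition not mentioned in the thread, can set the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so that an error is reported closer to the launch that caused it.

```python
import os
import shlex


def debug_launch(script_args, use_nsys=False, blocking=True):
    """Return (command, env) for a debugging run of a training script."""
    env = dict(os.environ)
    if blocking:
        # Synchronous launches: errors surface at the offending launch
        # rather than at some later, unrelated CUDA call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    command = ["python"] + list(script_args)
    if use_nsys:
        # Prefix with the profiler, as suggested in the comment above.
        command = ["nsys", "profile"] + command
    return command, env


command, env = debug_launch(
    ["./mmi_bigram_train.py", "--world_size", "1"], use_nsys=True
)
print(shlex.join(command))
```

The returned pair can be handed to `subprocess.run(command, env=env, check=True)` on a machine where `nsys` is installed.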
|
Some extra info: I compiled without release mode, ran the tests, and got a few timeout failures. I don't have a GPU available locally (unfortunately, the GPU nodes where I am do not have cmake). Could that be the issue, since the tests time out?
This is what I get while running with CUDA_VISIBLE_DEVICES=0 and no release mode:
|
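When running with cuda-memcheck I get a bunch of Invalid __shared__ read of size 4. Not sure if I should paste the whole log; it is rather massive. |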
Can you show one or two of those __shared__ read things? it might be a
false alarm from some torch code or from iostream code.
That illegal memory access looks to me like a compilation issue or some
kind of confusion about GPU context.
The last cuda-memcheck warning or error before it dies might say something.
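When the log is too large to paste, a short script can reduce it to the two things asked for here: how many errors of each kind there are, and the last error block before the process dies. The sketch below assumes the log uses cuda-memcheck's `========= Invalid ... read of size N` header style; `summarize_memcheck` is a hypothetical helper, and the sample log is a shortened illustration rather than a real capture.

```python
import re
from collections import Counter

# Header line that starts a new memory-error report.
HEADER = re.compile(r"^=========\s+(Invalid \S+ (?:read|write) of size \d+)")


def summarize_memcheck(log):
    """Count error kinds and return the last error block in the log."""
    counts = Counter()
    blocks = []
    for line in log.splitlines():
        match = HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
            blocks.append([line])
        elif blocks and line.startswith("========="):
            # Continuation line of the current error report.
            blocks[-1].append(line)
    last_block = "\n".join(blocks[-1]) if blocks else ""
    return counts, last_block


log = """========= Invalid __shared__ read of size 4
=========     by thread (64,0,0) in block (3,0,0)
=========     Address 0x00000e08 is out of bounds
========= Invalid __shared__ read of size 4
=========     by thread (37,0,0) in block (2,0,0)
=========     Address 0x00000e08 is out of bounds
"""
counts, last_block = summarize_memcheck(log)
print(counts.most_common())
print(last_block)
```

Lines that do not start with `=========`, such as wrapped fragments of mangled symbol names, are skipped by this sketch, so the last block may be missing part of a long kernel name.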
…On Thu, Mar 25, 2021 at 6:46 PM Rudolf A. Braun ***@***.***> wrote:
When running with cuda-memcheck I get a bunch of Invalid __shared__ read
of size 4. Not sure if I should paste the whole log it is rather massive.
|
sample
sample 2 (they all look the same)
final output
|
See if you can get cuda-gdb to stop on invalid shared read or show more
information about the host stack at launch time...
basically I need some kind of line numbers. This could actually be a bug,
one that for some reason didn't show up before.
…On Thu, Mar 25, 2021 at 8:13 PM Rudolf A. Braun ***@***.***> wrote:
sample
========= Invalid __shared__ read of size 4
========= at 0x00000db8 in _ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi
128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEEZNS_13transform_lbsIS5_ZNS_13transform_lbsIS5_ZNS_19load_balance_searchIS5_PKiPiEEviT0_iT1_RNS_9context_tEEUliiiE_S
H_JEEEvSJ_iSK_iSM_DpT2_EUliiiNS_5tupleIJEEEE_SH_SR_JEEEvSJ_iSK_iSO_SM_DpT3_EUliiE_JEEEvSJ_iDpSK_
========= by thread (64,0,0) in block (3,0,0)
========= Address 0x00000e08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5b88]
========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 [0x101cb]
========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 (cudaLaunchKernel + 0x1b5) [0x53765]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40b669]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40adae]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40ade2]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128E
Li11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiR
NS_9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiS
G_iNS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iDpT1_ + 0x1b) [0x40bf7d]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu10cta_launchINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11EL
i9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_9co
ntext_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_iNS_
5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iSJ_DpT1_ + 0x14c) [0x40cd60]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu13cta_transformINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi1
1ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_
9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_i
NS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iSJ_DpT1_ + 0x83) [0x40c96f]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu13transform_lbsINS_7empty_tE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvS2_IS3_IP
FviPKiiPiRNS_9context_tEEXadL_ZNS_19load_balance_searchIS1_S5_S6_EEviT0_iT1_S8_EELj1EEJS6_EEiS5_iS8_EXadL_ZNS_13transform_lbsIS1_SF_S5_JEEEvSC_iSD_iS8_DpT2_EELj1
EEJSF_EES5_NS_5tupleIJEEEJEEEvSC_iSD_iT2_S8_DpT3_ + 0xf8) [0x40bc09]
sample 2 (they all look the same)
========= Invalid __shared__ read of size 4
========= at 0x00000db8 in _ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi
128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEEZNS_13transform_lbsIS5_ZNS_13transform_lbsIS5_ZNS_19load_balance_searchIS5_PKiPiEEviT0_iT1_RNS_9context_tEEUliiiE_S
H_JEEEvSJ_iSK_iSM_DpT2_EUliiiNS_5tupleIJEEEE_SH_SR_JEEEvSJ_iSK_iSO_SM_DpT3_EUliiE_JEEEvSJ_iDpSK_
========= by thread (37,0,0) in block (2,0,0)
========= Address 0x00000e08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5b88]
========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 [0x101cb]
========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 (cudaLaunchKernel + 0x1b5) [0x53765]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40b669]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40adae]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40ade2]
========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128E
Li11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiR
NS_9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiS
G_iNS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iDpT1_ + 0x1b) [0x40bf7d]
final output
[F] /idiap/temp/rbraun/code/k2/k2/csrc/eval.h:134:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), k2::GetFsaVecBasicProperties, 1>, k2::Arc*, const int*, const int*, const int*, const int*, char*, int, int*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (719 vs. 0) Error: unspecified launch failure.
[ Stack-Trace: ]
/idiap/temp/rbraun/code/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x7f0348134edd]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::internal::Logger::~Logger()+0x35) [0x7f03484739f3]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int, int*> >(CUstream_st*, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int, int*>&)+0x354) [0x7f03484c44d7]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(void k2::EvalDevice<std::shared_ptr<k2::Context>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int, int*> >(std::shared_ptr<k2::Context>, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*), &k2::GetFsaVecBasicProperties, 1u>, k2::Arc*, int const*, int const*, int const*, int const*, char*, int, int*>&)+0x42) [0x7f03484bfe64]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaVecBasicProperties(k2::Ragged<k2::Arc>&, k2::Array1<int>*, int*)+0x3de) [0x7f03484b7095]
/idiap/temp/rbraun/code/k2/build/lib/libk2context.so(k2::GetFsaBasicProperties(k2::Ragged<k2::Arc> const&)+0x7f) [0x7f03484b78f3]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb4409) [0x7f034f8ab409]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb054d) [0x7f034f8a754d]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xabf80) [0x7f034f8a2f80]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xac08f) [0x7f034f8a308f]
/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x514e2) [0x7f034f8484e2]
python(PyCFunction_Call+0x58) [0x55a6e2a402d8]
python(_PyObject_MakeTpCall+0x23c) [0x55a6e2a2fedc]
python(_PyEval_EvalFrameDefault+0x45a9) [0x55a6e2abb879]
python(+0x19ad6a) [0x55a6e2a86d6a]
python(_PyObject_GenericGetAttrWithDict+0xfb) [0x55a6e296cbeb]
python(+0x15c09d) [0x55a6e2a4809d]
python(_PyEval_EvalFrameDefault+0x9be) [0x55a6e2ab7c8e]
python(_PyEval_EvalCodeWithName+0x300) [0x55a6e2a85760]
python(_PyFunction_Vectorcall+0x1e3) [0x55a6e2a86593]
python(_PyObject_Call_Prepend+0x291) [0x55a6e2a87161]
python(+0x19b48a) [0x55a6e2a8748a]
python(_PyObject_MakeTpCall+0x1ae) [0x55a6e2a2fe4e]
python(_PyEval_EvalFrameDefault+0x11dd) [0x55a6e2ab84ad]
python(_PyFunction_Vectorcall+0x10b) [0x55a6e2a864bb]
python(+0x10425f) [0x55a6e29f025f]
python(_PyEval_EvalCodeWithName+0x300) [0x55a6e2a85760]
python(PyEval_EvalCode+0x23) [0x55a6e2b1a4e3]
python(+0x22e584) [0x55a6e2b1a584]
python(+0x2547c4) [0x55a6e2b407c4]
python(+0x115620) [0x55a6e2a01620]
python(PyRun_SimpleFileExFlags+0x384) [0x55a6e2a04362]
python(+0x118e80) [0x55a6e2a04e80]
python(Py_BytesMain+0x39) [0x55a6e2b43979]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f040dbd609b]
python(+0x1e7185) [0x55a6e2ad3185]
Traceback (most recent call last):
  File "mmi_bigram_train.py", line 559, in <module>
    main()
  File "mmi_bigram_train.py", line 339, in main
    graph_compiler = MmiTrainingGraphCompiler(
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/snowfall/training/mmi_graph.py", line 84, in __init__
    self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 816, in invert_
    self.properties
  File "/idiap/temp/rbraun/programs/anaconda3/envs/ten/lib/python3.8/site-packages/k2/fsa.py", line 367, in properties
    properties = _k2.get_fsa_basic_properties(self.arcs)
RuntimeError: Some bad things happed.
========= ERROR SUMMARY: 192 errors
|
... I mean where it fails in transform_lbs. Need to figure out the context
within k2.
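As a rough aid for reading cuda-memcheck reports like the ones posted in this thread, here is a minimal Python sketch that tallies the error kinds and the faulting thread/block coordinates. The function name `summarize_memcheck` and both regular expressions are my own assumptions based on the log lines shown here; they are not part of k2 or cuda-memcheck, and they do not recover source line numbers.

```python
import re
from collections import Counter

# Header of one cuda-memcheck error record, e.g.
# "========= Invalid __shared__ read of size 4".
ERROR_RE = re.compile(r"=+\s+Invalid (\S+) (read|write) of size (\d+)")
# Faulting coordinates, e.g. "by thread (64,0,0) in block (3,0,0)".
WHERE_RE = re.compile(
    r"by thread \((\d+),(\d+),(\d+)\) in block \((\d+),(\d+),(\d+)\)"
)


def summarize_memcheck(report):
    """Return (Counter of error kinds, list of (thread, block) tuples)."""
    kinds = Counter()
    where = []
    for raw in report.splitlines():
        line = raw.lstrip("> ")  # tolerate e-mail quote markers
        m = ERROR_RE.match(line)
        if m:
            space, access, size = m.groups()
            kinds[f"{space} {access} of size {size}"] += 1
            continue
        m = WHERE_RE.search(line)
        if m:
            nums = tuple(int(g) for g in m.groups())
            where.append((nums[:3], nums[3:]))
    return kinds, where


# Two records abbreviated from the report in this thread.
sample = """\
> ========= Invalid __shared__ read of size 4
> ========= by thread (64,0,0) in block (3,0,0)
========= Invalid __shared__ read of size 4
========= by thread (37,0,0) in block (2,0,0)
"""
kinds, where = summarize_memcheck(sample)
print(kinds)
print(where)
```

If every record reports the same access kind and the same out-of-bounds address, as seems to be the case here, that hints at a single faulty kernel configuration rather than scattered corruption.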
…On Thu, Mar 25, 2021 at 8:37 PM Daniel Povey ***@***.***> wrote:
See if you can get cuda-gdb to stop on invalid shared read or show more
information about the host stack at launch time...
basically I need some kind of line numbers. This could actually be a
bug, one that for some reason didn't show up before.
On Thu, Mar 25, 2021 at 8:13 PM Rudolf A. Braun ***@***.***>
wrote:
> sample
>
>
> ========= Invalid __shared__ read of size 4
> ========= at 0x00000db8 in _ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi
> 128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEEZNS_13transform_lbsIS5_ZNS_13transform_lbsIS5_ZNS_19load_balance_searchIS5_PKiPiEEviT0_iT1_RNS_9context_tEEUliiiE_S
> H_JEEEvSJ_iSK_iSM_DpT2_EUliiiNS_5tupleIJEEEE_SH_SR_JEEEvSJ_iSK_iSO_SM_DpT3_EUliiE_JEEEvSJ_iDpSK_
> ========= by thread (64,0,0) in block (3,0,0)
> ========= Address 0x00000e08 is out of bounds
> ========= Saved host backtrace up to driver entry point at kernel launch time
> ========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5b88]
> ========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 [0x101cb]
> ========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 (cudaLaunchKernel + 0x1b5) [0x53765]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40b669]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40adae]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40ade2]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128E
> Li11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiR
> NS_9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiS
> G_iNS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iDpT1_ + 0x1b) [0x40bf7d]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu10cta_launchINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11EL
> i9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_9co
> ntext_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_iNS_
> 5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iSJ_DpT1_ + 0x14c) [0x40cd60]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu13cta_transformINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi1
> 1ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_
> 9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_i
> NS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iSJ_DpT1_ + 0x83) [0x40c96f]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu13transform_lbsINS_7empty_tE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvS2_IS3_IP
> FviPKiiPiRNS_9context_tEEXadL_ZNS_19load_balance_searchIS1_S5_S6_EEviT0_iT1_S8_EELj1EEJS6_EEiS5_iS8_EXadL_ZNS_13transform_lbsIS1_SF_S5_JEEEvSC_iSD_iS8_DpT2_EELj1
> EEJSF_EES5_NS_5tupleIJEEEJEEEvSC_iSD_iT2_S8_DpT3_ + 0xf8) [0x40bc09]
> sample 2 (they all look the same)
>
> ========= Invalid __shared__ read of size 4
> ========= at 0x00000db8 in _ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEEZNS_13transform_lbsIS5_ZNS_13transform_lbsIS5_ZNS_19load_balance_searchIS5_PKiPiEEviT0_iT1_RNS_9context_tEEUliiiE_SH_JEEEvSJ_iSK_iSM_DpT2_EUliiiNS_5tupleIJEEEE_SH_SR_JEEEvSJ_iSK_iSO_SM_DpT3_EUliiE_JEEEvSJ_iDpSK_
> ========= by thread (37,0,0) in block (2,0,0)
> ========= Address 0x00000e08 is out of bounds
> ========= Saved host backtrace up to driver entry point at kernel launch time
> ========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5b88]
> ========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 [0x101cb]
> ========= Host Frame:/lib/x86_64-linux-gnu/libcudart.so.11.0 (cudaLaunchKernel + 0x1b5) [0x53765]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40b669]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40adae]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so [0x40ade2]
> ========= Host Frame:/idiap/temp/rbraun/code/k2/build/lib/libk2context.so (_ZN4mgpu16launch_box_cta_kINS_12launch_box_tIJNS_7arch_20INS_12launch_cta_tILi128ELi11ELi9ELi0EEENS_7empty_tEEENS_7arch_35INS3_ILi128ELi7ELi5ELi0EEES5_EENS_7arch_52IS4_S5_EEEEE17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSD_ISE_IPFvSD_ISE_IPFviPKiiPiRNS_9context_tEEXadL_ZNS_19load_balance_searchIS5_SG_SH_EEviT0_iT1_SJ_EELj1EEJSH_EEiSG_iSJ_EXadL_ZNS_13transform_lbsIS5_SQ_SG_JEEEvSN_iSO_iSJ_DpT2_EELj1EEJSQ_EEiSG_iNS_5tupleIJEEESJ_EXadL_ZNS_13transform_lbsIS5_SX_SG_SZ_JEEEvSN_iSO_iT2_SJ_DpT3_EELj1EEJiSG_iSG_SZ_SX_EEJEEEvSN_iDpT1_ + 0x1b) [0x40bf7d]
>
|
From |
Okay I will try, may need a bit |
https://github.com/k2-fsa/k2/issues/696#issuecomment-806060787#695 @RuABraun |
Thanks, Fangjun! |
Seems to work! Just I needed to add |
Hi,
I want to run snowfall's librispeech recipe. I'm getting a crash when training starts that seems to be related to k2:
This is the output of PyTorch's collect_env:
The GPU is an RTX 3090. I built k2 for release as described in the documentation; the version information:
Maybe I should restrict myself to CUDA 10.2?
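For context on why an "invalid device function" error can appear on this card: the RTX 3090 has compute capability 8.6, and the fix suggested in this thread was to add 86 to the architecture list in k2's CMakeLists.txt. The sketch below only illustrates that check in Python. The helper name `arch_is_built` and the architecture list are hypothetical (the real list in CMakeLists.txt is not shown in this thread), and it assumes the build contains device code only for the listed architectures, with no PTX fallback.

```python
def arch_is_built(capability, built_archs):
    """True if the (major, minor) compute capability has a matching entry.

    `capability` has the shape returned by torch.cuda.get_device_capability(),
    e.g. (8, 6); `built_archs` is a list such as ["70", "75"].
    """
    major, minor = capability
    return f"{major}{minor}" in {str(a) for a in built_archs}


# Hypothetical arch list ending in 75; the real one lives in CMakeLists.txt.
archs_before = ["60", "70", "75"]
rtx3090 = (8, 6)

print(arch_is_built(rtx3090, archs_before))           # False
print(arch_is_built(rtx3090, archs_before + ["86"]))  # True
```

Under these assumptions the first check fails, which matches the symptom: no kernel image was compiled for the device, so rebuilding with 86 in the list is a more direct fix than downgrading CUDA.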