[BUG] Rare unspecified launch failure in random projection RandomMatrixCheck test #3561
Labels: ? - Needs Triage, bug, inactive-30d, inactive-90d
Seen in: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=ubuntu18.04,PYTHON=3.8/667/
Example failure:
```
23:11:38 [----------] 2 tests from RPROJTestF2
23:11:38 [ RUN ] RPROJTestF2.RandomMatrixCheck
23:11:39 unknown file: Failure
23:11:39 C++ exception with description "CUDA error encountered at: file=/opt/conda/envs/rapids/conda-bld/libcuml_1613537190049/work/cpp/build/raft/src/raft/cpp/include/raft/cudart_utils.h line=205: call='cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)', Reason=cudaErrorLaunchFailure:unspecified launch failure
23:11:39 Obtained 16 stack frames
23:11:39 #0 in ./test/ml(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x4ba9bb]
23:11:39 #1 in ./test/ml(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x63) [0x4bb173]
23:11:39 #2 in ./test/ml(_ZN4raft4copyIiEEvPT_PKS1_mP11CUstream_st+0x12c) [0x4ef90c]
23:11:39 #3 in /tmp/workspace/rapidsai/gpuci/cuml/prb/cuml-gpu-test/CUDA/11.0/GPU_LABEL/gpu-a100/OS/ubuntu18.04/PYTHON/3.8/ci/artifacts/cuml/cpu/conda_work/cpp/build/libcuml++.so(_Z8binomialRKN4raft8handle_tEmdi+0x28d) [0x7f8159c7805d]
23:11:39 #4 in /tmp/workspace/rapidsai/gpuci/cuml/prb/cuml-gpu-test/CUDA/11.0/GPU_LABEL/gpu-a100/OS/ubuntu18.04/PYTHON/3.8/ci/artifacts/cuml/cpu/conda_work/cpp/build/libcuml++.so(_ZN2ML20sparse_random_matrixIfEEvRKN4raft8handle_tEPNS_8rand_matIT_EERNS_11paramsRPROJE+0x324) [0x7f8159c7b324]
23:11:39 #5 in /tmp/workspace/rapidsai/gpuci/cuml/prb/cuml-gpu-test/CUDA/11.0/GPU_LABEL/gpu-a100/OS/ubuntu18.04/PYTHON/3.8/ci/artifacts/cuml/cpu/conda_work/cpp/build/libcuml++.so(_ZN2ML8RPROJfitIfEEvRKN4raft8handle_tEPNS_8rand_matIT_EEPNS_11paramsRPROJE+0x27f) [0x7f8159c7ba6f]
23:11:39 #6 in ./test/ml() [0x6b7ac8]
23:11:39 #7 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4e) [0x7f81c995d98e]
23:11:39 #8 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing4Test3RunEv+0x64) [0x7f81c995db64]
23:11:39 #9 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8TestInfo3RunEv+0x13f) [0x7f81c995df0f]
23:11:39 #10 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing9TestSuite3RunEv+0x106) [0x7f81c995e036]
23:11:39 #11 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x4dc) [0x7f81c995e5ec]
23:11:39 #12 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8UnitTest3RunEv+0xd9) [0x7f81c995e859]
23:11:39 #13 in /opt/conda/envs/rapids/lib/libgtest_main.so(main+0x3f) [0x7f81c990d07f]
23:11:39 #14 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f8158a3fbf7]
23:11:39 #15 in ./test/ml() [0x4a8ea9]
23:11:39 " thrown in SetUp().
23:11:39 unknown file: Failure
23:11:39 C++ exception with description "CUDA error encountered at: file=/opt/conda/envs/rapids/conda-bld/libcuml_1613537190049/work/cpp/test/sg/rproj_test.cu line=118: call='cudaFree(d_input)', Reason=cudaErrorLaunchFailure:unspecified launch failure
23:11:39 Obtained 11 stack frames
23:11:39 #0 in ./test/ml(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x4ba9bb]
23:11:39 #1 in ./test/ml(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x63) [0x4bb173]
23:11:39 #2 in ./test/ml() [0x6b7f73]
23:11:39 #3 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4e) [0x7f81c995d98e]
23:11:39 #4 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8TestInfo3RunEv+0x13f) [0x7f81c995df0f]
23:11:39 #5 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing9TestSuite3RunEv+0x106) [0x7f81c995e036]
23:11:39 #6 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x4dc) [0x7f81c995e5ec]
23:11:39 #7 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8UnitTest3RunEv+0xd9) [0x7f81c995e859]
23:11:39 #8 in /opt/conda/envs/rapids/lib/libgtest_main.so(main+0x3f) [0x7f81c990d07f]
23:11:39 #9 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f8158a3fbf7]
23:11:39 #10 in ./test/ml() [0x4a8ea9]
23:11:39 " thrown in TearDown().
23:11:39 terminate called after throwing an instance of 'raft::cuda_error'
23:11:39 what(): CUDA error encountered at: file=/opt/conda/envs/rapids/conda-bld/libcuml_1613537190049/work/cpp/build/raft/src/raft/cpp/include/raft/handle.hpp line=246: call='cudaEventDestroy(event)', Reason=cudaErrorLaunchFailure:unspecified launch failure
23:11:39 Obtained 13 stack frames
23:11:39 #0 in ./test/ml(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x4ba9bb]
23:11:39 #1 in ./test/ml(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x63) [0x4bb173]
23:11:39 #2 in ./test/ml(_ZN4raft8handle_t17destroy_resourcesEv+0x74c) [0x4bd7bc]
23:11:39 #3 in ./test/ml(_ZN4raft8handle_tD2Ev+0x17) [0x4be537]
23:11:39 #4 in ./test/ml() [0x6b6b05]
23:11:39 #5 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4e) [0x7f81c995d98e]
23:11:39 #6 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8TestInfo3RunEv+0xe7) [0x7f81c995deb7]
23:11:39 #7 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing9TestSuite3RunEv+0x106) [0x7f81c995e036]
23:11:39 #8 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x4dc) [0x7f81c995e5ec]
23:11:39 #9 in /opt/conda/envs/rapids/lib/libgtest.so(_ZN7testing8UnitTest3RunEv+0xd9) [0x7f81c995e859]
23:11:39 #10 in /opt/conda/envs/rapids/lib/libgtest_main.so(main+0x3f) [0x7f81c990d07f]
23:11:39 #11 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f8158a3fbf7]
23:11:39 #12 in ./test/ml() [0x4a8ea9]
23:11:39
23:11:40 ci/gpu/build.sh: line 242: 8128 Aborted (core dumped) GTEST_OUTPUT="xml:${WORKSPACE}/test-results/libcuml_cpp/" ./test/ml
23:11:40 Build step 'Execute shell' marked build as failure
```