[BUG] Sporadic pytest crash in hdbscan #3997

dantegd · 2021-06-17T15:50:24Z

Describe the bug
Errors in a couple of hdbscan tests:

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_circles-1000]
...
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-True-25-0.0-15-noisy_moons-1000]
...
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Found the logs:

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1048/CUDA=11.2,GPU_LABEL=gpu,OS=ubuntu18.04,PYTHON=3.8/

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1048/CUDA=11.2,GPU_LABEL=gpu,OS=centos7,PYTHON=3.8/

Full error, in case the logs are no longer available:

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_circles-1000] Label prop iterations: 9
Label prop iterations: 4
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2933,122,82,24,434,805
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted

Current thread 0x00007f5bf449d740 (most recent call first):
File "/workspace/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
File "/workspace/python/cuml/test/test_hdbscan.py", line 189 in test_hdbscan_cluster_patterns
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 255 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
File "/opt/conda/envs/rapids/bin/pytest", line 11 in
ci/gpu/build.sh: line 249: 13498 Aborted (core dumped) pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask --ignore=cuml/raft --cov-
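
Because the crash aborts the whole interpreter and only shows up sporadically, one way to chase it locally is to rerun the single parametrized test in a fresh process until it fails. This is a minimal sketch and not part of the CI setup: the helper name `run_until_failure` and the default run count are assumptions, and the node ID is copied from the log above.

```python
import subprocess
import sys

# Node ID copied from the failing CI log above.
NODE_ID = (
    "cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns"
    "[knn-eom-0-False-25-0.0-15-noisy_circles-1000]"
)


def run_until_failure(cmd, max_runs=50):
    """Run cmd in a fresh process up to max_runs times.

    Returns (run_number, returncode) for the first failing run, or None
    if every run exits with status 0. On POSIX, a process killed by a
    signal such as SIGABRT reports a negative return code.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run, result.returncode
    return None
```

In a cuML development environment, a call such as `run_until_failure([sys.executable, "-m", "pytest", "-x", "-s", NODE_ID])` would stop at the first aborted run. Using a new process per run matters here because the abort takes the Python interpreter down with it.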

Update 06/23/21:

In an unrelated PR (#4001) the A100-40GB job failed with:

sk_agg = HDBSCAN(algorithm='generic', approx_min_span_tree=False, gen_min_span_tree=True,
        min_cluster_size=25, min_samples=15)
cuml_agg = HDBSCAN(), digits = 25
    def assert_cluster_counts(sk_agg, cuml_agg, digits=25):
        sk_unique, sk_counts = np.unique(sk_agg.labels_, return_counts=True)
        sk_counts = np.sort(sk_counts)
        cu_unique, cu_counts = cp.unique(cuml_agg.labels_, return_counts=True)
        cu_counts = cp.sort(cu_counts).get()
>       np.testing.assert_almost_equal(sk_counts, cu_counts, decimal=-1 * digits)
E       AssertionError: 
E       Arrays are not almost equal to -25 decimals
E       
E       (shapes (2,), (3,) mismatch)
E        x: array([500, 500])
E        y: array([  6, 497, 497])

Log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=ubuntu16.04,PYTHON=3.7/1063/
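
Note that this second failure is a shape mismatch rather than a tolerance problem: scikit-learn found two groups of 500, while cuML produced three label groups (6, 497, 497), so the assertion fails before any counts are compared. The sketch below reproduces that behaviour with the arrays from the log, using NumPy only. The helper name `counts_match` is hypothetical, and the second comparison uses made-up counts purely for contrast.

```python
import numpy as np

# Counts copied from the failure log above.
sk_counts = np.array([500, 500])
cu_counts = np.array([6, 497, 497])


def counts_match(expected, actual, digits=25):
    """Return True if the count arrays pass the same check as the test.

    With decimal=-digits the tolerance is on the order of 10**digits,
    so only a difference in the number of groups can make it fail.
    """
    try:
        np.testing.assert_almost_equal(expected, actual, decimal=-1 * digits)
    except AssertionError:
        return False
    return True


# Different number of label groups: fails regardless of the tolerance.
shape_mismatch = counts_match(sk_counts, cu_counts)

# Illustrative (made-up) counts with the same number of groups: tolerated.
same_shape = counts_match(np.array([500, 500]), np.array([497, 503]))
```

In other words, with `digits=25` the test effectively checks only that both implementations return the same number of distinct labels.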

dantegd · 2021-06-17T17:28:17Z

cc @Salonijain27 @cjnolet @divyegala I updated the issue with the full error and link to the logs

divyegala · 2021-06-17T18:50:50Z

Interesting. It looks like the failure is reported on a thrust transform, which is at best a standard for_each kernel.

dantegd · 2021-06-22T18:13:25Z

Update: It happened again in PR #4002 in this log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1059/CUDA=11.2,GPU_LABEL=gpu,OS=ubuntu18.04,PYTHON=3.8/consoleText

Full error, in case the logs are no longer available:

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_moons-1000] Label prop iterations: 12
Label prop iterations: 6
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2763,118,91,18,349,977
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted

Current thread 0x00007fe7b94f1740 (most recent call first):
File "/workspace/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
File "/workspace/python/cuml/test/test_hdbscan.py", line 195 in test_hdbscan_cluster_patterns
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 255 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
File "/opt/conda/envs/rapids/bin/pytest", line 11 in

divyegala · 2021-07-01T20:52:17Z

@viclafargue was able to reproduce the thrust::transform issue on his local machine, and I just saw it as well while solving the A100 bug. Here's a snippet of Victor running compute-sanitizer:

========= Invalid __global__ read of size 4 bytes
=========     at 0x7580 in _ZN4raft8distance15fusedL2NNkernelIfN3cub12KeyValuePairIifEEiLb1ENS_6linalg12KernelPolicyIfLi2ELi32ELi4ELi4ELi16ELi16EEEN2ML7HDBSCAN22FixConnectivitiesRedOpIifEESB_ZNS0_13fusedL2NNImplIfS4_iLi2ESB_SB_EEvPT0_PKT_SH_SH_SH_T1_SI_SI_PiT3_T4_bbP11CUstream_stEUlRfSO_SO_E_ZSC_IfS4_iLi2ESB_SB_EvSE_SH_SH_SH_SH_SI_SI_SI_SJ_SK_SL_bbSN_EUlfiE_EEvSE_SH_SH_SH_SH_SI_SI_SI_SF_SJ_SL_T5_T6_T7_
=========     by thread (96,0,0) in block (0,0,0)
=========     Address 0x7f2ecec0c3fc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x7f31b654d718]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x7f310ed2102b]
=========                in /home/vic/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x7f310ed6a820]
=========                in /home/vic/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:void raft::distance::fusedL2NNImpl<float, cub::KeyValuePair<int, float>, int, int=2, ML::HDBSCAN::FixConnectivitiesRedOp<int, float>, ML::HDBSCAN::FixConnectivitiesRedOp<int, float>>(int*, float const *, float const , float const , float const , float, float const *, float const *, int*, int, int=2, bool, bool, CUstream_st*) [0x7f30397e6add]

cc @dantegd @cjnolet
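
The sanitizer output points at an out-of-bounds read in the `fusedL2NNkernel` launch, while the crash itself was reported by a later thrust `transform`. That is consistent with CUDA reporting kernel errors asynchronously, at the next synchronization point. The exact command Victor used is not shown in this thread, so the following is only a sketch of how a single test could be run under compute-sanitizer's memcheck tool. The helper name `sanitizer_command` and the pytest flags are assumptions.

```python
import shlex
import sys


def sanitizer_command(node_id, tool="memcheck"):
    """Build a compute-sanitizer command line for one pytest node ID."""
    return [
        "compute-sanitizer",
        "--tool", tool,
        sys.executable, "-m", "pytest", "-x", "-s", node_id,
    ]


cmd = sanitizer_command(
    "cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns"
    "[knn-eom-0-False-25-0.0-15-noisy_circles-1000]"
)

# Shell-quoted form, suitable for pasting into a terminal.
print(" ".join(shlex.quote(part) for part in cmd))
```

Running one node ID at a time keeps the sanitizer overhead manageable, since memcheck slows kernels down considerably.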

divyegala · 2021-07-06T17:19:33Z

solved in #4024 and #4025

wphicks · 2021-07-13T18:23:50Z

This appears to have returned, as shown in https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=centos7,PYTHON=3.7/1127/consoleText.

Error text below:

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-True-25-0.0-15-noisy_circles-1000] Label prop iterations: 9
Label prop iterations: 4
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2457,178,114,23,444,1202
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted
Current thread 0x00007f099644f740 (most recent call first):
  File "/workspace/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
  File "/workspace/python/cuml/test/test_hdbscan.py", line 323 in test_hdbscan_cluster_patterns
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/python.py", line 1641 in runtest
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 311 in from_call
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 323 in _main
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/config/__init__.py", line 163 in main
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/opt/conda/envs/rapids/bin/pytest", line 11 in <module>
ci/gpu/build.sh: line 249: 13175 Aborted                 (core dumped) pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask --ignore=cuml/raft --cov-config=.coveragerc --cov=cuml --cov-report=xml:${WORKSPACE}/python/cuml/cuml-coverage.xml --cov-report term

dantegd · 2021-07-13T19:54:47Z

cc @divyegala for the above post by @wphicks

divyegala · 2021-07-15T16:13:43Z

Closed by #4052

dantegd added the bug and tests labels Jun 17, 2021

dantegd assigned divyegala Jun 22, 2021

cjnolet mentioned this issue Jun 23, 2021 in [TASK] Post HDBSCAN merge tasks #3879

divyegala closed this as completed Jul 6, 2021

dantegd reopened this Jul 13, 2021

divyegala closed this as completed Jul 15, 2021
