[BUG] Sporadic pytest crash in hdbscan #3997

Closed
dantegd opened this issue Jun 17, 2021 · 8 comments

@dantegd
Member

dantegd commented Jun 17, 2021

Describe the bug
Errors in a couple of hdbscan tests:

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_circles-1000]
...
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-True-25-0.0-15-noisy_moons-1000]
...
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Found the logs:

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1048/CUDA=11.2,GPU_LABEL=gpu,OS=ubuntu18.04,PYTHON=3.8/

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1048/CUDA=11.2,GPU_LABEL=gpu,OS=centos7,PYTHON=3.8/

Full error, in case the logs are no longer available:
cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_circles-1000] Label prop iterations: 9
Label prop iterations: 4
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2933,122,82,24,434,805
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted

Current thread 0x00007f5bf449d740 (most recent call first):
File "/workspace/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
File "/workspace/python/cuml/test/test_hdbscan.py", line 189 in test_hdbscan_cluster_patterns
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 255 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
File "/opt/conda/envs/rapids/bin/pytest", line 11 in
ci/gpu/build.sh: line 249: 13498 Aborted (core dumped) pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask --ignore=cuml/raft --cov-
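
Since the crash aborts the whole pytest process rather than failing a single test, one way to chase a sporadic failure like this locally is to rerun the one test in a fresh subprocess until it exits abnormally. The following is a minimal sketch under that assumption; `run_until_failure` is a hypothetical helper and not part of cuML or the CI scripts.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def run_until_failure(cmd: Sequence[str], max_runs: int = 50) -> Optional[Tuple[int, int]]:
    """Run `cmd` up to `max_runs` times in fresh processes.

    Returns (attempt, returncode) for the first non-zero exit, or None if
    every run succeeded. On POSIX, a process killed by a signal (such as the
    abort seen in this issue) reports a negative return code.
    """
    for attempt in range(1, max_runs + 1):
        completed = subprocess.run(cmd)
        if completed.returncode != 0:
            return attempt, completed.returncode
    return None


# Node ID copied from the failing test reported in this issue.
TEST_ID = (
    "cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns"
    "[knn-eom-0-False-25-0.0-15-noisy_circles-1000]"
)

# Assumed invocation: run only that test with the current interpreter.
PYTEST_CMD = [sys.executable, "-m", "pytest", "-v", "-s", TEST_ID]
```

Calling `run_until_failure(PYTEST_CMD, max_runs=20)` from the `python` directory of a cuML checkout would then stop at the first crashing run and report which attempt it was.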

Update 06/23/21:

In an unrelated PR (#4001) the A100-40GB job failed with:

sk_agg = HDBSCAN(algorithm='generic', approx_min_span_tree=False, gen_min_span_tree=True,
        min_cluster_size=25, min_samples=15)
cuml_agg = HDBSCAN(), digits = 25
    def assert_cluster_counts(sk_agg, cuml_agg, digits=25):
        sk_unique, sk_counts = np.unique(sk_agg.labels_, return_counts=True)
        sk_counts = np.sort(sk_counts)
        cu_unique, cu_counts = cp.unique(cuml_agg.labels_, return_counts=True)
        cu_counts = cp.sort(cu_counts).get()
>       np.testing.assert_almost_equal(sk_counts, cu_counts, decimal=-1 * digits)
E       AssertionError: 
E       Arrays are not almost equal to -25 decimals
E       
E       (shapes (2,), (3,) mismatch)
E        x: array([500, 500])
E        y: array([  6, 497, 497])

Log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=ubuntu16.04,PYTHON=3.7/1063/
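
For context on why this second failure reports a shape mismatch rather than a value mismatch: with `decimal=-25`, `np.testing.assert_almost_equal` accepts any element-wise difference below `1.5 * 10**25`, so in practice the assertion can only fail when the two label sets contain a different number of clusters. A small sketch of that behaviour, where `counts_match` is a hypothetical wrapper and the first pair of counts is made up for illustration:

```python
import numpy as np


def counts_match(expected, actual, digits=25):
    """Return True if the tolerance check used by the test would pass."""
    try:
        np.testing.assert_almost_equal(
            np.sort(np.asarray(expected)),
            np.sort(np.asarray(actual)),
            decimal=-1 * digits,
        )
    except AssertionError:
        return False
    return True


# Same number of clusters, different sizes (illustrative values): passes,
# because |x - y| is far below the 1.5e25 tolerance.
print(counts_match([500, 500], [497, 503]))  # prints True

# Counts from the A100 failure above: fails on the (2,) vs (3,) shape mismatch.
print(counts_match([500, 500], [6, 497, 497]))  # prints False
```

Read this way, the A100 failure suggests that cuML produced one extra small cluster of 6 points, not that the cluster sizes drifted slightly.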

@dantegd dantegd added bug Something isn't working tests Unit testing for project labels Jun 17, 2021
@dantegd
Member Author

dantegd commented Jun 17, 2021

cc @Salonijain27 @cjnolet @divyegala I updated the issue with the full error and link to the logs

@divyegala
Member

Interesting. It looks like the crash surfaces in a transform, which is a standard for-each kernel at best.

@dantegd
Member Author

dantegd commented Jun 22, 2021

Update: It happened again in PR #4002 in this log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1059/CUDA=11.2,GPU_LABEL=gpu,OS=ubuntu18.04,PYTHON=3.8/consoleText

Full error, in case the logs are no longer available:
cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_moons-1000] Label prop iterations: 12
Label prop iterations: 6
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2763,118,91,18,349,977
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted

Current thread 0x00007fe7b94f1740 (most recent call first):
File "/workspace/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
File "/workspace/python/cuml/test/test_hdbscan.py", line 195 in test_hdbscan_cluster_patterns
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 255 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 84 in
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
File "/opt/conda/envs/rapids/bin/pytest", line 11 in

@divyegala
Member

@viclafargue was able to reproduce the thrust::transform issue on his local machine, and I just saw it as well while solving the A100 bug. Here's a snippet of Victor running compute-sanitizer:

========= Invalid __global__ read of size 4 bytes
=========     at 0x7580 in _ZN4raft8distance15fusedL2NNkernelIfN3cub12KeyValuePairIifEEiLb1ENS_6linalg12KernelPolicyIfLi2ELi32ELi4ELi4ELi16ELi16EEEN2ML7HDBSCAN22FixConnectivitiesRedOpIifEESB_ZNS0_13fusedL2NNImplIfS4_iLi2ESB_SB_EEvPT0_PKT_SH_SH_SH_T1_SI_SI_PiT3_T4_bbP11CUstream_stEUlRfSO_SO_E_ZSC_IfS4_iLi2ESB_SB_EvSE_SH_SH_SH_SH_SI_SI_SI_SJ_SK_SL_bbSN_EUlfiE_EEvSE_SH_SH_SH_SH_SI_SI_SI_SF_SJ_SL_T5_T6_T7_
=========     by thread (96,0,0) in block (0,0,0)
=========     Address 0x7f2ecec0c3fc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x7f31b654d718]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x7f310ed2102b]
=========                in /home/vic/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x7f310ed6a820]
=========                in /home/vic/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:void raft::distance::fusedL2NNImpl<float, cub::KeyValuePair<int, float>, int, int=2, ML::HDBSCAN::FixConnectivitiesRedOp<int, float>, ML::HDBSCAN::FixConnectivitiesRedOp<int, float>>(int*, float const *, float const , float const , float const , float, float const *, float const *, int*, int, int=2, bool, bool, CUstream_st*) [0x7f30397e6add]

cc @dantegd @cjnolet
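
The exact command Victor ran is not shown in this thread. As a rough sketch, a memcheck run of a single test could be assembled as below; `sanitizer_command` is a hypothetical helper, and the choice of the `memcheck` tool and of the test file is an assumption.

```python
import shlex
import sys


def sanitizer_command(test_id, tool="memcheck"):
    """Build an argv list that runs one pytest target under compute-sanitizer.

    The tool name and pytest arguments are assumptions for illustration;
    they are not taken from the run quoted above.
    """
    return [
        "compute-sanitizer", "--tool", tool,
        sys.executable, "-m", "pytest", "-x", test_id,
    ]


# Print a copy-pasteable shell command for the HDBSCAN test module.
print(shlex.join(sanitizer_command("cuml/test/test_hdbscan.py")))
```

Unlike the `thrust::system::system_error` raised at the next synchronization point, the sanitizer report names the kernel that performed the invalid read, which here is `fusedL2NNkernel` and not the transform.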

@divyegala
Member

solved in #4024 and #4025

@wphicks
Contributor

wphicks commented Jul 13, 2021

This appears to have returned, as shown in https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=centos7,PYTHON=3.7/1127/consoleText.

Error text below:

cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-True-25-0.0-15-noisy_circles-1000] Label prop iterations: 9
Label prop iterations: 4
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2457,178,114,23,444,1202
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted
Current thread 0x00007f099644f740 (most recent call first):
  File "/workspace/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
  File "/workspace/python/cuml/test/test_hdbscan.py", line 323 in test_hdbscan_cluster_patterns
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/python.py", line 1641 in runtest
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 311 in from_call
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 323 in _main
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/config/__init__.py", line 163 in main
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/opt/conda/envs/rapids/bin/pytest", line 11 in <module>
ci/gpu/build.sh: line 249: 13175 Aborted                 (core dumped) pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask --ignore=cuml/raft --cov-config=.coveragerc --cov=cuml --cov-report=xml:${WORKSPACE}/python/cuml/cuml-coverage.xml --cov-report term

@dantegd dantegd reopened this Jul 13, 2021
@dantegd
Member Author

dantegd commented Jul 13, 2021

cc @divyegala for the above post by @wphicks

@divyegala
Member

Closed by #4052
