[BUG] Sporadic pytest crash in hdbscan #3997
Comments
cc @Salonijain27 @cjnolet @divyegala I updated the issue with the full error and a link to the logs.
Interesting. Looks like it's in a transform, which is a standard for-each kernel at best.
Update: It happened again in PR #4002 in this log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1059/CUDA=11.2,GPU_LABEL=gpu,OS=ubuntu18.04,PYTHON=3.8/consoleText
Full error, in case the logs are no longer available:
cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns[knn-eom-0-False-25-0.0-15-noisy_moons-1000]
Label prop iterations: 12
Label prop iterations: 6
Label prop iterations: 3
Label prop iterations: 2
Iterations: 4
2763,118,91,18,349,977
terminate called after throwing an instance of 'thrust::system::system_error'
what(): transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Fatal Python error: Aborted
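Since the crash is sporadic, one way to estimate how often it reproduces locally is to run the failing test repeatedly, each time in a fresh process, and count the non-zero exits. The following is only a minimal sketch, not something taken from this thread: the helper names `pytest_cmd` and `count_failures` are hypothetical, and it assumes pytest and a cuML build are available in the current environment. The test ID is copied from the log quoted here.

```python
import os
import subprocess
import sys

# Test ID copied from the failure log quoted in this thread.
TEST_ID = (
    "cuml/test/test_hdbscan.py::test_hdbscan_cluster_patterns"
    "[knn-eom-0-False-25-0.0-15-noisy_moons-1000]"
)


def pytest_cmd(test_id):
    """Build a pytest command line for one test ID (hypothetical helper)."""
    return [sys.executable, "-m", "pytest", "-x", test_id]


def count_failures(cmd, runs, extra_env=None):
    """Run `cmd` `runs` times in fresh processes and count non-zero exits.

    A process killed by a signal (for example the abort in the log)
    also reports a non-zero return code, so it is counted as a failure.
    """
    env = dict(os.environ, **(extra_env or {}))
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

For example, `count_failures(pytest_cmd(TEST_ID), runs=50, extra_env={"CUDA_LAUNCH_BLOCKING": "1"})` would run the test 50 times with synchronous kernel launches, which can make the reported failure site more precise. Whether that setting changes how often this particular crash appears is not known from the thread.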
@viclafargue was able to reproduce the
This appears to have returned, as shown in https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=centos7,PYTHON=3.7/1127/consoleText. Error text below:
cc @divyegala for the above post by @wphicks
Closed by #4052
Describe the bug
Errors in a couple of hdbscan tests:
Found the logs:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1048/CUDA=11.2,GPU_LABEL=gpu,OS=ubuntu18.04,PYTHON=3.8/
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/1048/CUDA=11.2,GPU_LABEL=gpu,OS=centos7,PYTHON=3.8/
Update 06/23/21:
In an unrelated PR (#4001) the A100-40GB job failed with:
Log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=ubuntu16.04,PYTHON=3.7/1063/