[BUG] Intermittent CI failures with xgboost #4729
Labels: ? - Needs Triage, bug, inactive-30d, inactive-90d
Sometimes we observe failures in CI of the following type:
The segfault appears to happen when xgboost calls NCCL. When built with NCCL, xgboost initialises NCCL even for a single-GPU problem.
The failure may occur in any of the tests that call xgboost in the file cuml/tests/explainer/test_gpu_treeshap.py. It is nondeterministic and not related to one single test. It has been observed to happen in both centos and ubuntu CI environments.
The type of GPU running the tests on CI does not seem to be a factor: on V100-32GB GPUs the failure occurs in some runs and not in others.
The failures have become more frequent since the number of xgboost tests was increased in #4671.
I have been unable to reproduce this locally using the same docker container as a failing CI run: gpuci/rapidsai:22.06-cuda11.0-devel-centos7-py3.8. I have run the tests hundreds of times.
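For anyone else trying to reproduce this, here is a minimal sketch of a loop that re-runs a command until it fails and reports whether it died from a signal such as SIGSEGV. The helper names `run_until_failure` and `describe_exit`, the pytest invocation in the comment, and the run count are assumptions for illustration, not the exact commands used above or in CI. `NCCL_DEBUG=INFO` is NCCL's standard logging switch and may help confirm whether the crash happens during NCCL initialisation.

```python
import os
import signal
import subprocess
import sys


def run_until_failure(cmd, max_runs=100, extra_env=None):
    """Run `cmd` up to `max_runs` times.

    Returns (run_number, returncode, combined_output) for the first run that
    exits non-zero, or None if every run succeeded.
    """
    # Start from the current environment and overlay any extra variables.
    env = dict(os.environ, **(extra_env or {}))
    for run_number in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            errors="replace",  # a crashing process may emit invalid bytes
        )
        if proc.returncode != 0:
            return run_number, proc.returncode, proc.stdout
    return None


def describe_exit(returncode):
    """Readable exit status; on POSIX a negative code means killed by a signal."""
    if returncode < 0:
        try:
            return signal.Signals(-returncode).name  # e.g. SIGSEGV
        except ValueError:
            return f"signal {-returncode}"
    return f"exit code {returncode}"


# Example use (assumed command, not the CI invocation):
# failure = run_until_failure(
#     [sys.executable, "-m", "pytest", "cuml/tests/explainer/test_gpu_treeshap.py"],
#     max_runs=500,
#     extra_env={"NCCL_DEBUG": "INFO"},
# )
# if failure is not None:
#     run_number, code, output = failure
#     print(f"run {run_number} failed with {describe_exit(code)}")
#     print(output[-2000:])  # tail of the failing run's log
```

Keeping the combined output of only the failing run means the NCCL log lines just before the crash are available without storing hundreds of passing logs.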