[BUG] Intermittent CI failures with xgboost #4729

Open
RAMitchell opened this issue May 9, 2022 · 3 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working), inactive-30d, inactive-90d

Comments

@RAMitchell
Contributor

Sometimes we observe failures in CI of the following type:

cuml/tests/explainer/test_gpu_treeshap.py::test_with_hypothesis [85d2e7739a0d:2816 :0:3362] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55f54f237ba8)
==== backtrace (tid:   3362) ====
 0  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7fa5840f93f5]
 1  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7fa5840f9791]
 2  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d962) [0x7fa5840f9962]
 3  /usr/lib64/libc.so.6(+0x36400) [0x7fa6aca2c400]
 4  /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libnccl.so.2(+0x5812e) [0x7fa5be6f912e]
 5  /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libnccl.so.2(+0x43949) [0x7fa5be6e4949]
 6  /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fa6ad6dcea5]
 7  /usr/lib64/libc.so.6(clone+0x6d) [0x7fa6acaf4b0d]
=================================
Fatal Python error: Segmentation fault

Thread 0x00007fa446ffd700 (most recent call first):
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 306 in wait
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 558 in wait
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fa6adb02740 (most recent call first):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/core.py", line 1423 in __del__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/training.py", line 188 in train
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/sklearn.py", line 789 in fit
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/core.py", line 506 in inner_f
  File "/workspace/python/cuml/tests/explainer/test_gpu_treeshap.py", line 522 in learn_model
  File "/workspace/python/cuml/tests/explainer/test_gpu_treeshap.py", line 646 in shap_strategy
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/core.py", line 1450 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 874 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/lazy.py", line 156 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 874 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/collections.py", line 58 in <genexpr>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/collections.py", line 58 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 874 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 823 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 874 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/lazy.py", line 156 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 874 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 823 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 874 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/collections.py", line 58 in <genexpr>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/strategies/_internal/collections.py", line 58 in do_draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/data.py", line 878 in draw
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/core.py", line 622 in run
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/executors.py", line 47 in default_new_style_executor
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/core.py", line 664 in execute_once
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/core.py", line 726 in _execute_once_for_engine
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/engine.py", line 184 in __stoppable_test_function
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/engine.py", line 208 in test_function
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/engine.py", line 1055 in cached_test_function
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/engine.py", line 608 in generate_new_examples
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/engine.py", line 876 in _run
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/internal/conjecture/engine.py", line 470 in run
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/core.py", line 803 in run_engine
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/hypothesis/core.py", line 1206 in wrapped_test
  File "/workspace/python/cuml/tests/explainer/test_gpu_treeshap.py", line 686 in test_with_hypothesis
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/python.py", line 1761 in runtest
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 338 in from_call
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 322 in _main
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/opt/conda/envs/rapids/bin/pytest", line 11 in <module>
ci/gpu/build.sh: line 270:  2816 Segmentation fault      (core dumped) pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/tests/dask --ignore=cuml/raft --cov-config=.coveragerc --cov=cuml --cov-report=xml:${WORKSPACE}/python/cuml/cuml-coverage.xml --cov-report term

The segfault appears to happen when XGBoost calls NCCL: frames 4 and 5 of the backtrace are inside libnccl.so.2. When XGBoost is built with NCCL support, it initialises NCCL even for a single-GPU problem.
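To make the reading of the backtrace above easier to repeat on other CI logs, here is a minimal sketch that extracts the shared library for each native frame. The helper names (`parse_frames`, `collapse_libraries`) and the regular expression are illustrative assumptions about the log format shown in this issue, not part of cuML, XGBoost, or UCX. The sample lines are frames 2 to 7 copied from the log above.

```python
import os
import re
from itertools import groupby

# Assumed frame format, taken from the log in this issue, e.g.
#  4  /path/to/libnccl.so.2(+0x5812e) [0x7fa5be6f912e]
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<path>\S+?)\((?P<symbol>[^)]*)\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]\s*$"
)

SAMPLE = """\
 2  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d962) [0x7fa5840f9962]
 3  /usr/lib64/libc.so.6(+0x36400) [0x7fa6aca2c400]
 4  /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libnccl.so.2(+0x5812e) [0x7fa5be6f912e]
 5  /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libnccl.so.2(+0x43949) [0x7fa5be6e4949]
 6  /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fa6ad6dcea5]
 7  /usr/lib64/libc.so.6(clone+0x6d) [0x7fa6acaf4b0d]
"""


def parse_frames(backtrace):
    """Return one dict per native frame line; non-frame lines are skipped."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "index": int(match["index"]),
                "library": os.path.basename(match["path"]),
                "symbol": match["symbol"],
            })
    return frames


def collapse_libraries(frames):
    """List library names in stack order, merging consecutive repeats."""
    return [name for name, _ in groupby(f["library"] for f in frames)]
```

Calling `collapse_libraries(parse_frames(SAMPLE))` lists the libraries in stack order, so the libnccl.so.2 frames can be spotted without reading the full paths.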

The failure can occur in any of the tests in cuml/tests/explainer/test_gpu_treeshap.py that call XGBoost. It is nondeterministic and not tied to any single test.

It has been observed in both the CentOS and Ubuntu CI environments.

The type of GPU running the tests on CI does not seem to be a factor: on V100-32GB GPUs the failure sometimes occurs and sometimes does not.

The failures have become more frequent since #4671 increased the number of XGBoost tests.

I have been unable to reproduce this locally, even after running the tests hundreds of times in the same Docker container as a failing CI run: gpuci/rapidsai:22.06-cuda11.0-devel-centos7-py3.8.
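For anyone else attempting a reproduction, a repeat-run loop along these lines may help. This is only a sketch and not necessarily how the runs above were done. The function names and the `PYTEST_CMD` default are assumptions, and it relies on the POSIX behaviour of `subprocess`, where a child killed by a signal reports a negative return code (-11 for SIGSEGV).

```python
import signal
import subprocess
import sys
from collections import Counter

# Assumed invocation: run only the affected test file, quietly, stopping
# at the first failure. Adjust the path to the local checkout.
PYTEST_CMD = [
    sys.executable, "-m", "pytest",
    "cuml/tests/explainer/test_gpu_treeshap.py", "-x", "-q",
]


def run_repeatedly(cmd, runs, env=None, timeout=None):
    """Run cmd `runs` times in fresh processes and return the exit codes."""
    codes = []
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes.append(proc.returncode)
    return codes


def summarize(codes):
    """Count failed runs and group signal-terminated runs by signal name."""
    by_signal = Counter(signal.Signals(-code).name for code in codes if code < 0)
    return {
        "runs": len(codes),
        "failed": sum(code != 0 for code in codes),
        "by_signal": dict(by_signal),
    }
```

For example, `summarize(run_repeatedly(PYTEST_CMD, runs=100))` would report how many of 100 fresh pytest processes exited abnormally and how many of those were killed by SIGSEGV. Each run starts a new interpreter, so state from one run cannot leak into the next.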

@RAMitchell RAMitchell added ? - Needs Triage Need team to review and classify bug Something isn't working labels May 9, 2022
@trivialfis
Member

I will look into this in a few days.

@github-actions

github-actions bot commented Jun 8, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Sep 6, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.


2 participants