[BUG] Importing cuml causes all Dask partitions to associate with GPU 0 #5206

Open
hcho3 opened this issue Feb 6, 2023 · 2 comments
Labels: bug (Something isn't working), Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features)

hcho3 (Contributor) commented Feb 6, 2023

Describe the bug
On a LocalCUDACluster with multiple GPUs, all Dask partitions end up allocated to GPU 0, causing XGBoost to error out. Oddly, removing import cuml fixes the problem.

Steps/Code to reproduce bug
Run this Python script:

import dask_cudf
from dask import array as da
from dask import dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import xgboost as xgb

# Un-comment this line to observe the difference in behavior
# import cuml

if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            m = 100000
            n = 100
            X = da.random.random(size=(m, n), chunks=10000)
            y = da.random.random(size=(m, ), chunks=10000)
            X = dask_cudf.from_dask_dataframe(dd.from_dask_array(X))
            y = dask_cudf.from_dask_dataframe(dd.from_dask_array(y))
            params = {
                "verbosity": 2,
                "tree_method": "gpu_hist"
            }
            dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
            output = xgb.dask.train(client, params, dtrain, num_boost_round=4, evals=[(dtrain, 'train')])

With import cuml commented out, the Python program runs successfully:

[06:31:12] task [xgboost.dask-0]:tcp://127.0.0.1:36715 got new rank 0
[06:31:12] task [xgboost.dask-1]:tcp://127.0.0.1:40349 got new rank 1
[06:31:12] task [xgboost.dask-2]:tcp://127.0.0.1:39421 got new rank 2
[06:31:12] task [xgboost.dask-3]:tcp://127.0.0.1:44467 got new rank 3
[0]     train-rmse:0.28842
[1]     train-rmse:0.28799
[2]     train-rmse:0.28754
[3]     train-rmse:0.28705
{'train': OrderedDict([('rmse', [0.28842118658057203, 0.2879935986895539, 0.2875382048036173, 0.28704809112503155])])}

If import cuml is uncommented, we get an error:

xgboost.core.XGBoostError: [06:33:40] src/collective/nccl_device_communicator.cuh:49:
Check failed: n_uniques == world (1 vs. 4) :
Multiple processes within communication group running on same CUDA device is not supported. 

This is because all the Dask partitions were allocated to GPU 0. See the output from nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     42613      C   python                           2022MiB |
|    0   N/A  N/A     42661      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42664      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42669      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42672      C   ...da/envs/rapids/bin/python      306MiB |
+-----------------------------------------------------------------------------+
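
A quick way to confirm where each worker's CUDA context actually lands is to query every worker from inside the with Client(cluster) as client: block. The snippet below is a sketch, not part of the original report; it assumes cupy is available (it ships with the RAPIDS images).

import os

def active_device():
    # Runs on each Dask worker and reports which physical GPU its CUDA context uses.
    import cupy

    return {
        # dask-cuda assigns each worker its own CUDA_VISIBLE_DEVICES ordering
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # PCI bus id of the device currently in use; with the bug present, all
        # workers would presumably report the same bus id (that of GPU 0)
        "pci_bus_id": cupy.cuda.Device().pci_bus_id,
    }

# one dict per worker, keyed by worker address
print(client.run(active_device))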

Expected behavior
Importing cuML should not affect how Dask partitions are placed across GPUs.

Environment details (please complete the following information):

  • Environment location: GCP, Latest 22.12 Docker image
hcho3 added the bug and ? - Needs Triage labels on Feb 6, 2023
hcho3 added the Dask / cuml.dask label and removed the ? - Needs Triage label on Feb 6, 2023
hcho3 (Contributor, Author) commented Feb 6, 2023

The bug also exists in the latest nightly Docker image (rapidsai/rapidsai-core-nightly:23.02-cuda11.5-base-ubuntu20.04-py3.9).

hcho3 (Contributor, Author) commented Feb 6, 2023

The bug was probably introduced in 22.12: the problem does not occur with the 22.10 Docker image (nvcr.io/nvidia/rapidsai/rapidsai-core:22.10-cuda11.5-base-ubuntu20.04-py3.9).

raydouglass pushed a commit that referenced this issue Feb 7, 2023
jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023