[BUG] Importing cuml causes all Dask partitions to associate with GPU 0 #5206

Open
hcho3 opened this issue Feb 6, 2023 · 2 comments
Labels: bug (Something isn't working), Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features)

hcho3 (Contributor) commented Feb 6, 2023

Describe the bug
On a LocalCUDACluster with multiple GPUs, all Dask partitions end up allocated to GPU 0, causing XGBoost to error out. Oddly, removing import cuml fixes the problem.

Steps/Code to reproduce bug
Run this Python script:

import dask_cudf
from dask import array as da
from dask import dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import xgboost as xgb

# Un-comment this line to observe the difference in behavior
# import cuml

if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            m = 100000
            n = 100
            X = da.random.random(size=(m, n), chunks=10000)
            y = da.random.random(size=(m, ), chunks=10000)
            X = dask_cudf.from_dask_dataframe(dd.from_dask_array(X))
            y = dask_cudf.from_dask_dataframe(dd.from_dask_array(y))
            params = {
                "verbosity": 2,
                "tree_method": "gpu_hist"
            }
            dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
            output = xgb.dask.train(client, params, dtrain, num_boost_round=4, evals=[(dtrain, 'train')])

With import cuml commented out, the Python program runs successfully:

[06:31:12] task [xgboost.dask-0]:tcp://127.0.0.1:36715 got new rank 0
[06:31:12] task [xgboost.dask-1]:tcp://127.0.0.1:40349 got new rank 1
[06:31:12] task [xgboost.dask-2]:tcp://127.0.0.1:39421 got new rank 2
[06:31:12] task [xgboost.dask-3]:tcp://127.0.0.1:44467 got new rank 3
[0]     train-rmse:0.28842
[1]     train-rmse:0.28799
[2]     train-rmse:0.28754
[3]     train-rmse:0.28705
{'train': OrderedDict([('rmse', [0.28842118658057203, 0.2879935986895539, 0.2875382048036173, 0.28704809112503155])])}

If import cuml is uncommented, we get an error:

xgboost.core.XGBoostError: [06:33:40] src/collective/nccl_device_communicator.cuh:49:
Check failed: n_uniques == world (1 vs. 4) :
Multiple processes within communication group running on same CUDA device is not supported. 

This is because all the Dask partitions were allocated to GPU 0. See the output from nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     42613      C   python                           2022MiB |
|    0   N/A  N/A     42661      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42664      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42669      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42672      C   ...da/envs/rapids/bin/python      306MiB |
+-----------------------------------------------------------------------------+
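
A quick way to confirm where each worker's CUDA context actually lands is to query every worker from inside the with Client(cluster) as client: block. The snippet below is a sketch, not part of the original report; it assumes cupy is available (it ships with the RAPIDS images).

import os

def active_device():
    # Runs on each Dask worker and reports which physical GPU its CUDA context uses.
    import cupy

    return {
        # dask-cuda assigns each worker its own CUDA_VISIBLE_DEVICES ordering
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # PCI bus id of the device currently in use; with the bug present, all
        # workers would presumably report the same bus id (that of GPU 0)
        "pci_bus_id": cupy.cuda.Device().pci_bus_id,
    }

# one dict per worker, keyed by worker address
print(client.run(active_device))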

Expected behavior
Importing cuML should not affect how Dask partitions are placed across GPUs.

Environment details (please complete the following information):

  • Environment location: GCP, Latest 22.12 Docker image
hcho3 added the bug and ? - Needs Triage labels on Feb 6, 2023
hcho3 added the Dask / cuml.dask label and removed the ? - Needs Triage label on Feb 6, 2023
hcho3 (Contributor, Author) commented Feb 6, 2023

The bug also exists in the latest nightly Docker image (rapidsai/rapidsai-core-nightly:23.02-cuda11.5-base-ubuntu20.04-py3.9).

hcho3 (Contributor, Author) commented Feb 6, 2023

The bug was probably introduced in 22.12: the problem does not occur with the 22.10 Docker image (nvcr.io/nvidia/rapidsai/rapidsai-core:22.10-cuda11.5-base-ubuntu20.04-py3.9).

raydouglass pushed a commit that referenced this issue Feb 7, 2023
jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023