Unable to create LocalCUDACluster #1119

randerzander · 2023-02-16T17:24:36Z

With the latest nightlies:

conda list | grep dask
dask                      2023.1.1           pyhd8ed1ab_0    conda-forge
dask-core                 2023.1.1           pyhd8ed1ab_0    conda-forge
dask-cuda                 23.04.00a       py310_230215_g8134e6b_25    rapidsai-nightly
dask-cudf                 23.02.00a230209 cuda_11_py310_gac60656bc9_313    rapidsai-nightly
dask-sql                  2023.2.0+15.gf265f58           dev_0    <develop>

Creating a LocalCUDACluster fails:

from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()

Trace:

2023-02-16 09:22:58,422 - distributed.deploy.spec - WARNING - Cluster closed without starting up
Traceback (most recent call last):
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/spec.py", line 319, in _start
    self.scheduler = cls(**self.scheduler_spec.get("options", {}))
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/scheduler.py", line 3662, in __init__
    ServerNode.__init__(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/core.py", line 348, in __init__
    self.monitor = SystemMonitor()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/system_monitor.py", line 96, in __init__
    gpu_extra = nvml.one_time()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/diagnostics/nvml.py", line 336, in one_time
    "name": _get_name(h),
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/diagnostics/nvml.py", line 319, in _get_name
    return pynvml.nvmlDeviceGetName(h).decode()
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nfs/rgelhausen/projects/presto-test/simple.py", line 4, in <module>
    cluster = LocalCUDACluster()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/dask_cuda/local_cuda_cluster.py", line 336, in __init__
    super().__init__(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/local.py", line 253, in __init__
    super().__init__(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/spec.py", line 286, in __init__
    self.sync(self._start)
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync
    return sync(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync
    raise exc.with_traceback(tb)
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/utils.py", line 378, in f
    result = yield future
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/tornado/gen.py", line 769, in run
    value = future.result()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/spec.py", line 330, in _start
    raise RuntimeError(f"Cluster failed to start: {e}") from e
RuntimeError: Cluster failed to start: 'str' object has no attribute 'decode'

wence- · 2023-02-16T17:34:04Z

This should have been fixed by dask/distributed#7544 and #1118.

Workaround: downgrade pynvml.
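For context, the traceback above fails at pynvml.nvmlDeviceGetName(h).decode(), which suggests the installed pynvml returns a str where the calling code expected bytes. The sketch below shows one way to tolerate both return types. It is only an illustration: device_name_to_str is a hypothetical helper, not necessarily how the linked fixes handle it, and it is fed literal values here so that it runs without a GPU or pynvml installed.

```python
def device_name_to_str(name):
    """Return a device name as str, whether the binding gave bytes or str.

    Hypothetical helper for illustration; not taken from distributed or pynvml.
    """
    if isinstance(name, bytes):
        # Older behaviour implied by the traceback: bytes that need decoding.
        return name.decode()
    # Newer behaviour implied by the traceback: already a str, return unchanged.
    return name


if __name__ == "__main__":
    # Literal stand-ins for what pynvml.nvmlDeviceGetName(handle) might return.
    print(device_name_to_str(b"Example GPU"))
    print(device_name_to_str("Example GPU"))
```

With real hardware, the value passed in would come from pynvml.nvmlDeviceGetName(handle), as in the failing line of the traceback.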

randerzander · 2023-02-16T17:43:15Z

mamba install -c conda-forge pynvml=11.4.1

Fixed this for me.

wence- · 2023-02-17T17:46:20Z

This was fixed in #1118. One remaining problem: if you just run mamba install -c rapidsai-nightly dask-cuda=23.04* you don't get a distributed nightly, as far as I can tell (see the discussion in the distributed issue), so things still break. @jakirkham what would be the right way to fix this?

jakirkham · 2023-02-18T01:42:18Z

One needs to add the dask/label/dev channel for Dask nightlies.
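To confirm which builds actually ended up in the environment after changing channels, something like the following can be run from Python. It is a minimal sketch using only the standard library; installed_versions is a hypothetical helper name, and the package list is an assumption based on the packages mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed list of packages relevant to this issue.
    packages = ["dask", "distributed", "dask-cuda", "pynvml"]
    for pkg, ver in installed_versions(packages).items():
        print(f"{pkg}: {ver or 'not installed'}")
```

This reports the same kind of information as the conda list output at the top of the issue, but from inside the interpreter that will create the cluster.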

wence- · 2023-03-01T18:42:34Z

I think this is now fixed.

wence- closed this as completed Mar 1, 2023
