Add support for CudaAsyncMemoryResource
#566
Changes from 8 commits
@@ -116,10 +116,18 @@ class LocalCUDACluster(LocalCluster):
         .. warning::
             Managed memory is currently incompatible with NVLink, trying to enable
             both will result in an exception.
+    rmm_async: bool, default False
+        Initialize each worker with RMM and set it to use RMM's asynchronous
+        allocator. See ``rmm.mr.CudaAsyncMemoryResource`` for more info.
+
+        .. note::
+            The asynchronous allocator requires CUDA Toolkit 11.2 or newer. It is
+            also incompatible with RMM pools and managed memory; trying to enable
+            both will result in an exception.
     rmm_log_directory: str
         Directory to write per-worker RMM log files to; the client and scheduler
-        are not logged here. Logging will only be enabled if ``rmm_pool_size`` or
-        ``rmm_managed_memory`` are specified.
+        are not logged here. Logging will only be enabled if ``rmm_pool_size``,
+        ``rmm_managed_memory``, or ``rmm_async`` are specified.
     jit_unspill: bool
         If ``True``, enable just-in-time unspilling. This is experimental and doesn't
         support memory spilling to disk. Please see ``proxy_object.ProxyObject`` and
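To make the new docstring concrete: "set it to use RMM's asynchronous allocator" boils down to installing a ``CudaAsyncMemoryResource`` as the current device resource on each worker, using the rmm APIs named above. A minimal sketch, assuming rmm is installed and the GPU/driver supports CUDA Toolkit 11.2+ (``setup_async_allocator`` is a hypothetical helper, not part of dask-cuda):

```python
def setup_async_allocator():
    """Point RMM at the cudaMallocAsync-backed memory resource (sketch)."""
    try:
        import rmm
    except ImportError:
        # rmm not installed in this environment; leave the default allocator alone
        return False
    # Create a stream-ordered (cudaMallocAsync) resource and make it current
    mr = rmm.mr.CudaAsyncMemoryResource()
    rmm.mr.set_current_device_resource(mr)
    return True
```

With ``rmm_async=True``, dask-cuda arranges for the equivalent of this to run in each worker process at startup.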
@@ -143,10 +151,12 @@ class LocalCUDACluster(LocalCluster):
         If ``enable_infiniband`` or ``enable_nvlink`` is ``True`` and protocol is not
         ``"ucx"``.
     ValueError
-        If ``ucx_net_devices`` is an empty string, or if it is ``"auto"`` and UCX-Py is
-        not installed, or if it is ``"auto"`` and ``enable_infiniband=False``, or UCX-Py
-        wasn't compiled with hwloc support, or both RMM managed memory and
-        NVLink are enabled.
+        If ``ucx_net_devices=""``, if NVLink and RMM managed memory are
+        both enabled, if RMM pools / managed memory and the asynchronous allocator are
+        both enabled, or if ``ucx_net_devices="auto"`` and:
+
+        - UCX-Py is not installed or wasn't compiled with hwloc support, or
+        - ``enable_infiniband=False``

Reviewer: Seems like you accidentally removed a white space from the beginning of each line in this block, which probably will result in a complaint when parsing the docstrings.

Author: Good catch!

     See Also
     --------
@@ -169,6 +179,7 @@ def __init__(
         ucx_net_devices=None,
         rmm_pool_size=None,
         rmm_managed_memory=False,
+        rmm_async=False,
         rmm_log_directory=None,
         jit_unspill=None,
         log_spilling=False,
@@ -201,6 +212,7 @@ def __init__(

         self.rmm_pool_size = rmm_pool_size
         self.rmm_managed_memory = rmm_managed_memory
+        self.rmm_async = rmm_async
         if rmm_pool_size is not None or rmm_managed_memory:
             try:
                 import rmm  # noqa F401
@@ -210,6 +222,11 @@ def __init__( | |||||
"is not available. For installation instructions, please " | ||||||
"see https://github.com/rapidsai/rmm" | ||||||
) # pragma: no cover | ||||||
if self.rmm_async: | ||||||
raise ValueError( | ||||||
"""RMM pool and managed memory are incompatible with asynchronous | ||||||
allocator""" | ||||||
) | ||||||
if self.rmm_pool_size is not None: | ||||||
self.rmm_pool_size = parse_bytes(self.rmm_pool_size) | ||||||
else: | ||||||
|
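The check added above makes the pool/managed-memory options and ``rmm_async`` mutually exclusive. The logic can be sketched as a standalone helper (``check_rmm_options`` is hypothetical, for illustration only; the real check lives inside ``LocalCUDACluster.__init__``):

```python
def check_rmm_options(rmm_pool_size=None, rmm_managed_memory=False, rmm_async=False):
    """Reject option combinations the async allocator cannot support."""
    # A memory pool or managed memory cannot be combined with the
    # cudaMallocAsync-backed allocator, so enabling both is an error.
    if rmm_async and (rmm_pool_size is not None or rmm_managed_memory):
        raise ValueError(
            "RMM pool and managed memory are incompatible with "
            "asynchronous allocator"
        )

# Allowed: the async allocator alone, or pool/managed memory without it
check_rmm_options(rmm_async=True)
check_rmm_options(rmm_pool_size="1GB", rmm_managed_memory=True)

# Rejected: a pool (or managed memory) together with the async allocator
try:
    check_rmm_options(rmm_pool_size="1GB", rmm_async=True)
except ValueError as e:
    print(e)
```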
@@ -332,6 +349,7 @@ def new_worker_spec(self): | |||||
RMMSetup( | ||||||
self.rmm_pool_size, | ||||||
self.rmm_managed_memory, | ||||||
self.rmm_async, | ||||||
self.rmm_log_directory, | ||||||
), | ||||||
}, | ||||||
|
Reviewer: Since this is the CLI, is this really going to raise an exception or fail with an error instead? I'm not really sure what happens when an exception is raised in the CUDAWorker class when running the CLI tool.

Author: It looks like this fails with an error - I mostly just copied the phrasing used to describe the behavior of using NVLink + managed memory (dask-cuda/dask_cuda/cli/dask_cuda_worker.py, lines 112 to 114 in 1526017). I can change this here and in #561 if it would make more sense to say the CLI will fail outright in these cases.
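For reference, end-to-end usage once this change lands would look roughly like the following. This is a sketch, not tested here: it assumes dask-cuda with this PR applied, distributed, and a GPU with CUDA Toolkit 11.2+, so the imports are guarded and ``start_async_cluster`` is a hypothetical wrapper:

```python
def start_async_cluster():
    """Start a local CUDA cluster whose workers use RMM's async allocator (sketch)."""
    try:
        from dask.distributed import Client
        from dask_cuda import LocalCUDACluster
    except ImportError:
        return None  # dask-cuda / distributed not installed in this environment
    # rmm_async is the new keyword added by this PR; combining it with
    # rmm_pool_size or rmm_managed_memory raises ValueError (see above).
    cluster = LocalCUDACluster(rmm_async=True)
    return Client(cluster)
```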