Add support for CudaAsyncMemoryResource
#566
Conversation
Some small questions/comments:
It looks like CUDA 11.2 has different behavior, in that the memory resource remains an

Thanks @pentschev! In that case, I'm a little interested in what's happening for the 11.0 tests, as I would expect a

EDIT: I'm assuming that the
Codecov Report
@@            Coverage Diff             @@
##           branch-0.20     #566      +/-   ##
===============================================
- Coverage        61.06%   60.59%   -0.48%
===============================================
  Files               22       22
  Lines             2571     2644      +73
===============================================
+ Hits              1570     1602      +32
- Misses            1001     1042      +41

Continue to review full report at Codecov.
It's possible, but some discussion seems to suggest that we don't want an async pool, but rather just to switch to the async allocator, as commented in #565 (comment). Could you update the PR here to reflect that? I think that will also make the assertion easier; you can probably then just do the same as https://github.com/rapidsai/rmm/blob/80bfeb2816845da5921551ae5c158e92427bec86/python/rmm/tests/test_rmm.py#L519-L521 .

Sure! Does this mean that async can be enabled for RMM pools and managed memory?

No, that means we're changing the allocator to the async allocator (which is neither the default nor the managed memory allocator), and without a pool. In summary, enabling the async allocator (something like
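For context, "switching to the async allocator" in RMM can be sketched as below. This is a hedged illustration, not the dask-cuda implementation: rmm.mr.CudaAsyncMemoryResource is only available with CUDA 11.2 or newer, and the import guard lets the snippet degrade gracefully in an environment without rmm or a GPU. The isinstance check mirrors the assertion style in the test_rmm.py lines linked above.

```python
# Hedged sketch: replace the current device resource with the async
# (cudaMallocAsync-backed) allocator, with no pool and no managed memory.
try:
    import rmm

    mr = rmm.mr.CudaAsyncMemoryResource()   # requires CUDA >= 11.2
    rmm.mr.set_current_device_resource(mr)  # all subsequent RMM allocations use it
    is_async = isinstance(
        rmm.mr.get_current_device_resource(), rmm.mr.CudaAsyncMemoryResource
    )
except Exception:
    # rmm not installed, no GPU, or CUDA driver too old for the async allocator
    is_async = False

print("async allocator active:", is_async)
```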
Updated to reflect this; some questions:
dask_cuda/cli/dask_cuda_worker.py (outdated diff)
-    incompatible with RMM pools and managed memory, and will be preferred over them
-    if both are enabled.""",
+    incompatible with RMM pools and managed memory, trying to enable both will
+    result in an exception.""",
Since this is the CLI, is this really going to raise an exception, or will it fail with an error instead? I'm not really sure what happens when an exception is raised in the CUDAWorker class when running the CLI tool.
It looks like this fails with an error - I mostly just copied the phrasing used to describe the behavior of using NVLink + managed memory:
dask-cuda/dask_cuda/cli/dask_cuda_worker.py, lines 112 to 114 in 1526017:
    "WARNING: managed memory is currently incompatible with NVLink, "
    "trying to enable both will result in an exception.",
)
I can change this here and in #561 if it would make more sense to say the CLI will fail outright in these cases.
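The validation pattern discussed here (warn in the CLI, raise in the cluster class) can be sketched as follows. The function name and parameters are illustrative only, not the actual dask-cuda API; the sketch just shows the "trying to enable both will result in an exception" behavior the help text describes.

```python
# Hypothetical sketch of option validation: names are illustrative, not the
# real dask-cuda internals. The async allocator cannot be combined with an
# RMM pool or managed memory, so that combination raises.
def validate_rmm_options(pool: bool, managed_memory: bool, pool_async: bool) -> None:
    """Raise if the async allocator is combined with a pool or managed memory."""
    if pool_async and (pool or managed_memory):
        raise ValueError(
            "RMM pools and managed memory are incompatible with the "
            "asynchronous allocator; trying to enable both will result "
            "in an exception."
        )

# Compatible combinations pass silently...
validate_rmm_options(pool=True, managed_memory=False, pool_async=False)

# ...while incompatible ones raise.
try:
    validate_rmm_options(pool=True, managed_memory=False, pool_async=True)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```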
dask_cuda/local_cuda_cluster.py (outdated diff)
    both enabled, if RMM pools / managed memory and asynchronous allocator are both
    enabled, or if ``ucx_net_devices="auto"`` and:

    - UCX-Py is not installed or wasn't compiled with hwloc support or
Suggested change:
-    - UCX-Py is not installed or wasn't compiled with hwloc support or
+    - UCX-Py is not installed or wasn't compiled with hwloc support; or
dask_cuda/local_cuda_cluster.py (outdated diff)
    If ``ucx_net_devices=""``, if NVLink and RMM managed memory are
    both enabled, if RMM pools / managed memory and asynchronous allocator are both
    enabled, or if ``ucx_net_devices="auto"`` and:

    - UCX-Py is not installed or wasn't compiled with hwloc support or
    - ``enable_infiniband=False``
Seems like you accidentally removed a whitespace from the beginning of each line in this block, which will probably cause a complaint when the docstrings are parsed.
Good catch!
LGTM, thanks @charlesbluca!

@gpucibot merge
Closes #565

Adds the --rmm-pool-async / rmm_pool_async option to the CLI and cluster to enable the use of rmm.mr.CudaAsyncMemoryResource in the RMM initialization.
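A hedged usage sketch of the new option, based on the names given in this PR's description (rmm_pool_async for LocalCUDACluster, --rmm-pool-async for the CLI). Actually taking effect requires dask-cuda, a GPU, and CUDA 11.2 or newer; the guard lets the sketch degrade gracefully elsewhere.

```python
# Hedged usage sketch: option name taken from this PR's description, not
# verified against a released dask-cuda API.
try:
    from dask_cuda import LocalCUDACluster

    # rmm_pool_async enables rmm.mr.CudaAsyncMemoryResource on each worker,
    # replacing the default allocator (no pool, no managed memory).
    cluster = LocalCUDACluster(rmm_pool_async=True)
    cluster_started = True
except Exception:
    # ImportError without dask-cuda, or a runtime error without a suitable GPU
    cluster_started = False

print("cluster started:", cluster_started)
```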