
Add support for CudaAsyncMemoryResource #566

Merged: 11 commits merged into rapidsai:branch-0.20 on Apr 12, 2021

Conversation

@charlesbluca (Member)

Closes #565

Adds the --rmm-pool-async/rmm_pool_async option to the CLI and to the cluster class, enabling the use of rmm.mr.CudaAsyncMemoryResource during RMM initialization.
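
Below is a minimal usage sketch for the option as described above. It is not taken from the PR itself: the `rmm_pool_async` keyword mirrors the name given in this description (the option is renamed during the review below), and everything else uses standard dask/distributed and RMM calls.

```python
# Sketch only: `rmm_pool_async` is the keyword named in the PR description;
# the discussion below settles on an async-allocator flag instead.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import rmm


def current_mr_type():
    # Report which RMM memory resource is active on a worker.
    return type(rmm.mr.get_current_device_resource()).__name__


if __name__ == "__main__":
    with LocalCUDACluster(rmm_pool_async=True) as cluster, Client(cluster) as client:
        # On CUDA 11.2+ one would expect "CudaAsyncMemoryResource" per worker.
        print(client.run(current_mr_type))
```

The equivalent CLI form would be something like `dask-cuda-worker <scheduler-address> --rmm-pool-async`.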

@charlesbluca added the labels python (python code needed), 3 - Ready for Review (Ready for review by team), feature request (New feature or request), non-breaking (Non-breaking change) on Apr 9, 2021
@charlesbluca requested a review from a team as a code owner on April 9, 2021 18:28
@charlesbluca (Member, Author) left a review comment:

Some small questions/comments:

dask_cuda/cli/dask_cuda_worker.py (review thread, resolved)
dask_cuda/utils.py (review thread, resolved)
@charlesbluca (Member, Author)

It looks like CUDA 11.2 has different behavior, in that the memory resource remains an rmm.mr.PoolMemoryResource regardless of whether rmm_pool_async is enabled. Is there a way we can check the CUDA version in the tests to handle this?

@pentschev (Member)

cudaMallocAsync was only introduced in CUDA 11.2; https://github.com/rapidsai/rmm/blob/80bfeb2816845da5921551ae5c158e92427bec86/python/rmm/tests/test_rmm.py#L511-L539 will give you some ideas on how to test that async is active and on how to skip tests for older CUDA versions.
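
For reference, a hedged sketch (not taken from the linked RMM tests) of what a CUDA-version-gated skip could look like; it assumes numba.cuda is available to query the CUDA runtime version, whereas the linked tests use RMM's own helpers:

```python
import pytest
from numba import cuda

# CUDA runtime version as a (major, minor) tuple, e.g. (11, 2).
_CUDA_RUNTIME_VERSION = cuda.runtime.get_version()

# cudaMallocAsync (and hence CudaAsyncMemoryResource) needs CUDA 11.2 or newer.
requires_cuda_11_2 = pytest.mark.skipif(
    _CUDA_RUNTIME_VERSION < (11, 2),
    reason="cudaMallocAsync requires CUDA 11.2 or newer",
)


@requires_cuda_11_2
def test_async_allocator_option():
    # Test body exercising the async allocator would go here.
    ...
```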

@charlesbluca (Member, Author) commented on Apr 9, 2021

Thanks @pentschev! In that case, I'm curious what's happening in the CUDA 11.0 tests, as I would expect a RuntimeError when trying to set up async there.

EDIT:

I'm assuming the workers return CudaMemoryResource from rmm.mr.get_current_device_resource_type in the CUDA 11.0 cases because RMMSetup is silently failing to allocate a pool, similar to what's happening in #443. With that in mind, should we add some form of check that the user is on CUDA 11.2 if they enable async, or wait on dask/distributed#4297 to begin handling worker plugin exceptions?

@codecov-io commented on Apr 9, 2021

Codecov Report

Merging #566 (22e0aa2) into branch-0.20 (cb9bf35) will decrease coverage by 0.47%.
The diff coverage is 52.32%.

❗ Current head 22e0aa2 differs from pull request most recent head fb89979. Consider uploading reports for the commit fb89979 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.20     #566      +/-   ##
===============================================
- Coverage        61.06%   60.59%   -0.48%     
===============================================
  Files               22       22              
  Lines             2571     2644      +73     
===============================================
+ Hits              1570     1602      +32     
- Misses            1001     1042      +41     
Impacted Files Coverage Δ
dask_cuda/device_host_file.py 67.64% <30.61%> (-20.99%) ⬇️
dask_cuda/cuda_worker.py 78.04% <50.00%> (-0.71%) ⬇️
dask_cuda/local_cuda_cluster.py 78.78% <66.66%> (-2.61%) ⬇️
dask_cuda/utils.py 89.47% <85.71%> (-0.37%) ⬇️
dask_cuda/cli/dask_cuda_worker.py 95.45% <100.00%> (ø)
dask_cuda/explicit_comms/comms.py 98.87% <100.00%> (+0.09%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3439b83...fb89979. Read the comment docs.

@pentschev (Member)

> I'm assuming the workers return CudaMemoryResource from rmm.mr.get_current_device_resource_type in the CUDA 11.0 cases because RMMSetup is silently failing to allocate a pool, similar to what's happening in #443. With that in mind, should we add some form of check that the user is on CUDA 11.2 if they enable async, or wait on dask/distributed#4297 to begin handling worker plugin exceptions?

It's possible, but the discussion seems to point towards not wanting an async pool, but rather just switching to the async allocator, as commented in #565 (comment). Could you update the PR here to reflect that? I think that will also make the assertion easier; you can probably then just do the same as https://github.com/rapidsai/rmm/blob/80bfeb2816845da5921551ae5c158e92427bec86/python/rmm/tests/test_rmm.py#L519-L521.
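
Paraphrasing the linked RMM test lines (a sketch, not a verbatim copy), the suggested assertion amounts to switching to the async allocator and checking the active resource type:

```python
import rmm

# Switch the current device's memory resource to the async allocator...
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# ...and assert that it is now the active resource type.
assert rmm.mr.get_current_device_resource_type() is rmm.mr.CudaAsyncMemoryResource
```

In dask-cuda tests, the same check could be run on each worker via `client.run`.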

@charlesbluca (Member, Author)

Sure! Does this mean that async can be enabled alongside RMM pools and managed memory?

@pentschev (Member)

No, that means we're changing the allocator to the async allocator (which is neither the default nor the managed memory allocator), and without a pool. In summary, enabling the async allocator (something like --rmm-async-malloc) is mutually exclusive with both --rmm-pool-size and --rmm-managed-memory, and attempting to combine them would raise an error.
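
A hedged sketch of the kind of mutual-exclusion check described above; the function and parameter names are hypothetical and simply mirror the CLI flags mentioned in the comment:

```python
def validate_rmm_options(rmm_async_malloc, rmm_pool_size, rmm_managed_memory):
    """Hypothetical helper: reject incompatible RMM option combinations."""
    if rmm_async_malloc and (rmm_pool_size is not None or rmm_managed_memory):
        raise ValueError(
            "The RMM async allocator is incompatible with RMM pools and managed "
            "memory; do not specify --rmm-pool-size or --rmm-managed-memory "
            "together with --rmm-async-malloc."
        )


# Raises ValueError: async allocator combined with a pool size.
# validate_rmm_options(rmm_async_malloc=True, rmm_pool_size="1GB", rmm_managed_memory=False)
```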

@charlesbluca (Member, Author) left a review comment:

Updated to reflect this; some questions:

dask_cuda/utils.py (review thread, resolved)
dask_cuda/utils.py (review thread, resolved)
-    incompatible with RMM pools and managed memory, and will be preferred over them
-    if both are enabled.""",
+    incompatible with RMM pools and managed memory, trying to enable both will
+    result in an exception.""",
Member:

Since this is the CLI, is this really going to raise an exception, or fail with an error instead? I'm not really sure what happens when an exception is raised in the CUDAWorker class when running the CLI tool.

@charlesbluca (Member, Author):

It looks like this fails with an error - I mostly just copied the phrasing used to describe the behavior of using NVLink + managed memory:

"WARNING: managed memory is currently incompatible with NVLink, "
"trying to enable both will result in an exception.",
)

I can change this here and in #561 if it would make more sense to say the CLI will fail outright in these cases.

both enabled, if RMM pools / managed memory and asynchronous allocator are both
enabled, or if ``ucx_net_devices="auto"`` and:

- UCX-Py is not installed or wasn't compiled with hwloc support or
Member:

Suggested change:
-    - UCX-Py is not installed or wasn't compiled with hwloc support or
+    - UCX-Py is not installed or wasn't compiled with hwloc support; or

Comment on lines 154 to 159
If ``ucx_net_devices=""``, if NVLink and RMM managed memory are
both enabled, if RMM pools / managed memory and asynchronous allocator are both
enabled, or if ``ucx_net_devices="auto"`` and:

- UCX-Py is not installed or wasn't compiled with hwloc support or
- ``enable_infiniband=False``
Member:

Seems like you accidentally removed the whitespace from the beginning of each line in this block, which will probably result in a complaint when the docstrings are parsed.

@charlesbluca (Member, Author):

Good catch!

@pentschev (Member) left a review comment:

LGTM, thanks @charlesbluca !

@charlesbluca added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label on Apr 12, 2021
@charlesbluca (Member, Author)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fde564e into rapidsai:branch-0.20 Apr 12, 2021
Labels: 5 - Ready to Merge (Testing and reviews complete, ready to merge), feature request (New feature or request), non-breaking (Non-breaking change), python (python code needed)
Development

Successfully merging this pull request may close the following issue:

Extend CLI and LocalCluster to support CUDA Async Memory Resource (#565)
4 participants