
Add support for CudaAsyncMemoryResource #566

Merged: 11 commits merged into rapidsai:branch-0.20 on Apr 12, 2021

Conversation

@charlesbluca (Member)

Closes #565

Adds the --rmm-pool-async/rmm_pool_async option to the CLI and to the cluster class, enabling the use of rmm.mr.CudaAsyncMemoryResource during RMM initialization.
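
Below is a minimal usage sketch for the option as described above. It is not taken from the PR itself: the `rmm_pool_async` keyword mirrors the name given in this description (the option is renamed during the review below), and everything else uses standard dask/distributed and RMM calls.

```python
# Sketch only: `rmm_pool_async` is the keyword named in the PR description;
# the discussion below settles on an async-allocator flag instead.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import rmm


def current_mr_type():
    # Report which RMM memory resource is active on a worker.
    return type(rmm.mr.get_current_device_resource()).__name__


if __name__ == "__main__":
    with LocalCUDACluster(rmm_pool_async=True) as cluster, Client(cluster) as client:
        # On CUDA 11.2+ one would expect "CudaAsyncMemoryResource" per worker.
        print(client.run(current_mr_type))
```

The equivalent CLI form would be something like `dask-cuda-worker <scheduler-address> --rmm-pool-async`.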

@charlesbluca added the labels python (python code needed), 3 - Ready for Review (Ready for review by team), feature request (New feature or request), non-breaking (Non-breaking change) on Apr 9, 2021
@charlesbluca requested a review from a team as a code owner on April 9, 2021 18:28
@charlesbluca (Member, Author) left a review comment:

Some small questions/comments:

dask_cuda/cli/dask_cuda_worker.py (review thread, resolved)
dask_cuda/utils.py (review thread, resolved)
@charlesbluca (Member, Author)

It looks like CUDA 11.2 has different behavior, in that the memory resource remains an rmm.mr.PoolMemoryResource regardless of whether rmm_pool_async is enabled. Is there a way we can check the CUDA version in the tests to handle this?

@pentschev (Member)

cudaMallocAsync was only introduced in CUDA 11.2; https://github.com/rapidsai/rmm/blob/80bfeb2816845da5921551ae5c158e92427bec86/python/rmm/tests/test_rmm.py#L511-L539 will give you some ideas on how to test that async is active and on how to skip tests for older CUDA versions.
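
For reference, a hedged sketch (not taken from the linked RMM tests) of what a CUDA-version-gated skip could look like; it assumes numba.cuda is available to query the CUDA runtime version, whereas the linked tests use RMM's own helpers:

```python
import pytest
from numba import cuda

# CUDA runtime version as a (major, minor) tuple, e.g. (11, 2).
_CUDA_RUNTIME_VERSION = cuda.runtime.get_version()

# cudaMallocAsync (and hence CudaAsyncMemoryResource) needs CUDA 11.2 or newer.
requires_cuda_11_2 = pytest.mark.skipif(
    _CUDA_RUNTIME_VERSION < (11, 2),
    reason="cudaMallocAsync requires CUDA 11.2 or newer",
)


@requires_cuda_11_2
def test_async_allocator_option():
    # Test body exercising the async allocator would go here.
    ...
```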

@charlesbluca (Member, Author) commented on Apr 9, 2021

Thanks @pentschev! In that case, I'm curious what's happening in the CUDA 11.0 tests, as I would expect a RuntimeError when trying to set up async there.

EDIT:

I'm assuming the workers return CudaMemoryResource from rmm.mr.get_current_device_resource_type in the CUDA 11.0 cases because RMMSetup is silently failing to allocate a pool, similar to what's happening in #443. With that in mind, should we add some form of check that the user is on CUDA 11.2 if they enable async, or wait on dask/distributed#4297 to begin handling worker plugin exceptions?

@codecov-io commented on Apr 9, 2021

Codecov Report

Merging #566 (22e0aa2) into branch-0.20 (cb9bf35) will decrease coverage by 0.47%.
The diff coverage is 52.32%.

❗ Current head 22e0aa2 differs from pull request most recent head fb89979. Consider uploading reports for the commit fb89979 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.20     #566      +/-   ##
===============================================
- Coverage        61.06%   60.59%   -0.48%     
===============================================
  Files               22       22              
  Lines             2571     2644      +73     
===============================================
+ Hits              1570     1602      +32     
- Misses            1001     1042      +41     
Impacted Files Coverage Δ
dask_cuda/device_host_file.py 67.64% <30.61%> (-20.99%) ⬇️
dask_cuda/cuda_worker.py 78.04% <50.00%> (-0.71%) ⬇️
dask_cuda/local_cuda_cluster.py 78.78% <66.66%> (-2.61%) ⬇️
dask_cuda/utils.py 89.47% <85.71%> (-0.37%) ⬇️
dask_cuda/cli/dask_cuda_worker.py 95.45% <100.00%> (ø)
dask_cuda/explicit_comms/comms.py 98.87% <100.00%> (+0.09%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3439b83...fb89979. Read the comment docs.

@pentschev (Member)

> I'm assuming the workers return CudaMemoryResource from rmm.mr.get_current_device_resource_type in the CUDA 11.0 cases because RMMSetup is silently failing to allocate a pool, similar to what's happening in #443. With that in mind, should we add some form of check that the user is on CUDA 11.2 if they enable async, or wait on dask/distributed#4297 to begin handling worker plugin exceptions?

It's possible, but the discussion seems to point towards not wanting an async pool, but rather just switching to the async allocator, as commented in #565 (comment). Could you update the PR here to reflect that? I think that will also make the assertion easier; you can probably then just do the same as https://github.com/rapidsai/rmm/blob/80bfeb2816845da5921551ae5c158e92427bec86/python/rmm/tests/test_rmm.py#L519-L521.
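
Paraphrasing the linked RMM test lines (a sketch, not a verbatim copy), the suggested assertion amounts to switching to the async allocator and checking the active resource type:

```python
import rmm

# Switch the current device's memory resource to the async allocator...
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# ...and assert that it is now the active resource type.
assert rmm.mr.get_current_device_resource_type() is rmm.mr.CudaAsyncMemoryResource
```

In dask-cuda tests, the same check could be run on each worker via `client.run`.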

@charlesbluca (Member, Author)

Sure! Does this mean that async can be enabled alongside RMM pools and managed memory?

@pentschev (Member)

No, that means we're changing the allocator to the async allocator (which is neither the default nor the managed memory allocator), and without a pool. In summary, enabling the async allocator (something like --rmm-async-malloc) is mutually exclusive with both --rmm-pool-size and --rmm-managed-memory, and attempting to combine them would raise an error.
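
A hedged sketch of the kind of mutual-exclusion check described above; the function and parameter names are hypothetical and simply mirror the CLI flags mentioned in the comment:

```python
def validate_rmm_options(rmm_async_malloc, rmm_pool_size, rmm_managed_memory):
    """Hypothetical helper: reject incompatible RMM option combinations."""
    if rmm_async_malloc and (rmm_pool_size is not None or rmm_managed_memory):
        raise ValueError(
            "The RMM async allocator is incompatible with RMM pools and managed "
            "memory; do not specify --rmm-pool-size or --rmm-managed-memory "
            "together with --rmm-async-malloc."
        )


# Raises ValueError: async allocator combined with a pool size.
# validate_rmm_options(rmm_async_malloc=True, rmm_pool_size="1GB", rmm_managed_memory=False)
```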

@charlesbluca (Member, Author) left a review comment:

Updated to reflect this; some questions:

dask_cuda/utils.py (review thread, resolved)
dask_cuda/utils.py (review thread, resolved)
-    incompatible with RMM pools and managed memory, and will be preferred over them
-    if both are enabled.""",
+    incompatible with RMM pools and managed memory, trying to enable both will
+    result in an exception.""",
Member:

Since this is the CLI, is this really going to raise an exception, or fail with an error instead? I'm not really sure what happens when an exception is raised in the CUDAWorker class when running the CLI tool.

@charlesbluca (Member, Author):

It looks like this fails with an error - I mostly just copied the phrasing used to describe the behavior of using NVLink + managed memory:

"WARNING: managed memory is currently incompatible with NVLink, "
"trying to enable both will result in an exception.",
)

I can change this here and in #561 if it would make more sense to say the CLI will fail outright in these cases.

both enabled, if RMM pools / managed memory and asynchronous allocator are both
enabled, or if ``ucx_net_devices="auto"`` and:

- UCX-Py is not installed or wasn't compiled with hwloc support or
Member:

Suggested change:
-    - UCX-Py is not installed or wasn't compiled with hwloc support or
+    - UCX-Py is not installed or wasn't compiled with hwloc support; or

Comment on lines 154 to 159
If ``ucx_net_devices=""``, if NVLink and RMM managed memory are
both enabled, if RMM pools / managed memory and asynchronous allocator are both
enabled, or if ``ucx_net_devices="auto"`` and:

- UCX-Py is not installed or wasn't compiled with hwloc support or
- ``enable_infiniband=False``
Member:

Seems like you accidentally removed the whitespace from the beginning of each line in this block, which will probably result in a complaint when the docstrings are parsed.

@charlesbluca (Member, Author):

Good catch!

@pentschev (Member) left a review comment:

LGTM, thanks @charlesbluca !

@charlesbluca added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label on Apr 12, 2021
@charlesbluca (Member, Author)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fde564e into rapidsai:branch-0.20 Apr 12, 2021
Labels: 5 - Ready to Merge (Testing and reviews complete, ready to merge), feature request (New feature or request), non-breaking (Non-breaking change), python (python code needed)
Development

Successfully merging this pull request may close the following issue:

Extend CLI and LocalCluster to support CUDA Async Memory Resource (#565)
4 participants