Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JNI] remove rmm argument to set rw access for fabric handles #17553

Merged

Conversation

abellina
Copy link
Contributor

@abellina abellina commented Dec 9, 2024

Description

This is a follow up from #17526, where fabric handles can be enabled from RMM. That PR also sets the memory access protection flag (cudaMemPoolSetAccess), but I have learned that this second flag is not needed from the owner device. In fact, it causes confusion because the owning device fails to call this function with some of the flags (access none). cudaMemPoolSetAccess is meant to only be called from peer processes that have imported the pool's handle. In our case, UCX handles this from the peer's side and it does not need to be anywhere in RMM or cuDF.

Sorry for the noise. I'd like to get this fix in, and then I am going to fix RMM by removing that API.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@abellina abellina requested a review from a team as a code owner December 9, 2024 21:00
Copy link

copy-pr-bot bot commented Dec 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Dec 9, 2024
@abellina abellina added the bug Something isn't working label Dec 9, 2024
@abellina abellina removed their assignment Dec 9, 2024
@abellina abellina added the non-breaking Non-breaking change label Dec 9, 2024
@abellina
Copy link
Contributor Author

abellina commented Dec 9, 2024

/ok to test

@abellina
Copy link
Contributor Author

abellina commented Dec 9, 2024

/merge

@rapids-bot rapids-bot bot merged commit 5b412dc into rapidsai:branch-25.02 Dec 9, 2024
105 of 106 checks passed
@abellina abellina deleted the simplify_add_fabric_handles branch December 9, 2024 22:55
rapids-bot bot pushed a commit to rapidsai/rmm that referenced this pull request Dec 9, 2024
Closes #1753
It is a follow up from #1743

I would like for rapidsai/cudf#17553 to merge first, that way I don't break the build.

I've learned that I was using `cudaMemPoolSetAccess` incorrectly. This API should only be used from a `peer` device, not from the device that created the pool. This is the reason why calling `cudaMemPoolSetAccess` with none throws an error as documented here #1753. I have tested that I can still export the fabric handles and import them using UCX in a peer device with the default access that pool owner device gets (read+write is the default). Note that this read+write default access cannot be revoked from the owner, as it wouldn't make sense to have memory that nobody has access to, but peers can call `cudaMemPoolSetAccess` to gain read+write access or to stop accessing (none) a peer's pool memory.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1754
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants