Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Consider cleaning up rmm global state modification in spilling #14960

Closed
vyasr opened this issue Feb 2, 2024 · 1 comment
Closed

[BUG] Consider cleaning up rmm global state modification in spilling #14960

vyasr opened this issue Feb 2, 2024 · 1 comment
Assignees
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working

Comments

@vyasr
Copy link
Contributor

vyasr commented Feb 2, 2024

Describe the bug
Currently the SpillManager modifies rmm's current device resource on construction to allow itself to detect when allocations fail and trigger spilling. However, this memory resource change is not reverted if the manager goes out of scope. This causes problems in situations where managers may be created and then deleted as was discovered in #14958.

Expected behavior
Either the spill manager should attempt to remove the modification of the mr when it is garbage collected, or the test that was skipped in #14958 should be removed. If we attempt the former, one of the challenges will be that the user could have set a different memory resource themselves.

In fact, this reveals a flaw in the current spilling logic whereby a user could enable spilling but then set the memory resource, which would invalidate the spilling approach. I'm not sure that there is a safe way to handle this right now other than modifying cudf to store a default memory resource that is passed to every algorithm rather than relying on rmm's default memory resource. That way, when spilling is enabled cudf could modify its default memory resource, and users modifying rmm's memory resource would have no effect, while attempting to modify cudf's memory resource would either raise an error or handle doing this in a spilling-safe manner (i.e. by wrapping the new memory resource in the callback adaptor).

This dovetails with some discussions in #14229 around the fact that default memory resources in libcudf's function signatures open us up to some pretty subtle ways because cuDF Python often leverages libcudf and [lib]rmm in nontrivial ways.

@vyasr vyasr added bug Something isn't working 0 - Backlog In queue waiting for assignment labels Feb 2, 2024
@madsbk
Copy link
Member

madsbk commented Jul 22, 2024

This should be fixed in #15436, which introduces the spill_on_demand_globally context manager

@madsbk madsbk closed this as completed Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants