Add RMM PyTorch allocator #1168
---

Should this use

```
from rmm._lib.memory_resource cimport get_current_device_resource
```

like in `device_buffer.pyx`?
---

Ah yeah, I think we should have a single declaration in `memory_resource.pxd`.
---

I added a new `per_device_resource.pxd` file from which we can `cimport` the function. Note that we cannot get away with declaring `get_current_device_resource` in `memory_resource.pxd`, because `memory_resource` exports a `cpdef` function also named `get_current_device_resource` that is a wrapper around the C++ function.
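For context, such a `cimport`-able declaration might look roughly like this (a hypothetical sketch of `per_device_resource.pxd`, not the PR's actual file; the header path follows RMM's C++ layout):

```cython
# Hypothetical sketch: declare the C++ function so other Cython modules can
# `cimport` it without clashing with the `cpdef` Python wrapper of the same
# name in memory_resource.
cdef extern from "rmm/mr/device/per_device_resource.hpp" \
        namespace "rmm::mr" nogil:
    cdef cppclass device_memory_resource:
        pass
    device_memory_resource* get_current_device_resource()
```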
---

Q: `device` is ignored by design? (I was reviewing cupy/cupy#7210 and noticed this.)
---

Ah, great catch! This brings out a subtle problem. In RMM, each device has its own memory resource. Thus, to do the allocation on a specified `device` with RMM, I would write the torch `allocate` function to look up that device's resource. Unfortunately, the deallocation function does not accept a `device` argument, so we cannot retrieve the memory resource that was used for allocation. I don't really see a way around this other than for the `deallocate` signature to include the `device` argument. cc: @emcastillo
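The pair of hooks being described might look like this (a hedged reconstruction, not the PR's actual code; `get_per_device_resource`, `cuda_device_id`, and `cuda_stream_view` follow RMM's C++ API):

```cython
# Hypothetical allocate hook: honor `device` by using that device's resource.
cdef public void* allocate(ssize_t size, int device, void* stream) \
        except * with gil:
    cdef device_memory_resource* mr = \
        get_per_device_resource(cuda_device_id(device))
    return mr.allocate(size, <cuda_stream_view>stream)

# The deallocate hook receives no `device`, so it can only consult the
# *current* device's resource -- which may not be the resource that
# allocated `ptr`. This is the subtle problem described above.
cdef public void deallocate(void* ptr, ssize_t size, void* stream) \
        except * with gil:
    cdef device_memory_resource* mr = get_current_device_resource()
    mr.deallocate(ptr, size, <cuda_stream_view>stream)
```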
---

I would love to know too. TBH I am puzzled by PyTorch's (long-time) behavior of asking for `device`. It should just honor the current device...
---

+1 to that
---

I'll submit a follow-up PR adding support for `device` once pytorch/pytorch#91398 is merged.
---

Why are these GIL-requiring functions? It seems like it's all pure C code here, no Python objects, etc.
---

To be clear, if they're getting called from GIL-requiring code in PyTorch, so be it. I just don't see a reason that these functions need to explicitly acquire the GIL. If PyTorch can call these in a nogil context, is there a reason for us not to allow that?
---

Because the `allocate()` and `deallocate()` methods can involve Python operations on Python objects, e.g., in `CallbackMemoryResource` or `FailureCallbackResourceAdaptor`.
---

I was thinking that would be handled automatically when those callbacks are invoked: the callbacks are stored as Python objects in the class, so any interaction with them should reacquire the GIL already, right? I guess the potential issue is that we cast these to `void *` pointers before passing them to the C++ classes, so at the point of the call we've lost Cython's safety net. Is that right? If so, we should consider (out of scope for this PR, of course) inserting the necessary Python C API calls into the relevant RMM C++ classes, i.e. in `failure_callback_resource_adaptor::do_allocate`.
---

That was my expectation as well. If the callback touches Python objects, shouldn't it be the responsibility of the callback to acquire/release the GIL?
---

My proposal above was wrong: there's no reason to embed this information in librmm, where the callbacks could be anything (not necessarily Python objects). However, it should be the responsibility of the callbacks in RMM's Cython code to acquire the GIL as needed, and we do appear to do this correctly already. The `_oom_callback_function` used by the `FailureCallbackResourceAdaptor` acquires the GIL before calling the user-provided callback, as do both the allocate and deallocate callbacks used by the `CallbackMemoryResource`.
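The pattern being described, roughly (an illustrative sketch, not the exact RMM code): a C-callable, `nogil` wrapper reacquires the GIL before touching the Python callable that was smuggled through a `void*` context pointer.

```cython
from libcpp cimport bool

# Illustrative: a nogil, C-callable wrapper around a user-provided Python
# callback that was stored as a void* context pointer.
cdef bool oom_callback_wrapper(size_t nbytes, void* ctx) noexcept nogil:
    with gil:
        callback = <object>ctx   # cast back to the stored Python object
        return callback(nbytes)  # safe to call: we hold the GIL here
```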
---

Yup -- the GIL should neither be released in C++, nor can it be released in Python. The Cython "wrapper" functions are what need to take on the responsibility of handling the GIL.
---

Vyas and I did a bit more exploration of exactly why we need a `with gil` here and ended up quite deep in the CPython and Cython internals (still without a clear answer, though). The symptom, though, is clear: take the example I have in the PR description, raise an error in `allocate_func`, and remove the `with gil` -- you'll see that the error is uncaught and eventually this segfaults.
---

Could this bite a user who accesses this attribute and passes it around thinking that it's fine when it's really just None? It might be safer to override `__getattr__` for the module and have it raise an error to prevent the user from accessing this attribute when `CUDAPluggableAllocator` failed to import.
---

Could we alternatively `pass` on `ImportError` to achieve the same effect as defining that module `__getattr__`?
---

I think you'd get close to the same effect, just a slightly less user-friendly version. With a `__getattr__` override you could provide a friendlier error message indicating that this happened because the torch allocator failed to import, whereas if you just avoid defining the attribute, the user will see an `AttributeError` without any additional diagnostics and may think it's a bug in RMM. It's a very minor point, though; I'm fine leaving this as is for now and only revisiting it in the future if we get a lot of user questions about why the allocator is None.
---

This is neat!
---

Q: Would this honor `rmm.reinitialize()` if a user changes the MR?
---

`rmm.reinitialize()` resets the default memory resource used by RMM. Each call to `allocate()` and `deallocate()` queries the default memory resource via a call to `get_current_device_resource()`, so -- yes.
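A toy model (not RMM code) of why this works: the resource is resolved per call rather than captured once, so swapping it later takes effect immediately.

```python
# Stand-in for RMM's per-device "current resource" state.
_registry = {"current": "cuda_mr"}

def get_current_device_resource():
    return _registry["current"]

def allocate(nbytes):
    # The resource is looked up at call time, not cached at setup time.
    return (get_current_device_resource(), nbytes)

def reinitialize(new_mr):
    # Stand-in for rmm.reinitialize(): replaces the current resource.
    _registry["current"] = new_mr
```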
---

Should this be set up as a `yield` fixture that resets the current device resource afterward?
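A yield-style fixture of the kind being suggested might look like this sketch; the resource getters here are stand-ins for `rmm.mr.get_current_device_resource` / `set_current_device_resource`, and in a real suite the generator would be decorated with `@pytest.fixture(autouse=True)`:

```python
# Stand-ins for the rmm.mr current-resource APIs (assumption for this sketch).
_state = {"mr": "default_mr"}

def get_current_device_resource():
    return _state["mr"]

def set_current_device_resource(mr):
    _state["mr"] = mr

def reset_current_mr():
    # Body of a yield fixture: code before `yield` runs before the test,
    # code after `yield` runs as teardown, restoring the saved resource.
    saved = get_current_device_resource()
    yield
    set_current_device_resource(saved)
```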
---

We already have an `autouse`, function-scoped fixture that does that: https://github.com/rapidsai/rmm/blob/branch-23.02/python/rmm/tests/test_rmm.py#L46. I'm guessing that should just work as expected?
---

What's the except/pass here for? Can we call `change_current_allocator` in an `else:` block from the first `try:`? Alternatively, can we use `pytest.importorskip`?
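For reference, `pytest.importorskip` returns the module when it imports and skips the requesting test otherwise, replacing a try/except `ImportError` guard. A stdlib module stands in for `torch` here so the example always runs:

```python
import pytest

# In the test suite this would be pytest.importorskip("torch"); tests in the
# module are skipped automatically if the import fails.
json = pytest.importorskip("json")
data = json.loads('{"allocator": "rmm"}')
```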
---

Ah, I think it's because a `RuntimeError` is raised if you set the torch allocator twice in the same session. Maybe we should use a session-scoped fixture instead?