Add RMM PyTorch allocator #1168

Q: `device` is ignored by design? (I was reviewing cupy/cupy#7210 and noticed this.)

Ah, great catch! This brings out a subtle problem: in RMM, each device has its own memory resource. Thus, to do the allocation on a specified `device` with RMM, I would write the torch `allocate` function to look up that device's memory resource, as sketched below. Unfortunately, the deallocation function does not accept a `device` argument, so we cannot retrieve the memory resource that was used for allocation. I don't really see a way around this other than for the `deallocate` signature to include the `device` argument. cc: @emcastillo

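A minimal Python-level sketch of the asymmetry (the real hooks are Cython; this assumes `rmm.mr.get_per_device_resource` and the Python-level `allocate`/`deallocate` methods on memory resources, and elides stream handling):

```python
import rmm

def allocate(size, device, stream):
    # With a `device` argument we can look up that device's memory resource.
    mr = rmm.mr.get_per_device_resource(device)
    return mr.allocate(size)

def deallocate(ptr, size, stream):
    # No `device` argument: we can only query the *current* device's
    # resource, which may not be the one that performed the allocation.
    mr = rmm.mr.get_current_device_resource()
    mr.deallocate(ptr, size)
```
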
I would love to know too. TBH I am puzzled by PyTorch's (long-time) behavior of asking for `device`. It should just honor the current device...

+1 to that

I'll submit a follow-up PR adding support for `device`, once pytorch/pytorch#91398 is merged.

Why are these GIL-requiring functions? It seems like it's all pure C code here, no Python objects, etc.

To be clear, if they're getting called from GIL-holding code in PyTorch, so be it. I just don't see a reason that these functions need to explicitly acquire the GIL. If PyTorch can call these in a `nogil` context, is there a reason for us not to allow that?

Because the `allocate()` and `deallocate()` methods can involve Python operations on Python objects, e.g., in `CallbackMemoryResource` or `FailureCallbackResourceAdaptor`.

I was thinking that would automatically be handled when those callbacks are invoked, since they are stored as Python objects in the class; any interaction with them should reacquire the GIL already, right? I guess the potential issue is that we cast these to `void *` pointers before passing them to the C++ classes, so at the point of the call we've lost Cython's safety net. Is that right? If so, we should consider (out of scope for this PR, of course) inserting the necessary Python C API calls into the relevant rmm C++ classes, i.e., in `failure_callback_resource_adaptor::do_allocate`.

That was my expectation as well. If the callback touches Python objects, shouldn't it be the responsibility of the callback to acquire/release the GIL?

My proposal above was wrong; there's no reason to embed this information in librmm, where the callbacks could be anything (not necessarily Python objects). However, it should be the responsibility of the callbacks in rmm's Cython code to acquire the GIL as needed, and we do appear to do this correctly already. The `_oom_callback_function` used by the `FailureCallbackResourceAdaptor` acquires the GIL before calling the user-provided callback, as do both the allocate and deallocate callbacks used by the `CallbackMemoryResource`.

Yup -- the GIL should neither be released in C++, nor can it be released in Python. The Cython "wrapper" functions are what need to take on the responsibility of handling the GIL.

Vyas and I did a bit more exploration of exactly why we need a `with gil` here and ended up quite deep in the CPython and Cython internals (still without a clear answer, though). The symptom, though, is clear: if you take the example I have in the PR description, raise an error in `allocate_func`, and remove the `with gil`, you'll see that the error is uncaught and eventually this segfaults.

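A sketch of that repro, reconstructed under assumptions rather than copied from the PR description (the MR and allocator names are rmm's public API; the failing `allocate_func` is the point):

```python
import rmm
import torch

def allocate_func(size):
    # Deliberately fail: without `with gil` on the Cython hook, this
    # exception is not surfaced and the process eventually segfaults.
    raise MemoryError("allocation refused")

def deallocate_func(ptr, size):
    pass

rmm.mr.set_current_device_resource(
    rmm.mr.CallbackMemoryResource(allocate_func, deallocate_func)
)
torch.cuda.memory.change_current_allocator(rmm.rmm_torch_allocator)
torch.empty(10**6, device="cuda")  # should raise MemoryError, not crash
```
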
Could this bite a user that accesses this attribute and passes it around, thinking that it's fine when it's really just None? It might be safer to override `__getattr__` for the module and have it raise an error to prevent the user from accessing this attribute when `CUDAPluggableAllocator` failed to import.

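A sketch of that module-level `__getattr__` (PEP 562); the guard structure and error message here are hypothetical, not the PR's code:

```python
try:
    from torch.cuda.memory import CUDAPluggableAllocator
except ImportError:
    CUDAPluggableAllocator = None

if CUDAPluggableAllocator is not None:
    rmm_torch_allocator = ...  # built from CUDAPluggableAllocator as in the PR

def __getattr__(name):
    # Only consulted when normal attribute lookup fails (PEP 562), i.e.
    # when the import above failed and rmm_torch_allocator was never set.
    if name == "rmm_torch_allocator":
        raise AttributeError(
            "rmm_torch_allocator is unavailable: this PyTorch version does "
            "not provide torch.cuda.memory.CUDAPluggableAllocator"
        )
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```
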
Could we alternatively `pass` on `ImportError` to achieve the same effect as defining that module `__getattr__`?

I think you'd get close to the same effect, just a slightly less user-friendly version. With a `__getattr__` override you could provide a friendlier error message indicating that this happened because the torch allocator failed to import, whereas if you just avoid defining it, the user will see an `AttributeError` without any additional diagnostics and may think it's a bug in rmm. It's a very minor point, though; I'm fine leaving this as-is for now and only revisiting in the future if we get a lot of user questions about why the allocator is None.

This is neat!

Q: Would this honor `rmm.reinitialize()` if a user changes the MR?

`rmm.reinitialize()` resets the default memory resource used by RMM. Each call to `allocate()` and `deallocate()` queries the default memory resource via a call to `get_current_device_resource()`, so -- yes.

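A sketch of that behavior (`pool_allocator=True` is just one example of changing the default MR):

```python
import rmm
import torch

torch.cuda.memory.change_current_allocator(rmm.rmm_torch_allocator)

rmm.reinitialize(pool_allocator=True)  # replace the default MR with a pool
x = torch.zeros(1024, device="cuda")   # allocate() now queries the pool MR
```
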
There's a pytest utility for this if you want to use it: `pytest.importorskip`.

Does that handle `import`ing specific functions? This is using `importorskip` for `torch` generally above. The `torch.cuda.memory` module has been around for a while, though the functionality we need from it is pretty new. Maybe in the future this could require a specific PyTorch version; there doesn't seem to be one yet that has what we need, though.

No, `pytest.importorskip` only handles modules, and you have to use attribute accessors to get functions. It's kind of messy. The current solution is probably easier to read; let's keep it as-is.

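For illustration, the messier `importorskip` version might look like this (the extra `hasattr` guard is hypothetical):

```python
import pytest

# importorskip skips only on a missing *module*; a missing function must
# then be handled separately via attribute access.
memory = pytest.importorskip("torch.cuda.memory")
if not hasattr(memory, "change_current_allocator"):
    pytest.skip(
        "torch.cuda.memory.change_current_allocator is unavailable",
        allow_module_level=True,
    )
change_current_allocator = memory.change_current_allocator
```
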
How does this function behave if passed None (the case where the torch allocator hasn't been defined)?

Hmm - I wouldn't expect it to be `None` if we were able to import `change_current_allocator`, since the existence of `change_current_allocator` implies that `rmm_torch_allocator` was defined (although somewhat implicitly: `change_current_allocator` and `CudaPluggableAllocator` in PyTorch were introduced together). Should we also skip this test if `rmm.rmm_torch_allocator` is `None` for some reason?

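That extra skip would be a one-liner; a sketch (the test name is hypothetical):

```python
import pytest
import rmm

@pytest.mark.skipif(
    rmm.rmm_torch_allocator is None,
    reason="rmm_torch_allocator was not defined at import time",
)
def test_rmm_torch_allocator():
    ...
```
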
Ah, OK, I wasn't sure about that; I thought they weren't introduced entirely concurrently. Up to you on the extra skip; it sounds like it would be pedantically correct but not practically necessary.