Prevent DeviceBuffer DeviceMemoryResource premature release #931
Thanks for the PR! I think we need a better understanding of the problem, exactly when it arises, and why the changes in this PR fix it. If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault? Do you think maybe simply reversing the order of declaration of |
Thanks for taking a look!
I remember trying that; unfortunately, it didn't seem to work. What I noticed is that the DeviceMemoryResource's In the C++ code, we can see the following:
While
In my understanding,
Sure, I can help you reproduce it through NVIDIA's internal communication means. |
As an update: using no_gc_clear with |
I believe this should be a more minimal reproducer of the problem:

import rmm
import gc
class Foo():
    def __init__(self, x):
        self.x = x
rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
dbuf1 = rmm.DeviceBuffer(size=10)
# Make dbuf1 part of a reference cycle:
l1 = Foo(dbuf1)
l2 = Foo(dbuf1)
l1.l2 = l2
l2.l1 = l1
# due to the reference cycle, the device buffer doesn't actually get
# cleaned up until later, after we invoke `gc.collect()`:
del dbuf1, l1, l2
rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
# by now, the only remaining reference to the *original* memory
# resource should be in `dbuf1`. However, the cyclic garbage collector
# will eliminate that reference when it clears the object via its
# `tp_clear` method. Later, when `tp_dealloc` attempts to actually
# deallocate `dbuf1` (which needs the MR alive), a segfault occurs.
gc.collect()

As suspected, we are seeing exactly the situation described in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#disabling-cycle-breaking-tp-clear. Setting The one caveat with
I cannot think of a way to construct such a reference cycle. |
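The delayed cleanup that the reproducer above relies on can be observed without a GPU. The following is a minimal sketch, assuming CPython's reference counting plus cyclic garbage collector; `Payload` is a hypothetical stand-in for `rmm.DeviceBuffer`, and nothing here touches rmm itself. It only shows that an object reachable from a reference cycle survives `del` and is released once `gc.collect()` runs.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for rmm.DeviceBuffer (no GPU required)."""


class Foo:
    def __init__(self, x):
        self.x = x


def cycle_delays_cleanup():
    """Return (alive_after_del, alive_after_collect) for a payload held by a cycle."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep an automatic collection from running mid-demo
    try:
        payload = Payload()
        probe = weakref.ref(payload)  # lets us observe the payload's lifetime

        # Same shape as the reproducer: two objects that reference each
        # other and both hold the payload.
        a, b = Foo(payload), Foo(payload)
        a.other, b.other = b, a

        del payload, a, b
        alive_after_del = probe() is not None  # the cycle keeps it alive

        gc.collect()  # the cyclic collector breaks the cycle
        alive_after_collect = probe() is not None
    finally:
        if gc_was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


result = cycle_delays_cleanup()
print(result)
```

In the real reproducer, the second call to `set_current_device_resource` happens in the window between the `del` and the `gc.collect()`, which is why the buffer ends up holding the last reference to the original memory resource.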
@viclafargue I think we should avoid the indirection via |
Wrote a fix as discussed with @shwina through DM. It involves storing a |
The solution actually resulted in segfaults when using a |
rerun tests |
This looks good to me. Going to merge!
@gpucibot merge |
Please remove WIP from PRs before merging. |
Gave this PR a new title, posthumously. Would hate to end up with just "Fix for DeviceBuffer" in our release notes! @viclafargue Please be thoughtful and descriptive in PR titles. |
DeviceBuffer deallocation sometimes fails with a segfault. I haven't been able to create a minimal reproducer yet. However, a possible source of the problem is the way Cython releases the memory: the DeviceBuffer's DeviceMemoryResource appears to sometimes be released before the device_buffer. This causes a segfault in the device_buffer destructor, as the device memory resource pointer is set to null.

The code of this PR seems to fix the issue. It makes use of a holder class that seems to allow proper ordering of the class members' destruction.