
Add RMM PyTorch allocator #1168

Merged (18 commits) on Jan 5, 2023
Conversation

@shwina (Contributor) commented Nov 29, 2022:

Closes #1144

This PR adds an RMM-based allocator for PyTorch, rmm.rmm_torch_allocator.

This enables, e.g., using the same memory pool in code that uses both RAPIDS and PyTorch. It also enables PyTorch to use all of the different memory resources provided by RMM. For example:

import rmm
import torch

torch.cuda.memory.change_current_allocator(rmm.rmm_torch_allocator)

base_mr = rmm.mr.CudaMemoryResource()

def allocate_func(size):
    print(f"Allocating {size} bytes")
    return base_mr.allocate(size)

def deallocate_func(ptr, size):
    print(f"Deallocating {size} bytes")
    return base_mr.deallocate(ptr, size)

rmm.mr.set_current_device_resource(
    rmm.mr.CallbackMemoryResource(allocate_func, deallocate_func)
)

x = torch.tensor([1, 2]).cuda()
del x
y = torch.tensor([1, 2, 3]).cuda()
del y

Output:

Allocating 16 bytes
Deallocating 16 bytes
Allocating 24 bytes
Deallocating 24 bytes

@github-actions github-actions bot added the CMake and Python (Related to RMM Python API) labels Nov 29, 2022
@shwina shwina added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Nov 30, 2022
@shwina shwina marked this pull request as ready for review November 30, 2022 21:05
@shwina shwina requested review from a team as code owners November 30, 2022 21:05
@leofang (Member) left a comment:

LGTM overall, left a few questions.

except ImportError:
    rmm_torch_allocator = None
else:
    _alloc_free_lib_path = rmm._lib.torch_allocator.__file__
Member:

This is neat!

python/rmm/tests/test_rmm.py (review thread, resolved)
Comment on lines +246 to +250
rmm_torch_allocator = CUDAPluggableAllocator(
    _alloc_free_lib_path,
    alloc_fn_name="allocate",
    free_fn_name="deallocate",
)
Member:

Q: Would this honor rmm.reinitialize() if a user changes the MR?

@shwina (Contributor Author):

rmm.reinitialize() resets the default memory resource used by RMM. Each call to allocate() and deallocate() queries the default memory resource via a call to get_current_device_resource(), so -- yes.
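
As a minimal sketch of what that implies (illustrative only, not code from this PR), switching the default resource with rmm.reinitialize() should be reflected in subsequent torch allocations:

import rmm
import torch

torch.cuda.memory.change_current_allocator(rmm.rmm_torch_allocator)

# Swap the default device resource for a pool; later torch allocations
# should be served from this pool, because the allocator calls
# get_current_device_resource() on every allocate/deallocate.
rmm.reinitialize(pool_allocator=True)

x = torch.ones(1024, device="cuda")  # allocated via the RMM pool resource
del x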

@bdice (Contributor) left a comment:

This is awesome. A few comments.



cdef extern from "rmm/mr/device/per_device_resource.hpp" namespace "rmm" nogil:
    cdef device_memory_resource* get_current_device_resource \
Contributor:

Should this use from rmm._lib.memory_resource cimport get_current_device_resource like in device_buffer.pyx?

@shwina (Contributor Author):

Ah yeah I think we should have a single declaration in memory_resource.pxd

@shwina (Contributor Author):

I added a new per_device_resource.pxd file from which we can cimport the function.

Note that we cannot get away with declaring get_current_device_resource in memory_resource.pxd, because memory_resource exports a cpdef function also named get_current_device_resource that is a wrapper around the C++ function.
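
For reference, a declaration along those lines might look like this (an illustrative sketch, not the merged per_device_resource.pxd):

# per_device_resource.pxd (sketch; mirrors the excerpt quoted above)
from rmm._lib.memory_resource cimport device_memory_resource

cdef extern from "rmm/mr/device/per_device_resource.hpp" namespace "rmm" nogil:
    cdef device_memory_resource* get_current_device_resource \
        "rmm::mr::get_current_device_resource" ()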


cuda_mr = rmm.mr.CudaMemoryResource()
@pytest.fixture
def stats_mr():
Contributor:

Should this be set up as a yield fixture and reset the current device resource afterward?

@shwina (Contributor Author):

We already have an autouse function scoped fixture that does that: https://github.com/rapidsai/rmm/blob/branch-23.02/python/rmm/tests/test_rmm.py#L46. I'm guessing that should just work as expected?
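
The fixture being referred to is roughly of this shape (a sketch, not the actual test_rmm.py code; the fixture name is made up):

import pytest
import rmm

@pytest.fixture(autouse=True)
def restore_current_device_resource():
    # Remember the resource in effect before the test and restore it after,
    # so tests that call set_current_device_resource don't leak state.
    mr = rmm.mr.get_current_device_resource()
    yield
    rmm.mr.set_current_device_resource(mr)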

Comment on lines 929 to 930
except RuntimeError:
    pass
Contributor:

What's the except/pass here for? Can we call change_current_allocator in an else: block from the first try:?

try:
    from torch.cuda.memory import change_current_allocator
except ImportError:
    pytest.skip("pytorch pluggable allocator not available")
else:
    change_current_allocator(rmm.rmm_torch_allocator)

Alternatively can we use pytest.importorskip?

@shwina (Contributor Author):

Ah I think it's because a RuntimeError is raised if you set the torch allocator twice in the same session. Maybe we should use a session scoped fixture instead?
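
A session-scoped fixture along those lines might look like this (a sketch; the fixture name is made up, and it assumes the allocator can only be set once per process):

import pytest
import rmm

@pytest.fixture(scope="session", autouse=True)
def set_torch_allocator():
    # change_current_allocator raises RuntimeError if called twice in a
    # session, so install the RMM allocator exactly once.
    memory = pytest.importorskip("torch.cuda.memory")
    if not hasattr(memory, "change_current_allocator"):
        pytest.skip("pytorch pluggable allocator not available")
    memory.change_current_allocator(rmm.rmm_torch_allocator)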

@harrism (Member) commented Dec 13, 2022:

I didn't see any significant documentation of the new functionality added in this PR. Do you plan to add more?

@shwina (Contributor Author) commented Dec 13, 2022:

Thanks @harrism -- I added a blurb in our README.md, which is where we also have documentation on how to use RMM + Numba or RMM + CuPy.

@codecov-commenter commented Dec 13, 2022:

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@c5c02fc).
Patch has no changes to coverable lines.

Additional details and impacted files
@@              Coverage Diff               @@
##             branch-23.02   #1168   +/-   ##
==============================================
  Coverage                ?   0.00%           
==============================================
  Files                   ?       6           
  Lines                   ?     421           
  Branches                ?       0           
==============================================
  Hits                    ?       0           
  Misses                  ?     421           
  Partials                ?       0           



README.md (review thread, resolved)
@harrism (Member) left a comment:

New docs show the power of RMM to bridge the memory needs of RAPIDS, PyTorch (and CuPy, and Numba). Nice!

@jakirkham (Member):

It's really great seeing this is now possible! 🎉

@VibhuJawa (Member):

Would it be possible to merge this soon? I want to start testing this right away for cugraph_dgl.

@vyasr (Contributor) commented Dec 19, 2022:

Planning to review this later today.

…idsai#1170)

Closes rapidsai#1169.

Essentially, we are running into the situation described in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#disabling-cycle-breaking-tp-clear with `UpstreamResourceAdaptor`.

The solution is to prevent clearing of `UpstreamResourceAdaptor` objects by decorating them with `no_gc_clear`.

Cython calls out the following:

> If you use no_gc_clear, it is important that any given reference cycle contains at least one object without no_gc_clear. Otherwise, the cycle cannot be broken, which is a memory leak.

The other object in RMM that we mark `@no_gc_clear` is `DeviceBuffer`, and a `DeviceBuffer` can keep a reference to an `UpstreamResourceAdaptor`. But, an `UpstreamResourceAdaptor` cannot keep a reference to a `DeviceBuffer`, so instances of the two cannot form a reference cycle AFAICT.

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#1170
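
The decoration described in that commit looks roughly like this (a sketch following the linked Cython documentation, not the literal diff; it assumes DeviceMemoryResource is defined or cimported in the same module):

cimport cython

@cython.no_gc_clear
cdef class UpstreamResourceAdaptor(DeviceMemoryResource):
    # With tp_clear suppressed, the upstream reference is never cleared out
    # from under other objects during garbage collection.
    cdef readonly DeviceMemoryResource upstream_mr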
@shwina shwina requested a review from a team as a code owner December 19, 2022 20:16
@shwina shwina force-pushed the add-torch-allocator branch from cde2d8a to 741a1df on December 19, 2022 20:25
@ajschmidt8 ajschmidt8 removed the request for review from a team December 19, 2022 22:13
@ajschmidt8 (Member):

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@vyasr (Contributor) commented Dec 20, 2022:

I spent most of my time with this PR today helping Ashwin troubleshoot the test failures and didn't get around to reviewing. Probably won't get back to this until Wednesday or so.



cdef public void* allocate(
    ssize_t size, int device, void* stream
Member:

Q: device is ignored by design? (I was reviewing cupy/cupy#7210 and noticed this.)

@shwina (Contributor Author) commented Dec 21, 2022:

Ah, great catch! This brings out a subtle problem:

In RMM, each device has its own memory resource. Thus, to do the allocation on a specified device with RMM, I would write the torch allocate function like this:

cdef public void* allocate(ssize_t size, int device, void* stream) except * with gil:
    cdef device_memory_resource* mr = get_per_device_resource(device)
    return mr[0].allocate(size, <cudaStream_t> stream)

Unfortunately, the deallocation function does not accept a device argument, so we cannot retrieve the memory resource that was used for allocation:

void deallocate(void* ptr, ssize_t size, void* stream)

I don't really see a way around this other than for the deallocate signature to include the device argument. cc: @emcastillo
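
For illustration, the device-aware deallocate that would be needed looks like this (hypothetical; PyTorch's current hook signature does not pass device, so this is not valid today):

cdef public void deallocate(
    void* ptr, ssize_t size, int device, void* stream
) except * with gil:
    # Hypothetical signature: `device` is not part of PyTorch's current
    # deallocation hook; this mirrors the allocate() sketch above in principle.
    cdef device_memory_resource* mr = get_per_device_resource(device)
    mr[0].deallocate(ptr, size, <cudaStream_t> stream)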

Member:

I would love to know too. TBH I am puzzled by PyTorch's (long-time) behavior of asking for device. It should just honor the current device...

@shwina (Contributor Author):

+1 to that

@shwina (Contributor Author):

I'll submit a follow-up PR adding support for device, once pytorch/pytorch#91398 is merged.

@jakirkham jakirkham requested a review from bdice January 4, 2023 19:19
@bdice (Contributor) left a comment:

One suggestion, otherwise LGTM!

Comment on lines +10 to +13
try:
    from torch.cuda.memory import change_current_allocator
except ImportError:
    pytest.skip("pytorch pluggable allocator not available")
Contributor:

There's a pytest utility for this if you want to use it: pytest.importorskip.

Member:

Does that handle importing specific functions?

This is using importorskip for torch generally above. The torch.cuda.memory module has been around for a while. Though the functionality we need from it is pretty new.

Maybe in the future this could require a specific PyTorch version. There doesn't seem to be one yet that has what we need though.

@bdice (Contributor) commented Jan 4, 2023:

No, pytest.importorskip only handles modules and you have to use attribute accessors to get functions. It's kind of messy. The current solution is probably easier to read, let's keep it as-is.
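
For comparison, the importorskip-based variant would be roughly this (a sketch showing the attribute access in question):

import pytest

torch_memory = pytest.importorskip("torch.cuda.memory")
change_current_allocator = getattr(
    torch_memory, "change_current_allocator", None
)
if change_current_allocator is None:
    pytest.skip("pytorch pluggable allocator not available")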

@vyasr (Contributor) left a comment:

Couple of minor questions, this is awesome!

Comment on lines 9 to 17
) except * with gil:
    cdef device_memory_resource* mr = get_current_device_resource()
    return mr[0].allocate(size, <cudaStream_t> stream)

cdef public void deallocate(
    void* ptr, ssize_t size, void* stream
) except * with gil:
    cdef device_memory_resource* mr = get_current_device_resource()
    mr[0].deallocate(ptr, size, <cudaStream_t> stream)
Contributor:

Why are these GIL-requiring functions? It seems like it's all pure C code here, no Python objects, etc.

Contributor:

To be clear, if they're getting called from GIL-requiring code in PyTorch, so be it. I just don't see a reason that these functions need to explicitly acquire the GIL. If PyT can call these in a nogil context, is there a reason for us not to allow that?

@shwina (Contributor Author):

Because the allocate() and deallocate() methods can involve Python operations on Python objects, e.g., in CallbackMemoryResource or FailureCallbackResourceAdaptor.

Contributor:

I was thinking that would automatically be handled when those callbacks are invoked, since the callbacks are stored as Python objects in the class, so any interaction with them should reacquire the GIL already, right? I guess the potential issue is that we cast these to void * pointers before passing them to the C++ classes, so at the point of the call we've lost Cython's safety net. Is that right? If so, we should consider (out of scope for this PR of course) inserting the necessary Python C API calls into the relevant rmm C++ classes, i.e. in failure_callback_resource_adaptor::do_allocate.

Contributor:

That was my expectation as well. If the callback touches Python objects, shouldn't it be the responsibility of the callback to acquire/release the GIL?

Contributor:

My proposal above was wrong, there's no reason to embed this information in librmm where the callbacks could be anything (not necessarily Python objects). However, it should be the responsibility of the callbacks in rmm's Cython code to acquire the GIL as needed, and we do appear to do this correctly already. The _oom_callback_function used by the FailureCallbackResourceAdaptor acquires the GIL before calling the user-provided callback, as do both the allocate and deallocate callbacks used by the CallbackMemoryResource.

@shwina (Contributor Author):

Yup -- the GIL should neither be released in C++, nor can it be released in Python. The Cython "wrapper" functions are what need to take on the responsibility of handling the GIL.
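
The pattern being described looks roughly like this (an illustrative sketch; the signature and context-passing details are assumptions, not rmm's actual wrapper code):

from libc.stdint cimport uintptr_t

cdef void* _allocate_callback(size_t nbytes, void* ctx) with gil:
    # The C++ resource stores the Python callable as an opaque pointer;
    # the Cython wrapper reacquires the GIL before touching it.
    callback = <object> ctx
    return <void*> <uintptr_t> callback(nbytes)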

@shwina (Contributor Author):

Vyas and I did a bit more exploration of exactly why we need a with gil here and ended up quite deep in the CPython and Cython internals (still without a clear answer though).

The symptom though is clear. Take the example from the PR description above,
raise an error in allocate_func, and remove the with gil: you'll see that the error is uncaught and eventually this segfaults.

Resolved review threads on python/rmm/_lib/memory_resource.pxd and python/rmm/tests/test_rmm_pytorch.py.
    from torch.cuda.memory import change_current_allocator
except ImportError:
    pytest.skip("pytorch pluggable allocator not available")
change_current_allocator(rmm.rmm_torch_allocator)
Contributor:

How does this function behave if passed None (the case where the torch allocator hasn't been defined)?

@shwina (Contributor Author):

Hmm - I wouldn't expect it to be None if we were able to import change_current_allocator, since the existence of change_current_allocator implies that rmm_torch_allocator was defined (although somewhat implicitly: change_current_allocator and CUDAPluggableAllocator in PyTorch were introduced together).

Should we also skip this test if rmm.rmm_torch_allocator is None for some reason?

Contributor:

Ah OK I wasn't sure about that, I thought they weren't introduced entirely concurrently. Up to you on the extra skip, it sounds like it would be pedantically correct but not practically necessary.

try:
    from torch.cuda.memory import CUDAPluggableAllocator
except ImportError:
    rmm_torch_allocator = None
Contributor:

Could this bite a user that accesses this attribute and passes it around thinking that it's fine when it's really just None? It might be safer to override __getattr__ for the module and have it raise an error to prevent the user from accessing this attribute when CUDAPluggableAllocator failed to import.

@shwina (Contributor Author):

Could we alternatively pass on ImportError to achieve the same effect as defining that module __getattr__?

Contributor:

I think you'd get close to the same effect, just a slightly less user-friendly version. With a __getattr__ override you could provide a more friendly error message indicating that this happened because the torch allocator failed to import, whereas if you just avoid defining it the user will see an AttributeError without any additional diagnostics and may think it's a bug in rmm.

It's a very minor point though, I'm fine leaving this as is for now and only revisiting in the future if we get a lot of user questions about why the allocator is None.
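
For completeness, the module-level __getattr__ being discussed (PEP 562) would look roughly like this; the error message wording is illustrative only:

def __getattr__(name):
    if name == "rmm_torch_allocator":
        raise AttributeError(
            "rmm_torch_allocator is unavailable because "
            "torch.cuda.memory.CUDAPluggableAllocator could not be imported"
        )
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")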

README.md (review thread, resolved)
@vyasr (Contributor) left a comment:

One outstanding question regarding the GIL, but it's not a blocker. We can address it later if you agree that it's an issue since it's not strictly related to the PyT allocator.

@shwina (Contributor Author) commented Jan 5, 2023:

/merge

Labels
CMake, feature request (New feature or request), non-breaking (Non-breaking change), Python (Related to RMM Python API)
Development

Successfully merging this pull request may close these issues.

Test PyTorch's Pluggable CUDA allocator with RMM
9 participants