CUDA `<memory_resource>` Extension Proposal #967
There's a usability issue with the current design of `memory_resource`. Let's say I have a function that expects a resource that can allocate memory accessible from the device. For the sake of argument, let's say I don't care if it's managed, regular device memory, or pinned. All I care about is being accessible from the device. I might write my function like:
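Something along these lines, sketched against the proposed kind-parameterized interface (the `std::pmr`-style `allocate`/`deallocate` signatures are assumptions):

```cpp
// Forced to commit to a single memory_kind, even though any
// device-accessible kind (device, pinned, managed) would do.
void foo(memory_resource<memory_kind::device> *mr) {
    void *data = mr->allocate(1024);
    // ... launch kernels that access `data` ...
    mr->deallocate(data, 1024);
}
```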
But let's say I had a managed memory resource like:
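For instance (a sketch; allocation details elided):

```cpp
// Allocates managed (unified) memory, which is accessible from the device.
struct managed_resource : memory_resource<memory_kind::managed> {
    // ... allocate/deallocate via cudaMallocManaged/cudaFree ...
};
```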
I wouldn't be able to pass a pointer to an instance of `managed_resource` to `foo`, even though managed memory is accessible from the device. Furthermore, it wouldn't be appropriate to have `managed_resource` derive from the device-kind resource type just to make the call compile. The final alternative might be to make it a template (or 3 explicit overloads):
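A sketch of that approach, with `is_device_accessible` as a hypothetical trait:

```cpp
#include <type_traits>

// Hypothetical trait mapping a memory_kind to device accessibility.
template <memory_kind kind>
struct is_device_accessible : std::false_type {};
template <> struct is_device_accessible<memory_kind::device>  : std::true_type {};
template <> struct is_device_accessible<memory_kind::pinned>  : std::true_type {};
template <> struct is_device_accessible<memory_kind::managed> : std::true_type {};

// Accepts any resource whose kind is device accessible.
template <memory_kind kind,
          typename = std::enable_if_t<is_device_accessible<kind>::value>>
void foo(memory_resource<kind> *mr);
```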
This works, but it forces a template (or an overload set) onto every consumer that only cares about device accessibility. The current design is too rigid. As @harrism put it, the author of a memory resource wants to be able to specify the capabilities of the memory it provides, rather than pin every consumer to one specific kind.
An alternative option is to add something I call a `resource_view`, parameterized by a set of memory capabilities:

```cpp
#include <type_traits>

// Bitmask of orthogonal memory capabilities.
enum class memory_caps : unsigned {
    host_accessible         = 0x01,
    device_accessible       = 0x02,
    cross_device_accessible = 0x04,
    oversubscribable        = 0x08,
    resident                = 0x10,
    has_host_storage        = 0x20,
    has_device_storage      = 0x40,
};

constexpr memory_caps operator|(memory_caps a, memory_caps b) {
    return (memory_caps)((unsigned)a | (unsigned)b);
}

constexpr memory_caps operator&(memory_caps a, memory_caps b) {
    return (memory_caps)((unsigned)a & (unsigned)b);
}
enum class memory_kind {
    host,
    device,
    pinned,
    managed
};

// Maps a memory_kind to the capabilities it provides.
template <memory_kind kind>
struct memory_kind_caps;

// Type-erased base so views can hold any resource regardless of kind.
struct base_resource {
    virtual ~base_resource() = default;
    virtual void do_my_job() = 0;
};

template <memory_caps>
struct resource_view;

template <memory_kind _kind>
struct memory_resource : private base_resource {
    static constexpr memory_kind kind = _kind;

    void my_job() { do_my_job(); }

    // Views need access to the private base_resource base.
    template <memory_caps view_caps>
    friend struct resource_view;
};
template <typename resource>
struct memory_resource_traits {
    static constexpr memory_caps caps = memory_kind_caps<resource::kind>::value;
    static constexpr memory_kind kind = resource::kind;
};

// A non-owning view of any resource whose kind provides at least `caps`.
template <memory_caps caps>
struct resource_view {
    // Construct from a concrete resource whose capabilities are a superset of `caps`.
    template <memory_kind kind,
              typename = std::enable_if_t<
                  (caps & memory_resource_traits<memory_resource<kind>>::caps) == caps>>
    resource_view(memory_resource<kind> *resource) {
        this->resource = resource;
    }

    // Converting constructor: a view with more capabilities implicitly
    // converts to one with fewer.
    template <memory_caps other_caps,
              typename = std::enable_if_t<(other_caps & caps) == caps>>
    resource_view(resource_view<other_caps> other) {
        this->resource = other.resource;
    }

    void my_job() {
        resource->do_my_job();
    }

private:
    // Grant other specializations access to `resource` for the converting constructor.
    template <memory_caps>
    friend struct resource_view;

    base_resource *resource;
};
template <>
struct memory_kind_caps<memory_kind::host> : std::integral_constant<
    memory_caps,
    memory_caps::host_accessible |
    memory_caps::oversubscribable |
    memory_caps::has_host_storage> {};

template <>
struct memory_kind_caps<memory_kind::pinned> : std::integral_constant<
    memory_caps,
    memory_caps::host_accessible |
    memory_caps::device_accessible |
    memory_caps::resident |
    memory_caps::has_host_storage> {};

template <>
struct memory_kind_caps<memory_kind::device> : std::integral_constant<
    memory_caps,
    memory_caps::device_accessible |
    memory_caps::resident |
    memory_caps::has_device_storage> {};

template <>
struct memory_kind_caps<memory_kind::managed> : std::integral_constant<
    memory_caps,
    memory_caps::host_accessible |
    memory_caps::device_accessible |
    memory_caps::oversubscribable |
    memory_caps::has_host_storage |
    memory_caps::has_device_storage> {};
```

This structure allows client code which cares about a specific combination of capabilities to declare this by expecting a `resource_view` with those capabilities.

Usage example:

```cpp
#include <iostream>

struct managed_resource : memory_resource<memory_kind::managed> {
private:
    void do_my_job() override { std::cout << "Managed" << std::endl; }
};

struct device_resource : memory_resource<memory_kind::device> {
private:
    void do_my_job() override { std::cout << "Device" << std::endl; }
};

// Requires host-accessible memory with host storage.
void foo(resource_view<memory_caps::host_accessible | memory_caps::has_host_storage> view) {
    view.my_job();
}

// Requires device-accessible, oversubscribable memory.
void goo(resource_view<memory_caps::device_accessible | memory_caps::oversubscribable> view) {
    view.my_job();
}

void boo(resource_view<memory_caps::device_accessible> view) {
    view.my_job();
}

// A view converts implicitly to a view with a subset of its capabilities.
void loo(resource_view<memory_caps::device_accessible | memory_caps::oversubscribable> view) {
    boo(view);
}

int main() {
    managed_resource mr;
    device_resource dr;
    mr.my_job();
    foo(&mr);
    goo(&mr);
    loo(&mr);
    //foo(&dr); // compilation error: device memory is not host accessible
    //goo(&dr); // compilation error: device memory is not oversubscribable
    return 0;
}
```
FYI, this has become incredibly out of date with the current design. I will be working on updating the design document.
Note: This proposal is defunct. See the evolved design that was merged here: NVIDIA/libcudacxx#309
Note: This proposal is defunct. See the evolved design that was merged here: NVIDIA/libcudacxx#309

CUDA `<memory_resource>` Extension Proposal (DEFUNCT)

PR Tracker

- `cuda::` extension of `<memory_resource>` for stream ordered allocation libcudacxx#105
- Implement `memory_resource`, `stream_ordered_memory_resource`, `stream_view`, and `memory_kind`
- `cuda::memory_resource`
- `cuda::stream_ordered_memory_resource`
Motivation
Performance-sensitive applications that make frequent dynamic memory allocations often find allocating memory to be a significant overhead. CUDA developers are even more acutely aware of the costs of dynamic allocation due to the relatively higher cost of `cudaMalloc`/`cudaFree` compared to standard `malloc`/`free`. As a result, developers devise custom, high-performance memory allocators as optimized as the application the allocator serves. However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations. Interoperation among these applications is difficult without an interface to enable sharing a common allocator.

In Standard C++, `Allocator` has traditionally provided this common interface. C++17 introduced `<memory_resource>` and the `std::pmr::memory_resource` abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`. This optionally polymorphic interface provides a standard way to define, expose and share custom memory allocation.

However, the `std::pmr::memory_resource` interface is insufficient to capture the unique features of the CUDA C++ programming model. For example, Standard C++ recognizes only a single, universally accessible memory space, whereas CUDA C++ applications may access at least four different kinds of dynamically allocated memory. Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of `cudaMallocAsync` and `cudaFreeAsync`[1], which provide stream-ordered memory allocation and deallocation. Therefore, there is a need for a common allocator interface similar to `std::pmr::memory_resource` that accounts for the unique features of CUDA C++.

[1]: Note that `cudaMallocAsync` obviates neither the need for custom, CUDA-aware allocators, nor the need for a common allocation interface. There will never be one allocator that satisfies all users. Furthermore, a common interface allows composing and layering utilities like logging, limiting, leak checking, and tracking.

Description
We propose extending `<memory_resource>` to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.

We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:

- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. An allocator interface with a common look and feel will ease working between Standard and CUDA C++. For more information on `<memory_resource>`, see here and here.
- The RAPIDS Memory Management library has had three years of success using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
- Likewise, Thrust has had similar success with its `thrust::mr::memory_resource` interface.

Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, `<memory_resource>` is the logical choice.

This proposal includes the addition of the following to libcu++:
`cuda::memory_kind`

A scoped enumeration demarcating the different kinds of dynamically allocated CUDA memory. This is intended to be similar to the existing `thread_scope` enum.
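A sketch of the shape this might take (the exact enumerators are illustrative, mirroring the concrete resources listed below):

```cpp
namespace cuda {
// Illustrative enumerators for the kinds of dynamically allocated CUDA memory.
enum class memory_kind {
    host,    // pageable host memory (operator new, malloc)
    pinned,  // page-locked host memory, accessible from the device
    device,  // device memory, accessible only from the device
    unified, // managed/unified memory, accessible from host and device
};
}
```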
`cuda::stream_view`

Similar to `std::span` or `std::string_view`, `cuda::stream_view` is a strongly typed, non-owning view type for `cudaStream_t`. This type provides a more type-safe C++ wrapper around `cudaStream_t` and serves as the input argument type for any libcu++ API that takes a CUDA stream.
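A minimal sketch of such a wrapper (member names are assumptions):

```cpp
#include <cuda_runtime_api.h>

namespace cuda {
// Non-owning, strongly typed view of a CUDA stream.
class stream_view {
public:
    constexpr stream_view() noexcept = default;
    constexpr stream_view(cudaStream_t stream) noexcept : stream_{stream} {}

    // Underlying handle, for calls into the CUDA runtime.
    constexpr cudaStream_t get() const noexcept { return stream_; }

private:
    cudaStream_t stream_{};  // default: the default (NULL) stream
};
}
```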
`cuda::memory_resource`

The `cuda::memory_resource` class template is the abstract base class interface akin to `std::pmr::memory_resource`, with two main differences:

- The `Kind` template parameter determines the `memory_kind` allocated by the resource.
- The `Context` template parameter determines the "execution context" in which memory allocated by the resource can be accessed without synchronization. By default, the `Context` is the `any_context` tag type that indicates storage may be accessed immediately on any thread or CUDA stream without synchronization.

The purpose of the `Context` template parameter is to allow for more generic allocation semantics. For example, consider a "stream-bound" memory resource where allocated memory may only be accessed without synchronization on a particular stream bound at construction, as sketched below:
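A sketch of such a resource, assuming a `std::pmr`-style `do_allocate`/`do_deallocate` customization point and an illustrative `stream_context` tag:

```cpp
#include <cstddef>
#include <cuda_runtime_api.h>

namespace cuda {
// Illustrative context tag: unsynchronized access only on one bound stream.
struct stream_context {};

class stream_bound_resource final
    : public memory_resource<memory_kind::device, stream_context> {
public:
    explicit stream_bound_resource(stream_view stream) noexcept : stream_{stream} {}

    // The stream on which allocations may be accessed without synchronization.
    stream_view stream() const noexcept { return stream_; }

private:
    void *do_allocate(std::size_t bytes, std::size_t /*alignment*/) override {
        void *p = nullptr;
        cudaMallocAsync(&p, bytes, stream_.get());  // ordered on the bound stream
        return p;
    }
    void do_deallocate(void *p, std::size_t /*bytes*/, std::size_t /*alignment*/) override {
        cudaFreeAsync(p, stream_.get());
    }

    stream_view stream_;
};
}
```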
`cuda::pmr_adaptor`

`cuda::memory_resource` is similar to `std::pmr::memory_resource`, but they do not share a common inheritance hierarchy. Therefore, an object that derives from `cuda::memory_resource` cannot be used polymorphically as a `std::pmr::memory_resource`; i.e., a `cuda::memory_resource`-derived type cannot be passed to a function that expects a `std::pmr::memory_resource` pointer or reference. However, there may be situations where one wishes to use a `cuda::memory_resource`-derived type as if it were a `std::pmr::memory_resource`-derived type. The `cuda::pmr_adaptor` class is intended to provide this functionality by inheriting from `std::pmr::memory_resource` and adapting an appropriate `cuda::memory_resource`.
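A sketch of the adaptor, assuming the adapted resource exposes `std::pmr`-style `allocate`/`deallocate`:

```cpp
#include <cstddef>
#include <memory_resource>

namespace cuda {
template <typename Resource>  // e.g., a cuda::memory_resource<memory_kind::host> derivative
class pmr_adaptor final : public std::pmr::memory_resource {
public:
    explicit pmr_adaptor(Resource *resource) noexcept : resource_{resource} {}

private:
    void *do_allocate(std::size_t bytes, std::size_t alignment) override {
        return resource_->allocate(bytes, alignment);
    }
    void do_deallocate(void *p, std::size_t bytes, std::size_t alignment) override {
        resource_->deallocate(p, bytes, alignment);
    }
    bool do_is_equal(const std::pmr::memory_resource &other) const noexcept override {
        return this == &other;  // identity comparison; a real adaptor could compare resources
    }

    Resource *resource_;  // non-owning
};
}
```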
`cuda::stream_ordered_memory_resource`

The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for stream-ordered memory allocation. This is similar to `cuda::memory_resource`, but `allocate_async` and `deallocate_async` both take a stream argument and follow stream-ordered memory allocation semantics as defined by `cudaMallocAsync`.
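A sketch of the shape this interface might take (parameter order and customization points are assumptions):

```cpp
#include <cstddef>

namespace cuda {
template <memory_kind Kind>
class stream_ordered_memory_resource {
public:
    virtual ~stream_ordered_memory_resource() = default;

    // Allocation is ordered on `stream`: the memory may be used without
    // synchronization only by work that is stream-ordered after this call.
    void *allocate_async(std::size_t bytes, std::size_t alignment, stream_view stream) {
        return do_allocate_async(bytes, alignment, stream);
    }
    // Deallocation is likewise ordered on `stream`, as with cudaFreeAsync.
    void deallocate_async(void *p, std::size_t bytes, std::size_t alignment, stream_view stream) {
        do_deallocate_async(p, bytes, alignment, stream);
    }

private:
    virtual void *do_allocate_async(std::size_t, std::size_t, stream_view) = 0;
    virtual void do_deallocate_async(void *, std::size_t, std::size_t, stream_view) = 0;
};
}
```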
Concrete Resource Implementations

Just as `<memory_resource>` provides concrete, derived implementations of `std::pmr::memory_resource`, libcu++ will provide the following:

- `cuda::new_delete_resource : public cuda::memory_resource<memory_kind::host>`: uses `::operator new()`/`::operator delete()` for allocating host memory
- `cuda::device_resource : public cuda::memory_resource<memory_kind::device>`: uses `cudaMalloc`/`cudaFree` for allocating device memory
- `cuda::unified_resource : public cuda::memory_resource<memory_kind::unified>`: uses `cudaMallocManaged`/`cudaFree` for unified memory
- `cuda::pinned_resource : public cuda::memory_resource<memory_kind::pinned>`: uses `cudaMallocHost`/`cudaFreeHost` for page-locked host memory
- `cuda::async_device_resource : public cuda::stream_ordered_memory_resource<memory_kind::device>`: uses `cudaMallocAsync`/`cudaFreeAsync` for device memory

Other resource implementations may be added as deemed appropriate.
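For example, hypothetical usage through the common interface (the `allocate`/`deallocate` signatures are assumptions):

```cpp
#include <cstddef>

void example() {
    // Allocate page-locked host memory through the polymorphic interface.
    cuda::pinned_resource mr;
    void *p = mr.allocate(256, alignof(std::max_align_t));
    // ... fill from the host, read from a kernel ...
    mr.deallocate(p, 256, alignof(std::max_align_t));
}
```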
`cuda::` Namespace Policy

The current policy of libcu++ is that everything in the `cuda::` namespace must be heterogeneous, i.e., `__host__ __device__`. The facilities described above in `<cuda/memory_resource>` are intended to be host-only at this time. Therefore, we propose to modify the policy to allow host-only constructs in `cuda::`. Device-only constructs will still be disallowed in `cuda::`; any device-only construct would go into `cuda::device::`.

Future Work
Future work will include the design of allocators similar to `std::pmr::polymorphic_allocator` to work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource`. Likewise, containers that work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource` will be future work.