abstract the memory access inside kernels #38
Reviving this ancient issue.
I agree! SYCL introduced memory accessors which hide the pointers quite nicely and provide a clean API for memory access. Maybe we can adapt this concept. CC @bernhardmgruber, this might tie into his work.
I've been giving this some thought since this is (kind of...) a requirement for the SYCL backend. Here are some ideas I'd like your opinions on:
```c++
enum class viewMode
{
ReadOnly,
WriteOnly,
ReadWrite
};
enum class viewTarget
{
HostMemory,
GlobalMemory,
ConstantMemory,
SharedMemory
};
template <typename TElem, typename TDim, typename TIdx, alpaka::viewMode mode, alpaka::viewTarget target>
class view
{
using value_type = /* TElem or TElem const */;
using reference = /* TElem& or TElem const& */;
using pointer = /* Backend-defined */; // and const_pointer
using iterator = /* Backend-defined */; // and const_iterator
using reverse_iterator = /* Backend-defined */; // and const_reverse_iterator
/* Constructors, copy / move operators, destructor */
reference operator[](/* ... */);
Vec<TDim> get_extent();
pointer get_pointer(); // if you really need a raw pointer
auto get_byte_distance(); // = pitch
/* more utility functions */
};
```

What do you think?
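For context, kernel-side usage might then look roughly like this. This is purely my illustration of the sketch above, not an agreed interface; the kernel functor, `alpaka::DimInt` and `alpaka::getIdx` spellings are assumptions:

```c++
// Hypothetical kernel written against the proposed view type instead of raw
// pointers. ReadOnly views would expose const references internally.
struct VectorAddKernel
{
    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const& acc,
        alpaka::view<float, alpaka::DimInt<1u>, std::size_t,
                     alpaka::viewMode::ReadOnly, alpaka::viewTarget::GlobalMemory> a,
        alpaka::view<float, alpaka::DimInt<1u>, std::size_t,
                     alpaka::viewMode::ReadOnly, alpaka::viewTarget::GlobalMemory> b,
        alpaka::view<float, alpaka::DimInt<1u>, std::size_t,
                     alpaka::viewMode::ReadWrite, alpaka::viewTarget::GlobalMemory> c) const
    {
        auto const i = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        if(i < c.get_extent()[0])
            c[i] = a[i] + b[i]; // operator[] hides the raw pointer
    }
};
```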
@bernhardmgruber, this is probably a critical interface question that needs careful discussion. LLAMA should be able to hook into this seamlessly, while Alpaka should work comfortably without the need for LLAMA here.
What is the rationale for renaming the function?
That overload should just take a range. And I think it is worthwhile to distinguish between the iterator concepts: input, forward, random-access, and contiguous (new in C++20) iterators.
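For example, the distinction pays off when choosing a copy strategy. A sketch (not alpaka code, just C++20 standard library) of a hypothetical `copyToBuffer` helper:

```c++
#include <cstring>
#include <iterator>
#include <memory>
#include <type_traits>

// Sketch: pick a bulk byte copy when the source is a contiguous range of the
// same trivially copyable type; otherwise fall back to element-wise copying.
template <std::input_iterator It, typename T>
void copyToBuffer(It first, It last, T* dst)
{
    if constexpr(std::contiguous_iterator<It>
                 && std::is_same_v<std::iter_value_t<It>, T>
                 && std::is_trivially_copyable_v<T>)
    {
        std::memcpy(dst, std::to_address(first), (last - first) * sizeof(T));
    }
    else
    {
        for(; first != last; ++first, ++dst)
            *dst = *first;
    }
}
```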
This overload is a special case and might deserve its own, differently named function, e.g.
I think we need to be careful to not reinvent the wheel. Some thought has been poured into viewing memory or parts of it. That's why C++17 got `std::string_view` and C++20 `std::span`.
I think we only need one API to create a sub buffer. So either 1.4 or 2.
What is the benefit of this new API? I can call
I think a first step might be to implement the automatic conversion of alpaka buffers on the host side into pointers at the kernel function interface inside `createTaskKernel`.
Does that fully replace shared variables of statically known size? I like alpaka's
Be careful, static shared memory (that is with compile time known size) might offer better optimization opportunities. I would not fully get rid of this feature.
I know these modes from OpenCL. I guess they are in SYCL as well? Because we can express ReadOnly and ReadWrite easily with `const`.
Let's light up the bomb: how does unified memory fit into this picture? I think unified memory is getting increasingly relevant. Also because it is the default model for
Yes, we really need the pointer! More on that later.
I do not like the name. This function also only makes sense for 2D buffers. So maybe conditionally provide it? What about 3D buffers?
Some general thoughts which we also discussed offline already: I think there are two concerns involved: 1. providing storage, and 2. interpreting that storage.
There are various ways to implement 1 and 2. Why does this matter? Because there needs to be an interface between 1 and 2: a way to communicate storage created by 1 to a facility for interpretation 2. Surprise, surprise, the easiest such interface is a pointer and a length. And this is such a universal protocol, because if I can extract a pointer out of a buffer, I can wrap a view around it. LLAMA goes one little step further, because it allows creating data structures across multiple buffers. But fundamentally, a LLAMA view is built on top of a statically sized array of storage regions. These storage regions are untyped, i.e. spans of `std::byte`. Example:

```c++
void kernelFunc(std::byte* data, int width, int height) {
    auto mapping = ...; // configure the data structure
    llama::View view(mapping, {data});
    // access
    float v = view(x, y)(Tag1{}, Tag2{});
}
```
Because there is not necessarily an allocation taking place: Both sub-buffer creation and taking ownership of host memory don't involve any allocation.
You mean like
What would be the benefit here? We don't really care about the original host container, this is just for buffer initialization.
SYCL can do this, as do the CPU backends. CUDA seems to be the exception here, unless the host pointer was allocated with `cudaMallocHost`.
I wasn't aware of
I agree. I'm leaning towards
Again, I wasn't aware of
I'm open for alternative names. I was mainly basing this on SYCL accessors where
I believe the interface is easier to use if we use
Yes. Reason: It is impossible to implement this (in reasonable time) for the SYCL backend.
I like this.
I'm not certain that we support unified memory in alpaka or plan to do so as this goes against our "everything explicit" policy.
It is the same for 3D buffers (since 3D buffers are just a stack of 2D buffers). Maybe use
As discussed offline: The pointer interface only works easily if the chunk of raw memory is actually contiguous. This assumption fails as soon as 2D/3D memory on GPUs is involved (which is why we need the row distance / pitch). Now you can also introduce FPGAs where you can reconfigure your elements (1,2,3,4,5,6,7,8) to live in four different memory blocks in the order of (1,3) (2,4) (5,7) (6,8).
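To make the pitch requirement concrete, here is the usual addressing math for pitched 2D memory. This is a minimal sketch, not alpaka API; `pitchBytes` stands for whatever the backend's pitch query returns:

```c++
#include <cstddef>

// Compute the address of element (row, col) in a pitched 2D allocation.
// pitchBytes is the byte distance between consecutive rows and may be
// larger than cols * sizeof(T) because of padding.
template <typename T>
T* elementPtr(std::byte* base, std::size_t pitchBytes, std::size_t row, std::size_t col)
{
    return reinterpret_cast<T*>(base + row * pitchBytes) + col;
}
```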
This looks very nice and I definitely see a common meta-language here we need to flesh out.
No. I mean
Well, you are partially right. What matters is whether you need to copy element-wise or whether you can just bulk-copy the bits. I think we should just ignore the iterator concept and default to
So if I interpret this correctly, an alpaka program that uses host pointer adoption can either not be run using CUDA or needs to do an explicit copy of the host pointer's memory. Honestly, I think we should skip the host pointer version for now. If I want to initialize my buffer from an existing memory region, I can just call overload 1.2 with the iterators/range.
Have a look, it might influence your design. But it is probably not the full solution if your view still needs to govern address spaces.
The interface is definitely more bloated. This is what I am afraid of. Here is the vectorAdd alpaka example.

Now:
```c++
auto const taskKernel(alpaka::createTaskKernel<Acc>(
    workDiv,
    kernel,
    alpaka::getPtrNative(bufAccA),
    alpaka::getPtrNative(bufAccB),
    alpaka::getPtrNative(bufAccC),
    numElements));
```
With your proposal:
```c++
auto const taskKernel(alpaka::createTaskKernel<Acc>(
    workDiv,
    kernel,
    alpaka::require(bufAccA),
    alpaka::require(bufAccB),
    alpaka::require(bufAccC),
    numElements));
```
With my proposed implicit recognition of buffers:
```c++
auto const taskKernel(alpaka::createTaskKernel<Acc>(
    workDiv,
    kernel,
    bufAccA,
    bufAccB,
    bufAccC,
    numElements));
```
Regarding comprehensibility: OpenCL has cl::Buffers on the host side and pointers at the kernel interface. That usually does not confuse people ;)
That is an opinion and I am of the opposite one, but not strongly.
Unified memory has different performance characteristics. It can be much slower or much faster than the traditional device-side buffers, depending on how much of the memory is touched by a kernel. So it is not a question of everything being explicit or implicit. It is a question of whether alpaka wants to support that. But if we are going to redesign how buffers work, we should at least think about this question and about if and how we want to address unified memory.
But doesn't a 3D buffer have 2 pitches? I think I can live with a pitch of 0 for 1D buffers. 0D buffers probably do not occur that often ;)
AFAIK 2D/3D GPU buffers are still contiguous. They can just contain additional padding. So a pointer is still fine ;)
Thinking about it, LLAMA could probably also just work with SYCL accessors:

```c++
void kernelFunc(sycl::accessor<std::byte, 1, sycl::access::mode::read_write, sycl::access::target::global_buffer> data, int width, int height) {
    auto mapping = ...; // configure the data structure
    llama::View view(mapping, {data});
    // access
    float v = view(x, y)(Tag1{}, Tag2{});
}
```
I see no reason why that should not compile, or at least be easy to get compiling.
It's great to see this discussion. Be aware of concepts like non-contiguous representation of data in memory and implicit concepts like unified memory. It's a jungle out there. Keep going, you're doing great!
My two cents on the matter.
@bernhardmgruber and I just had a VC where we also addressed this issue. A short summary:
Regarding @sbastrakov's points:
We also talked about that. I agree that sub-buffers are a confusing term in this sense. @bernhardmgruber proposed that we do slicing, subviews and so on exclusively on views and not on buffers to make this distinction clearer.
My idea was that we remove static shared memory completely and just rely on dynamic shared memory. But I agree with @bernhardmgruber's objection that this would remove a lot of convenience from alpaka.
Thank you @j-stephan for the good summary.
I fully agree. A buffer owns a region of memory with a given size. I wanted to go even further and require it to be contiguous, but Jan told me that for FPGAs this might not be the case.
You are right. The use of existing storage to create a buffer does violate the above meaning of a buffer. However, there are APIs that allow that; OpenCL has one. I suggested to Jan to name this functionality differently, e.g. `adoptBuf`.
I agree as well. And as Jan said, I think we should allow slicing only on views into buffers, so they always stay non-owning.
I only skimmed through the thread, so I am not sure I got everything, but here are my notes on this topic, which explain the current state. alpaka has two concepts for memory: a buffer and a view. All in all, my opinion is that most of the things requested in the numbered list at the top are already there (except the renaming), but in a more generic/abstract way. Enforcing a specific implementation of a view would take away some of that genericity. What I originally wanted to document in this ticket is the need I saw for some better memory abstraction in the kernel where the memory is accessed.
@bernhardmgruber to clarify my point about conversion of pointers to buffers and vice versa: I am not against that in principle, and this operation sometimes makes sense indeed. I was merely against doing so implicitly and thus causing uncertainty and errors regarding who owns the data. Having an explicit constructor or API function to do so is no problem for me, as long as it's consistent with the meaning we (will) put on buffers, pointers, and views.
Thank you for the explanation, @BenjaminW3!
That is a design decision which I might not have made. So this means a concrete buffer implementation is also a concrete view? And if we change the requirement so that kernel arguments must be views now instead of plain pointers, this means we need to pass the alpaka buffers directly into the kernel function? This sounds pretty mad to me:

```c++
auto buffer = alpaka::allocBuf<float>(dev, count); // I forgot the correct args, sorry
auto taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, buffer, count); // pass buffer
...
void kernelFunc(alpaka::buffer<float>& data, int count) { // receive view (which is a buffer)
    ...
    float v = data(i)(Tag1{}, Tag2{});
}
```
I think we had this case in some unit test at some point; it caused issues with VS 2019, and @psychocoderHPC and I decided to not allow alpaka buffers inside kernels. The type passed into the kernel function needs to be a more "lightweight" type. I think we might be talking about a different type of view here. One of the motivations for this different view type stems from the need for address space qualifiers in SYCL. Furthermore, I think this view type could be a single type provided by alpaka, because then we also use the same type for all backends at the kernel interface. Maybe we should change our naming and call this type of view really just an accessor, the same as in SYCL?
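To make the idea concrete, such a lightweight type could look roughly like this. Purely illustrative; the name `Accessor` and its members are my assumptions, not an agreed alpaka API:

```c++
#include <cstddef>

// Illustrative sketch of a lightweight, trivially copyable accessor type
// that could be passed to kernels instead of a raw pointer or a full buffer.
// It carries just enough information for pitched 2D element access; a real
// implementation would additionally carry backend address-space qualifiers.
template <typename TElem>
struct Accessor
{
    std::byte* base;        // possibly address-space qualified per backend
    std::size_t pitchBytes; // byte distance between consecutive rows
    std::size_t rows;
    std::size_t cols;

    TElem& operator()(std::size_t row, std::size_t col) const
    {
        return *(reinterpret_cast<TElem*>(base + row * pitchBytes) + col);
    }
};
```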
I think @j-stephan has the same goal here, but potentially also adding the semantics of address space qualifiers.
This is solved by LLAMA already, although I did not yet promote this library for this use case.
Let me clarify as well:

```c++
auto buffer = alpaka::allocBuf<float>(dev, count);
auto taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, buffer, count); // pass buffer
...
void kernelFunc(T* data, int count) { // receive ptr
    ...
}
```
Nothing else will work implicitly. This should NOT work:

```c++
auto buffer = alpaka::allocBuf<float>(dev, count);
T* data = buffer; // madness
```
The other way around, we have the new functionality proposed by @j-stephan:

```c++
T* p = ...; // existing data
auto buffer = alpaka::allocBuf<float>(dev, count, p); // 1. copies data from p
auto buffer = alpaka::adoptBuf<float>(dev, count, p); // 2. uses p's storage
```
Feature 1 is reasonable, I think. Feature 2 is inspired by SYCL's ability to reuse host memory. I would skip this feature for now.
Yes, we may not want to copy alpaka buffers into kernels. The question is what the type and name of a lightweight view that is used to access memory within a kernel should be. Accessor sounds good to me. It should be easy to write a trait which converts an arbitrary buffer into such an accessor.
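A toy version of such a conversion for a plain host container might look like this. Hypothetical code: it reuses the illustrative `Accessor` from my sketch above, and a real alpaka trait would be specialized per buffer type and query the actual pitch instead of assuming contiguity:

```c++
#include <cstddef>
#include <vector>

// Toy conversion: wrap a contiguous host container in the illustrative
// Accessor. For a contiguous allocation the pitch equals the row size.
template <typename TElem>
Accessor<TElem> makeAccessor(std::vector<TElem>& buf, std::size_t rows, std::size_t cols)
{
    return Accessor<TElem>{reinterpret_cast<std::byte*>(buf.data()),
                           cols * sizeof(TElem), // no padding in a std::vector
                           rows, cols};
}
```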
I was not able to follow the full discussion, but I will try to read it all soon. I would like to point everyone to the Mephisto buffers.
As far as I understood it, that linked Mephisto "device buffer" looks more or less like a |
Found an example that surprised me and is related to this issue:

```c++
// the dim of hostMem is 3-dimensional with the sizes (1, 1, n)
TRed* hostNative = alpaka::mem::view::getPtrNative(hostMem);
for(Idx i = 0; i < n; ++i)
{
    // std::cout << i << "\n";
    hostNative[i] = static_cast<TRed>(i + 1);
}
```
We have 1-dimensional access to 3D memory. The official example is similar:

```c++
// hostBuffer is 3 dimensional
Data* const pHostBuffer = alpaka::getPtrNative(hostBuffer);
// This pointer can be used to directly write
// some values into the buffer memory.
// Mind, that only a host can write on host memory.
// The same holds true for device memory.
for(Idx i(0); i < extents.prod(); ++i)
{
    pHostBuffer[i] = static_cast<Data>(i);
}
```
This only works because we expect a certain memory layout. But it could also be possible that the data contains a pitch. Then we don't have a memory violation, but we have data in the wrong place and the result will be wrong. I know there is a function to get the pitch, but as a user I don't want to handle that. I would rather use an access operator that handles the pitch for me (see the sketch below).
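A minimal sketch of what such pitch-aware access could look like for a 3D buffer. Illustrative only; `slicePitch` and `rowPitch` stand in for the two pitches a backend would report:

```c++
#include <cstddef>

// Pitch-aware 3D element access: slicePitch is the byte distance between 2D
// slices, rowPitch the byte distance between rows. Both may contain padding,
// so linear indexing with a single i would put data in the wrong places.
template <typename T>
T& at(std::byte* base, std::size_t slicePitch, std::size_t rowPitch,
      std::size_t z, std::size_t y, std::size_t x)
{
    return *(reinterpret_cast<T*>(base + z * slicePitch + y * rowPitch) + x);
}
```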
I agree, such code samples assume linearized storage without pitches. That is currently true, but maybe we don't want to rely on it.
Thanks for starting the discussion about the memory topic! 👍
A true or false in the factory points to missing policies. IMO the memory
This is a window and should be clearly identifiable as a view. Suggestion:
A slice is more complex than 1.4. You can describe, for example, that only every second element is selected.
Currently, we have a very relaxed concept of host in alpaka: "everything which is not bound to a device or accelerator". To me this does not feel right. IMO, if we define devices and specify how memory is connected to devices, platforms, and so on, you would always have a device to operate on, even if it is given implicitly (which is currently our "HOST" device). I think that allowing memory to be used without first setting the properties of where it lives and how it is to be used is not the best way, and it leads to the requirement of "HOST" interfaces just to give this inconsistent interface a name.
For me the question is: do we want to give some kind of view/buffer to the device and create the "iterator" later on the device, or do we want to always pass an iterator to the device? Even though we take the second way in all our projects, I think passing a lightweight buffer/view to the device and creating the iterator on the device can have a lot of benefits. A device object created on the device can handle device specifics much better, e.g. by using compiler macros.
What do you have in mind for how shared memory can be created on the device if it is removed? Please keep in mind that CUDA dynamic shared memory is not equivalent to "static" shared memory. For "static" shared memory the compiler can, during compilation, use knowledge about the occupancy for the target device based on the shared memory usage. This will affect the register usage.
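For reference, a sketch of the two flavors, spelled as in newer alpaka releases (an assumption; older releases spell these `alpaka::block::shared::st::allocVar` and `alpaka::block::shared::dyn::getMem`):

```c++
#include <alpaka/alpaka.hpp>

// Sketch contrasting "static" and dynamic shared memory in an alpaka kernel.
struct SharedMemKernel
{
    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        // "static" shared memory: size is a compile-time constant, so the
        // compiler can factor it into occupancy and register decisions
        auto& tile = alpaka::declareSharedVar<float[256], __COUNTER__>(acc);

        // dynamic shared memory: size is chosen at kernel launch time,
        // invisible to the compiler's occupancy analysis
        float* dynMem = alpaka::getDynSharedMem<float>(acc);

        tile[0] = dynMem[0];
    }
};
```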
Even unified memory belongs to a device, or has a location but can be accessed from multiple devices. So we still need a way to describe the ownership, and that's, I think, what you want to point out: the visibility, location, ...
IMO this should get a high priority. Unified memory simplifies programming a lot and gives you benefits, e.g. oversubscribing memory, zero memory copies, ...
The disadvantage of pointers is that you lose meta/policy information about the memory. This means to write a fast
Yesterday we had a longer video conference about this issue. A few points from my notes:
- Current shortcomings
- Ideas
- Concerning llama
- Next steps

Feel free to expand on this if I forgot something.
Regarding LLAMA interop, I formulated that into a concept: https://github.com/alpaka-group/llama/blob/develop/include/llama/Concepts.hpp#L24 For now I just require the type used for storage by a LLAMA view to be bytewise addressable. This is fulfilled by an alpaka buffer containing `std::byte`.
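For reference, such a "bytewise addressable" requirement could be approximated like this. This is my paraphrase, not the exact concept from the linked header:

```c++
#include <concepts>
#include <cstddef>

// Paraphrased sketch: indexing the storage with a size_t must yield
// something convertible to a byte reference.
template <typename TStorage>
concept BytewiseAddressable = requires(TStorage storage, std::size_t i)
{
    { storage[i] } -> std::convertible_to<std::byte&>;
};

static_assert(BytewiseAddressable<std::byte*>); // raw byte pointers qualify
```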
We need to discuss this further! I think you are making a big domain error here! The hardware has an inherent concept of what memory looks like (usually a contiguous 1D array), but this concept might vary. Alpaka has an N-dimensional index domain for threads. It is appealing to bring both tightly together, but this is not a good idea for the future. Don't confuse the memory concept of the hardware with the memory representation of a data type. Likewise, don't confuse the memory layout of a data type with its user-side layout. Finally, always remember that algorithms provide (and somewhat represent) access patterns to data types!
We will implement this for alpaka 0.7.0. Assigned to @j-stephan.
Assigning to @bernhardmgruber as discussed in today's VC.
We now have accessors inside alpaka.
One thing that accessors are unsuitable for is data structures like linked lists or trees (thanks to @fwyzard for mentioning this). For these we should probably keep pointers, unless we want to point everyone to LLAMA. @bernhardmgruber What are your thoughts on this matter?
Keep pointers; they are a powerful escape hatch when a view to an array does not cut it. I have seen a bit of device pointer arithmetic on the host before passing the pointer to a kernel in a different project. That would not at all be possible with accessors.
We are going to keep pointers for non-contiguous data, and we have accessors for the rest.
Just for completeness: the accessors implemented based on the discussion in this thread have been removed again by #2054.
There should not be direct access to memory buffers.
This always implies knowledge about the memory layout (row- or column-major), which is not necessarily correct on the underlying accelerator.