[FEA] Multi-buffer copy algorithm #297
Comments
My initial thoughts:
I agree that the input/output ranges must be memory buffers and not iterators, but ideally the outer dimension could be an iterator and the inner dimension could just be "pointer-like". For example, an iterator whose elements convert to raw device pointers should work ideally. If we do support this, we'll need to make sure that we have a good diagnostic when a buffer isn't convertible to a raw pointer.
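For concreteness, here is a hypothetical sketch of the kind of input that would satisfy this (the container choice and names are only illustrative):

```cpp
#include <thrust/device_vector.h>
#include <vector>

int main()
{
  // Two device buffers of different sizes.
  thrust::device_vector<int> a(100), b(200);

  // The "outer" dimension is an iterator over buffers; each element only needs
  // to be pointer-like, i.e. convertible to a raw device pointer.
  std::vector<int*> h_srcs = {thrust::raw_pointer_cast(a.data()),
                              thrust::raw_pointer_cast(b.data())};
  thrust::device_vector<int*> d_srcs(h_srcs.begin(), h_srcs.end());

  // d_srcs.begin() is what would be handed to the batched copy as the buffer iterator.
  return 0;
}
```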
I may be missing something, but since this is a bitwise memcpy, I don't think alignment matters. The memcpy implementation should determine the best alignment/word size to use for copying, and break up the copies into appropriate chunks.
I like
Done.
That was a mistake.
Agreed, I think this is easy enough to static_assert with appropriate traits.
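A minimal sketch of what such a check could look like (assumed trait usage, not the actual CUB code):

```cpp
#include <iterator>
#include <type_traits>

// Diagnose, at compile time, that the buffer iterator dereferences to something
// convertible to a raw pointer.
template <typename BufferIteratorT>
void check_buffer_iterator()
{
  using buffer_t = typename std::iterator_traits<BufferIteratorT>::value_type;
  static_assert(std::is_convertible<buffer_t, const void*>::value,
                "Buffers must be convertible to raw pointers");
}
```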
It matters for getting good performance. In the worst case, the memcpy has to assume 1B alignment and use 1B loads/stores, or introspect the pointers to determine the alignment and decide what size loads/stores can be used. Introspecting the pointers can generate a lot of extra code that harms perf, so if you can statically specify the alignment, it is much better for perf. I've updated the issue description based on your feedback.
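To illustrate the trade-off (a sketch only, not the proposed implementation): with a static 16-byte guarantee the copy loop can use 16-byte vector loads/stores, while the safe fallback without any guarantee is byte-wise copies.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// One 16B load + one 16B store per thread; valid only if both pointers and the
// byte count are 16-byte aligned.
__global__ void copy_aligned16(const uint4* __restrict__ in,
                               uint4* __restrict__ out,
                               std::size_t num_vectors)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_vectors) { out[i] = in[i]; }
}

// Fallback that is always valid: one byte per thread.
__global__ void copy_bytes(const char* __restrict__ in,
                           char* __restrict__ out,
                           std::size_t num_bytes)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_bytes) { out[i] = in[i]; }
}
```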
Makes sense.
I'm not sure there's a good way to do this. If this is for a static optimization, all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm for each unique set of alignments. A more feasible compromise might be to add an extra argument that's essentially a single minimum alignment guaranteed for all buffers. Would that be suitable for your use case?
Alternatively, it might make sense to introduce a tagged pointer type that carries alignment info. It'd still be a headache from a template standpoint, but it would be a nicer interface.
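For example, a minimal sketch of such a tagged pointer (hypothetical, not an existing CUB type):

```cpp
#include <cstddef>

// Carries the statically guaranteed alignment of the pointee as part of the type.
template <typename T, std::size_t Alignment>
struct aligned_ptr
{
  static constexpr std::size_t alignment = Alignment;
  T* ptr;

  __host__ __device__ operator T*() const { return ptr; }  // still "pointer-like"
};

// The algorithm could then inspect decltype(buffer)::alignment to pick its
// load/store width, e.g. aligned_ptr<char, 16>{p} promises 16-byte alignment.
```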
Agreed, that's why I don't think it's really a solvable problem without making the algorithm variadic.
I think this is the only reasonable, non-variadic solution. Though I don't think it requires an extra parameter -- something like cuda::aligned_size_t from libcu++ could carry the alignment.
Good point -- that would be ideal. Since we're adding a libcu++ dependency soon, this should be totally doable.
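One possible shape of that, assuming the libcu++ facility in question is `cuda::aligned_size_t` (this is my reading of the suggestion, not what was merged):

```cpp
#include <cstddef>
#include <cuda/barrier>  // libcu++; provides cuda::aligned_size_t

// A size value that also carries a compile-time alignment guarantee; it converts
// back to a plain size, so it can be used wherever a byte count is expected.
void example(std::size_t num_bytes)
{
  auto sized = cuda::aligned_size_t<16>(num_bytes);  // promises 16B-aligned buffers/sizes
  std::size_t plain = sized;                         // usable as an ordinary size
  (void)plain;
}
```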
We might consider a generalized version of this API. The original issue illustrates this: it's helpful to have a mapping for ranges within sources and destinations. In this case, we could introduce BatchMemcpyGather and BatchMemcpyScatter facilities. I suppose a fixed mapping group size per source/destination pair is sufficient; it's equal to 64 bytes for the int32 arrays in that example.
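As a hypothetical sketch of what a gather-style variant's interface could look like (names and parameters are made up for illustration):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Each destination buffer is assembled from fixed-size groups (e.g. 64 bytes each)
// gathered from the source buffers according to a mapping iterator.
template <typename SrcPtrItT, typename DstPtrItT, typename GroupMapItT, typename SizeItT>
cudaError_t BatchMemcpyGather(void*        d_temp_storage,     // nullptr: query temp storage size
                              std::size_t& temp_storage_bytes,
                              SrcPtrItT    sources,            // source buffer pointers
                              DstPtrItT    destinations,       // destination buffer pointers
                              GroupMapItT  group_mapping,      // source group feeding each destination group
                              SizeItT      buffer_sizes,       // per-destination byte counts
                              int          num_buffers,
                              std::size_t  group_size_bytes,   // fixed mapping group size, e.g. 64
                              cudaStream_t stream = 0);
```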
I'd like to see a few things happen here:
How do we generally feel about taking an extra parameter (e.g., an upper bound on the total number of bytes being copied)?
Other CUB algorithms currently have I expect
Can you elaborate on what the temp storage is used for in this case? Could it be optional? It should be fine to include that as an optimization, but I'd still like to write generic usages where the upper bound is unknown.
Actually, when I first envisioned this API, I was thinking the size iterator would be host accessible. But it's not obvious to me if that's the right decision or not.
Thanks for clarifying, @jrhemstad. I'm inclined not to make it a requirement that the iterators are accessible from the host as well. IIRC, all iterators in CUB are currently only accessed from the device. I also think there are use cases where this algorithm will be called right after another algorithm that has previously run on the GPU. If it were a requirement that the size iterator be host-accessible too, then this would imply a device-to-host copy and a synchronization in between. On another note, I think I have found a viable, load-balanced solution that makes the extra parameter unnecessary.
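For illustration (names assumed), the kind of round trip a host-accessible size requirement would force in that scenario:

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// If a previous GPU algorithm wrote the per-buffer sizes into d_sizes, making the
// size iterator host-accessible means copying them back and waiting on the stream
// before the batched copy can be launched.
std::vector<std::size_t> fetch_sizes_on_host(const std::size_t* d_sizes,
                                             int num_buffers,
                                             cudaStream_t stream)
{
  std::vector<std::size_t> h_sizes(num_buffers);
  cudaMemcpyAsync(h_sizes.data(), d_sizes, num_buffers * sizeof(std::size_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // host must wait here before reading the sizes
  return h_sizes;
}
```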
This feature request has been addressed by PR #359, which is now merged.
Excited to see this has landed! 🥳 Is the idea still to include this in 2.1.0? If so, when is that release scheduled? Just trying to get an idea for planning purposes. Thanks! 🙏
I have `N` input buffers that I want to copy to `N` output buffers. I could sequentially call `cudaMemcpyAsync` `N` times, but in most cases it would be faster to launch a single kernel that performs all `N` copies. I think such a primitive would be a good fit as a CUB algorithm.

I imagine the API would be something like:
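A rough sketch of such a signature, following CUB's usual two-phase temp-storage convention (the names here are illustrative, not the exact proposal):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// First call with d_temp_storage == nullptr to query temp_storage_bytes, then
// call again with the allocation to launch the batched copy.
template <typename InputBufferItT, typename OutputBufferItT, typename BufferSizeItT>
cudaError_t BatchMemcpy(void*           d_temp_storage,
                        std::size_t&    temp_storage_bytes,
                        InputBufferItT  input_buffers,   // iterator of source pointers
                        OutputBufferItT output_buffers,  // iterator of destination pointers
                        BufferSizeItT   buffer_sizes,    // iterator of per-buffer byte counts
                        int             num_buffers,     // N
                        cudaStream_t    stream = 0);
```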
There are some issues with this API I haven't figured out yet:

- Like `DeviceSegmentedRadixSort`, I think the in/out need to be raw pointers. Otherwise, how do you accept multiple iterators of potentially different types? Make the algorithm variadic? Maybe.
- `aligned_size_t` can express a single alignment, but how do you specify different alignments for each buffer?

Related: rapidsai/cudf#7076