[FEA] Multi-buffer copy algorithm #297
Comments
My initial thoughts:
I agree that the input/output ranges must be memory buffers and not iterators, but ideally the outer dimension could be an iterator and the inner dimension could just be "pointer-like". For example, an iterator whose elements convert to raw device pointers should work ideally. If we do support this, we'll need to make sure that we have a good diagnostic when a buffer isn't convertible to a raw pointer.
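For concreteness, here is a hypothetical sketch of the kind of input that would satisfy this (the container choice and names are only illustrative):

```cpp
#include <thrust/device_vector.h>
#include <vector>

int main()
{
  // Two device buffers of different sizes.
  thrust::device_vector<int> a(100), b(200);

  // The "outer" dimension is an iterator over buffers; each element only needs
  // to be pointer-like, i.e. convertible to a raw device pointer.
  std::vector<int*> h_srcs = {thrust::raw_pointer_cast(a.data()),
                              thrust::raw_pointer_cast(b.data())};
  thrust::device_vector<int*> d_srcs(h_srcs.begin(), h_srcs.end());

  // d_srcs.begin() is what would be handed to the batched copy as the buffer iterator.
  return 0;
}
```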
I may be missing something, but since this is a bitwise memcpy, I don't think alignment matters. The memcpy implementation should determine the best alignment/word size to use for copying, and break up the copies into appropriate chunks.
I like
Done.
That was a mistake.
Agreed, I think this is easy enough to static_assert with appropriate traits.
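A minimal sketch of what such a check could look like (assumed trait usage, not the actual CUB code):

```cpp
#include <iterator>
#include <type_traits>

// Diagnose, at compile time, that the buffer iterator dereferences to something
// convertible to a raw pointer.
template <typename BufferIteratorT>
void check_buffer_iterator()
{
  using buffer_t = typename std::iterator_traits<BufferIteratorT>::value_type;
  static_assert(std::is_convertible<buffer_t, const void*>::value,
                "Buffers must be convertible to raw pointers");
}
```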
It matters for getting good performance. In the worst case, the memcpy has to assume 1B alignment and use 1B loads/stores, or introspect the pointers to determine the alignment and decide what size loads/stores can be used. Introspecting the pointers can generate a lot of extra code that harms perf, so if you can statically specify the alignment, it is much better for perf. I've updated the issue description based on your feedback.
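To illustrate the trade-off (a sketch only, not the proposed implementation): with a static 16-byte guarantee the copy loop can use 16-byte vector loads/stores, while the safe fallback without any guarantee is byte-wise copies.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// One 16B load + one 16B store per thread; valid only if both pointers and the
// byte count are 16-byte aligned.
__global__ void copy_aligned16(const uint4* __restrict__ in,
                               uint4* __restrict__ out,
                               std::size_t num_vectors)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_vectors) { out[i] = in[i]; }
}

// Fallback that is always valid: one byte per thread.
__global__ void copy_bytes(const char* __restrict__ in,
                           char* __restrict__ out,
                           std::size_t num_bytes)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_bytes) { out[i] = in[i]; }
}
```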
Makes sense.
I'm not sure there's a good way to do this. If this is for a static optimization, all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm for each unique set of alignments. A more feasible compromise might be to add an extra argument that's essentially a single minimum alignment guaranteed for all buffers. Would that be suitable for your use case?
Alternatively, it might make sense to introduce a tagged pointer type that carries alignment info. It'd still be a headache from a template standpoint, but it would be a nicer interface.
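For example, a minimal sketch of such a tagged pointer (hypothetical, not an existing CUB type):

```cpp
#include <cstddef>

// Carries the statically guaranteed alignment of the pointee as part of the type.
template <typename T, std::size_t Alignment>
struct aligned_ptr
{
  static constexpr std::size_t alignment = Alignment;
  T* ptr;

  __host__ __device__ operator T*() const { return ptr; }  // still "pointer-like"
};

// The algorithm could then inspect decltype(buffer)::alignment to pick its
// load/store width, e.g. aligned_ptr<char, 16>{p} promises 16-byte alignment.
```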
Agreed, that's why I don't think it's really a solvable problem without making the algorithm variadic.
I think this is the only reasonable, non-variadic solution. Though I don't think it requires an extra parameter -- something like cuda::aligned_size_t from libcu++ could carry the alignment.
Good point -- that would be ideal. Since we're adding a libcu++ dependency soon, this should be totally doable.
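One possible shape of that, assuming the libcu++ facility in question is `cuda::aligned_size_t` (this is my reading of the suggestion, not what was merged):

```cpp
#include <cstddef>
#include <cuda/barrier>  // libcu++; provides cuda::aligned_size_t

// A size value that also carries a compile-time alignment guarantee; it converts
// back to a plain size, so it can be used wherever a byte count is expected.
void example(std::size_t num_bytes)
{
  auto sized = cuda::aligned_size_t<16>(num_bytes);  // promises 16B-aligned buffers/sizes
  std::size_t plain = sized;                         // usable as an ordinary size
  (void)plain;
}
```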
We might consider a generalized version of this API. The original issue illustrates this: it's helpful to have a mapping for ranges within sources and destinations. In this case, we could introduce BatchMemcpyGather and BatchMemcpyScatter facilities. I suppose a fixed mapping group size per source/destination pair is sufficient; it's equal to 64 bytes for the int32 arrays in that example.
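As a hypothetical sketch of what a gather-style variant's interface could look like (names and parameters are made up for illustration):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Each destination buffer is assembled from fixed-size groups (e.g. 64 bytes each)
// gathered from the source buffers according to a mapping iterator.
template <typename SrcPtrItT, typename DstPtrItT, typename GroupMapItT, typename SizeItT>
cudaError_t BatchMemcpyGather(void*        d_temp_storage,     // nullptr: query temp storage size
                              std::size_t& temp_storage_bytes,
                              SrcPtrItT    sources,            // source buffer pointers
                              DstPtrItT    destinations,       // destination buffer pointers
                              GroupMapItT  group_mapping,      // source group feeding each destination group
                              SizeItT      buffer_sizes,       // per-destination byte counts
                              int          num_buffers,
                              std::size_t  group_size_bytes,   // fixed mapping group size, e.g. 64
                              cudaStream_t stream = 0);
```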
I'd like to see a few things happen here:
How do we generally feel about taking an extra parameter (e.g., an upper bound on the total number of bytes being copied)?
Other CUB algorithms currently have I expect
Can you elaborate on what the temp storage is used for in this case? Could it be optional? It should be fine to include that as an optimization, but I'd still like to write generic usages where the upper bound is unknown.
Actually, when I first envisioned this API, I was thinking the size iterator would be host accessible. But it's not obvious to me if that's the right decision or not.
Thanks for clarifying, @jrhemstad. I'm inclined not to make it a requirement that the iterators are accessible from the host as well. IIRC, all iterators in CUB are currently only accessed from the device. I also think there are use cases where this algorithm will be called right after another algorithm that has previously run on the GPU. If it were a requirement that the size iterator be host-accessible too, then this would imply a device-to-host copy and a synchronization in between. On another note, I think I have found a viable, load-balanced solution that makes the extra parameter unnecessary.
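For illustration (names assumed), the kind of round trip a host-accessible size requirement would force in that scenario:

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// If a previous GPU algorithm wrote the per-buffer sizes into d_sizes, making the
// size iterator host-accessible means copying them back and waiting on the stream
// before the batched copy can be launched.
std::vector<std::size_t> fetch_sizes_on_host(const std::size_t* d_sizes,
                                             int num_buffers,
                                             cudaStream_t stream)
{
  std::vector<std::size_t> h_sizes(num_buffers);
  cudaMemcpyAsync(h_sizes.data(), d_sizes, num_buffers * sizeof(std::size_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // host must wait here before reading the sizes
  return h_sizes;
}
```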
This feature request has been addressed by PR #359, which is now merged.
Excited to see this has landed! 🥳 Is the idea still to include this in 2.1.0? If so, when is that release scheduled? Just trying to get an idea for planning purposes. Thanks! 🙏
I have `N` input buffers that I want to copy to `N` output buffers. I could sequentially call `cudaMemcpyAsync` `N` times, but in most cases it would be faster to launch a single kernel that performs all `N` copies. I think such a primitive would be a good fit as a CUB algorithm.

I imagine the API would be something like:
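A rough sketch of such a signature, following CUB's usual two-phase temp-storage convention (the names here are illustrative, not the exact proposal):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// First call with d_temp_storage == nullptr to query temp_storage_bytes, then
// call again with the allocation to launch the batched copy.
template <typename InputBufferItT, typename OutputBufferItT, typename BufferSizeItT>
cudaError_t BatchMemcpy(void*           d_temp_storage,
                        std::size_t&    temp_storage_bytes,
                        InputBufferItT  input_buffers,   // iterator of source pointers
                        OutputBufferItT output_buffers,  // iterator of destination pointers
                        BufferSizeItT   buffer_sizes,    // iterator of per-buffer byte counts
                        int             num_buffers,     // N
                        cudaStream_t    stream = 0);
```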
There are some issues with this API I haven't figured out yet:

- Like `DeviceSegmentedRadixSort`, I think the in/out need to be raw pointers. Otherwise, how do you accept multiple iterators of potentially different types? Make the algorithm variadic? Maybe.
- `aligned_size_t` can express a single alignment, but how do you specify different alignments for each buffer?

Related: rapidsai/cudf#7076