This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

[FEA] Multi-buffer copy algorithm #297

Closed · jrhemstad opened this issue May 10, 2021 · 14 comments

Labels: helps: rapids (Helps or needed by RAPIDS) · P1: should have (Necessary, but not critical) · type: enhancement (New feature or request)

@jrhemstad (Collaborator) commented May 10, 2021

I have N input buffers that I want to copy to N output buffers. I could sequentially call cudaMemcpyAsync N times, but in most cases it would be faster to launch a single kernel that performs all N copies.
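For reference, the sequential baseline is a sketch like this (d_in, d_out, sizes, num_buffers, and stream are placeholder names):

// N independent async copies: N launch/DMA overheads, no batching across buffers.
for (int i = 0; i < num_buffers; ++i) {
  cudaMemcpyAsync(d_out[i], d_in[i], sizes[i], cudaMemcpyDeviceToDevice, stream);
}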

I think such a primitive would be a good fit as a CUB algorithm.

I imagine the API would be something like:

template <typename InputBufferIt, typename OutputBufferIt, typename BeginSizeIteratorT>
cudaError_t BatchMemcpy(void *d_temp_storage,
                        size_t &temp_storage_bytes,
                        InputBufferIt first_input_buffer,
                        InputBufferIt last_input_buffer,
                        BeginSizeIteratorT first_buffer_size,
                        OutputBufferIt first_output_buffer)
{
   // The outer ranges are iterators, but each element must itself be a raw pointer.
   static_assert(std::is_pointer_v<typename std::iterator_traits<InputBufferIt>::value_type>);
   static_assert(std::is_pointer_v<typename std::iterator_traits<OutputBufferIt>::value_type>);
...
}

There are some issues with this API that I haven't figured out yet:

  • I don't think the input/output can/should be iterators. Like DeviceSegmentedRadixSort, I think the in/out need to be raw pointers. Otherwise, how do you accept multiple iterators of potentially different types? Make the algorithm variadic? Maybe.
  • The size of each buffer comes from an iterator, to allow using something like aligned_size_t, but how do you specify different alignments for each buffer?

Related: rapidsai/cudf#7076

@jrhemstad jrhemstad added type: enhancement New feature or request. helps: rapids Helps or needed by RAPIDS. labels May 10, 2021
@alliepiper (Collaborator)

My initial thoughts:

  • I'd go with Memcpy instead of Copy to reinforce that this is a bitwise copy between raw memory segments (as opposed to invoking C++ copy operators, etc).
  • After thinking about this some more, this isn't a Segmented operation, just a batch operation. The CUB Segmented algorithms all assume a single input buffer broken up into segments, indexed by offset. This uses multiple disjoint buffers with associated sizes. So maybe BatchMemcpy?
  • Use different template types for the begin/in/output size iterators (this has bitten users in the segmented algorithms; see #229 and #291).
  • What does output_start_sizes represent? Shouldn't everything about output_buffers be deducible from the other args?

I don't think the input/output can/should be iterators. Like DeviceSegmentedRadixSort, I think the in/out need to be raw pointers. Otherwise, how do you accept multiple iterators of potentially different types? Make the algorithm variadic? Maybe.

I agree that the input/output ranges must be memory buffers and not iterators, but ideally the outer dimension could be an iterator and the inner dimension could just be "pointer-like". For example,

std::vector<thrust::device_ptr<int>> input = ...;
std::vector<thrust::device_ptr<int>> output = ...;
BatchMemcpy(..., input.begin(), ..., output.begin(), ...);

should work ideally. If we do support this, we'll need to make sure that we have a good diagnostic when a buffer isn't convertible to a raw pointer.

The size of each buffer comes from an iterator, to allow using something like aligned_size_t, but how do you specify different alignments for each buffer?

I may be missing something, but since this is a bitwise memcpy, I don't think alignment matters. The memcpy implementation should determine the best alignment/word size to use for copying, and break up the copies into appropriate chunks.

@jrhemstad (Collaborator, Author) commented May 10, 2021

I like BatchMemcpy.

Use different template types for the begin/in/output size iterators

Done.

What does output_start_sizes represent?

That was a mistake.

ideally the outer dimension could be an iterator and the inner dimension could just be "pointer-like"

Agreed, I think this is easy enough to static_assert with appropriate traits (is_pointer may not be sufficient for Thrust fancy pointers).
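A sketch of such a trait, assuming thrust::raw_pointer_cast as the canonical way to strip fancy pointers (the trait name is hypothetical):

#include <iterator>
#include <type_traits>
#include <utility>
#include <thrust/memory.h> // thrust::raw_pointer_cast

// A buffer type qualifies if raw_pointer_cast can reduce it to a raw pointer
// (covers raw pointers and Thrust fancy pointers alike).
template <typename BufferIt>
inline constexpr bool is_pointer_like_buffer_v = std::is_pointer_v<
    decltype(thrust::raw_pointer_cast(
        std::declval<typename std::iterator_traits<BufferIt>::value_type>()))>;

// static_assert(is_pointer_like_buffer_v<InputBufferIt>,
//               "input buffers must be raw or fancy pointers");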

I may be missing something, but since this is a bitwise memcpy, I don't think alignment matters.

It matters for getting good performance. In the worst case, the memcpy has to assume 1B alignment and use 1B loads/stores, or introspect the pointers at runtime to determine the alignment and decide what size loads/stores can be used. Introspecting the pointers can generate a lot of extra code that harms perf, so if you can statically specify the alignment, it is much better for perf.
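To illustrate, a sketch of the copy loop when a 16B alignment guarantee is known statically (hypothetical kernel, not CUB's implementation):

// Each thread moves 16 bytes per iteration via int4, with no runtime
// pointer introspection. Without the guarantee, the kernel must branch on
// runtime alignment or fall back to byte-wise copies.
__global__ void copy_aligned_16(const int4 *__restrict__ in,
                                int4 *__restrict__ out,
                                size_t num_vec4)
{
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < num_vec4;
       i += (size_t)gridDim.x * blockDim.x)
  {
    out[i] = in[i]; // one 16B load + one 16B store
  }
}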

I've updated the issue description based on your feedback.

@alliepiper (Collaborator)

if you can statically specify the alignment, it is much better for perf.

Makes sense.

how do you specify different alignments for each buffer?

I'm not sure there's a good way to do this. If this is for a static optimization, all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm for each unique set of alignments.

A more feasible compromise might be to add an extra argument that's essentially a std::integral_constant<std::size_t, ALIGN>. ALIGN would specify the alignment of all input/output buffers, and would default to 0 meaning "inspect the pointers". This will require consistent alignments across buffers to achieve the optimization, but would avoid many of the template instantiation issues.
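For illustration, a hypothetical call under that scheme (all names are placeholders):

// The trailing compile-time hint promises 16B alignment for every buffer;
// std::integral_constant<std::size_t, 0> (the default) would mean
// "inspect the pointers at runtime".
BatchMemcpy(d_temp_storage, temp_storage_bytes,
            input_buffers, input_buffers + num_buffers,
            buffer_sizes, output_buffers,
            std::integral_constant<std::size_t, 16>{});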

Would that be suitable for your use case?

@alliepiper (Collaborator)

Alternatively, it might make sense to introduce a tagged pointer type that carries alignment info. It'd still be a headache from a template standpoint, but it would be a nicer interface.

@jrhemstad (Collaborator, Author)

all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm for each unique set of alignments.

Agreed, that's why I don't think it's really a solvable problem without making the algorithm variadic.

specify the alignment of all input/output buffers

I think this is the only reasonable, non-variadic solution. Though I don't think it requires an extra integral_constant parameter. We can just use cuda::aligned_size_t as the value_type of the Size iterator. Same as what's done for cuda::memcpy_async.
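For example (a sketch: the BatchMemcpy call and buffer names are placeholders; only the size type matters here):

#include <cuda/barrier> // cuda::aligned_size_t, per libcu++
#include <vector>

// Each size promises that the corresponding pointers and byte count are
// 16B-aligned, the same way cuda::memcpy_async interprets aligned_size_t.
std::vector<cuda::aligned_size_t<16>> sizes;
sizes.emplace_back(bytes_0);
sizes.emplace_back(bytes_1);

BatchMemcpy(d_temp, temp_bytes,
            in_buffers.begin(), in_buffers.end(),
            sizes.begin(), out_buffers.begin());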

@alliepiper (Collaborator)

We can just use cuda::aligned_size_t as the value_type of the Size iterator.

Good point -- that would be ideal. Since we're adding a libcu++ dependency soon this should be totally doable.

@gevtushenko (Collaborator)

We might consider a generalized version of this API. The original issue looks like this.

[diagram: the original batched copy]

It's helpful to have a mapping for ranges within sources and destinations. In this case, we can introduce BatchMemcpyGather and BatchMemcpyScatter facilities.

[diagrams: BatchMemcpyGather and BatchMemcpyScatter mappings]

I suppose a fixed mapping group size per source/destination pair is sufficient. It's equal to 64 bytes for the int32 arrays above.

@brycelelbach (Collaborator)

I'd like to see a few things happen here:

  • CUB device-level batched data movement kernel (RAPIDS ask).
  • CUB block-, warp-, and thread-level (a)synchronous batched data movement APIs.
  • CUB thread-level asynchronous data movement APIs.
  • Vectorization for all of the above.

@elstehle (Collaborator) commented Jun 21, 2021

How do we generally feel about taking an extra parameter (max_total_bytes) that represents an upper bound on the total number of bytes that we expect to be copied (summed over all the buffers' sizes)? This would allow us to request some temp_storage that we could use for load balancing amongst thread blocks.

template <typename InputBufferIt, typename OutputBufferIt, typename BeginSizeIteratorT, typename OffsetT>
cudaError_t BatchMemcpy(void *d_temp_storage,
                        size_t &temp_storage_bytes,
                        InputBufferIt first_input_buffer,
                        InputBufferIt last_input_buffer,
                        BeginSizeIteratorT first_buffer_size,
                        OutputBufferIt first_output_buffer,
                        OffsetT max_total_bytes)
{
   static_assert(std::is_pointer_v<typename std::iterator_traits<InputBufferIt>::value_type>);
   static_assert(std::is_pointer_v<typename std::iterator_traits<OutputBufferIt>::value_type>);
...
}

Other CUB algorithms currently take num_items as a host-side value. Here we have iterators that can be dereferenced only on the device. In that case, we could compute temp_storage_bytes from max_total_bytes.

I expect temp_storage_bytes to be a small fraction of the total number of items (e.g., <1% of N). Similarly, we'd incur ~1% more memory transfers. In exchange, I hope we can get robust runtimes at (close to) peak memory bandwidth across the whole range of batch sizes.
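Usage would follow CUB's usual two-phase temp-storage pattern (a sketch using the hypothetical signature above):

// Phase 1: query required temp storage (no work is done).
size_t temp_storage_bytes = 0;
BatchMemcpy(nullptr, temp_storage_bytes,
            first_in, last_in, first_size, first_out, max_total_bytes);

// Phase 2: allocate and run.
void *d_temp_storage = nullptr;
cudaMalloc(&d_temp_storage, temp_storage_bytes);
BatchMemcpy(d_temp_storage, temp_storage_bytes,
            first_in, last_in, first_size, first_out, max_total_bytes);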

@alliepiper (Collaborator)

Can you elaborate on what the temp storage is used for in this case?

Could max_total_bytes be optional in case it's not known, or if the user has to handle highly variable loads?

It should be fine to include that as an optimization, but I'd still like to write generic usages where the upper bound is unknown.

@jrhemstad (Collaborator, Author)

Here we have iterators that can be dereferenced only on the device.

Actually, when I first envisioned this API, I was thinking the size iterator would be host-accessible. But it's not obvious to me whether that's the right decision.

@elstehle (Collaborator)

Actually, when I first envisioned this API, I was thinking the size iterator would be host-accessible. But it's not obvious to me whether that's the right decision.

Thanks for clarifying, @jrhemstad. I'm inclined not to require that the iterators be host-accessible as well. IIRC, all iterators in CUB are currently accessed only from the device. There are also use cases where this algorithm will be called right after another algorithm that ran on the GPU and produced the buffer sizes as part of its device-side output. If the size iterator had to be host-accessible too, that would imply a cudaDeviceSynchronize between that first algorithm and the BatchMemcpy that now needs those sizes on the host. I'd prefer to avoid that.

On another note, I think I have found a viable, load-balanced solution that makes temp_storage_bytes linear in the number of buffers rather than in the total number of bytes being copied. I'll follow up with the proposal shortly.

@alliepiper alliepiper modified the milestones: 1.17.0, 2.0.0 May 5, 2022
@alliepiper alliepiper modified the milestones: 2.0.0, 2.1.0 Jul 25, 2022
@jrhemstad jrhemstad added this to CCCL Aug 11, 2022
@jrhemstad jrhemstad moved this to Needs Triage in CCCL Aug 14, 2022
@jrhemstad jrhemstad removed the status in CCCL Aug 14, 2022
@elstehle (Collaborator)

This feature request has been addressed by PR #359, which is now merged.

@github-project-automation github-project-automation bot moved this to Done in CCCL Jan 10, 2023
@jakirkham

Excited to see this has landed! 🥳

Is the idea still to include this in 2.1.0? If so, when is that release scheduled? Just trying to get an idea for planning purposes. Thanks! 🙏
