-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a chunked_pack API #13260
Implement a chunked_pack API #13260
Conversation
Co-authored-by: Dave Baranec <[email protected]> Signed-off-by: Alessandro Bellina <[email protected]>
I don't see major differences when running Benchmark results for this PR, with the `chunked_pack` ones at the end
Benchmark results before this PR
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First wave of comments.
Thanks @nvdbaranec @ttnghia, will address this afternoon :) |
@ttnghia @nvdbaranec @davidwendt let me know if you have further comments. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass of review, mostly high-level comments and minor editorial fixes. Will take another pass tomorrow to fully wrap my head around the algorithm.
[[nodiscard]] bool has_next() const; | ||
|
||
/** | ||
* @brief Packs the next chunk into `user_buffer`. This should be call as long as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Packs the next chunk into `user_buffer`. This should be call as long as | |
* @brief Packs the next chunk into `user_buffer`. This should be called as long as |
* The intent of this operation is to be used in a streamed fashion at times of GPU | ||
* out-of-memory, where we want to minimize the number of small cudaMemcpy calls and | ||
* tracking of all the metadata associated with cudf tables. Because of the memory constraints, | ||
* all thrust and scratch memory allocations are using the passed-in memory resource exclusively, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with documentation, I think this behavior is unexpected given the general approach that libcudf takes. I understand why we might need to support this, but I would prefer it be communicated very clearly in the API and not rely solely on documentation. Can we modify the struct to have two memory resources, the usual mr
and an extra temp_mr
or so? They can both default to the per-device resource. That way the caller gets a very clear indication that the memory resource used for temp allocations is also controllable in this particular instance.
If all allocations made by this function are temporary because it's just splitting an existing buffer, then maybe just renaming the mr makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunked_pack
never allocates result buffers. So I am going to replace mr
in this api with temp_mr
to make it clear, as you suggest
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
|
||
/** | ||
* @brief Destructor that will be implemented as default, required because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Destructor that will be implemented as default, required because | |
* @brief Destructor that will be implemented as default. Declared with definition here because |
cpp/src/copying/contiguous_split.cu
Outdated
CUDF_EXPECTS(user_buffer_size >= desired_batch_size, | ||
"The output buffer size must be at least 1MB in size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this error in the constructor?
cpp/src/copying/contiguous_split.cu
Outdated
* return true when there is work to be done, and false otherwise. | ||
* | ||
* contiguous_split_state::contiguous_split() performs a single-pass contiguous_split | ||
* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only is redundant with "iff".
* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size. | |
* and is valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size. |
Unless "iff" is a typo, in which case use
* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size. | |
* and is only valid if contiguous_split_state is instantiated with 0 for the user_buffer_size. |
cpp/src/copying/contiguous_split.cu
Outdated
current_iteration(0), | ||
starting_batch(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit
current_iteration(0), | |
starting_batch(0), | |
current_iteration{0}, | |
starting_batch{0}, |
CUDF_EXPECTS(current_iteration < num_iterations, | ||
"current_iteration cannot exceed num_iterations"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can an external caller ever trigger this exception, or would this be purely an internal error? If the latter, maybe not something to throw an exception for, but more like an assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I guess this could happen if someone tried to advance the iteration on the state object without first checking if it has any chunks left?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, that's what it is protecting against.
return std::make_pair(starting_batch, count_for_current); | ||
} | ||
|
||
std::size_t advance_iteration() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add docs for these functions (this one and those below)? Don't need full doxygen, a simple description of inputs/output.
cpp/src/copying/contiguous_split.cu
Outdated
auto keys = cudf::detail::make_counting_transform_iterator( | ||
0, split_key_functor{static_cast<int>(num_src_bufs)}); | ||
auto values = | ||
cudf::detail::make_counting_transform_iterator(0, buf_size_functor{d_dst_buf_info}); | ||
|
||
thrust::reduce_by_key(rmm::exec_policy(stream), | ||
thrust::reduce_by_key(rmm::exec_policy(stream, mr), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made exactly the same comment independently in the header file. Adding a separate temp_mr
was my proposal.
cpp/src/copying/contiguous_split.cu
Outdated
std::vector<std::size_t> const h_size_of_buffs_per_iteration; | ||
}; | ||
|
||
std::unique_ptr<chunk_iteration_state> make_chunk_iteration_state( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternate solution here that would satisfy both requests is to define the function out of band, i.e.
class chunk_iteration_state {
...
static auto create(...);
...
}
static auto chunk_iteration_state::create(...) {...}
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know some of this code is just moved around so these are just suggestions for improvement.
Co-authored-by: David Wendt <[email protected]>
I think I have addressed all issues known to me @vyasr and @davidwendt let me know if you have other things you want me to tackle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions for improvement and a couple of questions. I didn't appreciate at first how much of the code is simply moved around rather than new, so I won't push for too many further changes here. I think we can merge this and iterate further later if desired.
I personally find some of the organization confusing. For instance, it's not clear when constructors do a lot of work, when they just set up pointers to preexisting allocations, or when they do nearly nothing. It's also not immediately obvious when things are implemented as methods or not, and which pieces of state are getting passed around (there are lots of struct members being accessed directly then handed off to some other algorithm). I believe I've made sense of it after a longer read-through, but I'm sure I'm missing lots of details. Would be good to come back and try to clarify more of the structure.
cpp/src/copying/contiguous_split.cu
Outdated
std::unique_ptr<chunk_iteration_state> chunk_iter_state; | ||
|
||
// Two modes are allowed: | ||
// - user provided buffer: as the name implies, the user has provided a buffer that must be at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this context a user is not a libcudf user, but rather libcudf code making use of this struct, right? In particular, contiguous_split
vs chunked_pack
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. I will clarify this comment, it was written before we started calling this chunked_pack
.. before this it was chunked_contiguous_split
which was all kinds of confusing :)
cpp/src/copying/contiguous_split.cu
Outdated
return result; | ||
} | ||
|
||
cudf::table_view const input; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be documented using the standard doxygen style for members, see the example in our dev docs. For the members with very long descriptions maybe use a doxygen-style short description and a separate comment for the longer explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to chunked_iteration_state
and to contiguous_split_state
. I didn't add for the other structs in the anonymous namespace for now, but can go back to it if all member should get this treatment.
cpp/src/copying/contiguous_split.cu
Outdated
max(std::size_t{1}, util::round_up_unsafe(bytes, desired_chunk_size) / desired_chunk_size); | ||
// packed block of memory 1: split indices and src_buf_info structs | ||
struct packed_split_indices_and_src_buf_info { | ||
explicit packed_split_indices_and_src_buf_info(cudf::table_view const& input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this constructor is marked explicit while none of the other structs are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason. I think this was me not understanding explicit
.
What are the rules for explicit
in cuDF?
As far as I know, in general, the idea is to make sure the constructor is invoked, well, explicitly, rather than via a conversion. With multiple-argument constructors, I think the only conversion one would "protect" against is that via the initializer list. But I think initializer lists are preferred, so by marking constructors explicit
I am probably going against that, forcing the constructor to be invoked directly. Is my understanding correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the explicit here to make them consistent, but I am curious on the above for my own edification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're understanding seems correct to me: https://en.cppreference.com/w/cpp/language/explicit
Given that this an internal class I don't think there will be much chance of ambiguity between the ctor and an arbitrary init list.
…faults to the default memory resource
…er private functions. Make compute_batches a free function
Thanks @nvdbaranec @ttnghia @davidwendt @vyasr for all the reviews! I'll be able to revisit this code base soon with #13332, so I am going to continue to iterate to handle the feedback that was not addressed in this PR. |
/merge |
This PR is the standalone JNI side of #13260. Therefore it doesn't build in as is, but I am putting this up as draft for the Java reviewers to start taking a look. This implements a `ChunkedPack` java class mirroring the interface of `cudf::chunked_pack`. This is an iterator-like class that can be used to invoke `cudf::pack` (aka `cudf::contigous_split` without splits) over several iterations against a bounce buffer. In order to create a `ChunkedPack`, the user calls `makeChunkedPack` from a `Table` instance. During this call the user can also pass an `RmmDeviceMemoryResource` to be used internally by `cudf::chunked_pack` exclusively for scratch/temporary data. Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) URL: #13278
Description
This PR introduces a "chunked pack" API built on top of
contiguous_split
. This API is used when we want to copy a cuDFtable_view
over the wire or to the host in a contiguous layout (akacontiguous_split
), but without user provided memory. As a result this API does not allocate any buffers for GPU data and instead it uses a provided user buffer to perform thecontiguous_split
in chunks.Luckily,
contiguous_split
already had a subdivision of work that we are now calling "batches". Each batch is up to 1MB of data from the source table. As such, one can think of this function as copying as many batches as will fit in a user buffer (or "chunking the batches"). The API follows other chunked interfaces in cuDF with ahas_next
and anext
, with the difference that in this casenext
takes adevice_span
, and the user can provide anydevice_span
as long as the size is the same as the size specified during construction.When thrust and scratch space is required on the GPU, this PR makes use of the memory resource passed, even to the point of using the second argument of
exec_policy
which is the memory resource. I found this while testing thatexec_policy
defaults to the per-device resource, and in this case I really wanted to pass a pooled memory resource, outside of our normal async memory resource to set aside this memory ahead of time.Most of the changes are about moving things around. You'll see 3 structs with the name "packed" (e.g.
packed_split_indices_and_src_buf_info
). These are here to group together state thatcontiguous_split
needs to work, but now because ofchunked_pack
we need to keep around as well for the subsequent calls tonext
. These structs are also packed in memory, which is an optimization contiguous_split had already done to reduce the number of d2h/h2d copies. This PR did need to add a state object (contiguous_split_state
) that nowcontiguous_split
leverages. It also makes use of themetadata_builder
which we added in a prior PR.This PR does not include the JNI changes needed for this to work on the java side, I'll post that separately. I figured this was too big already (and if people have suggestions on "chunking" this PR up, I am happy to do that).
@nvdbaranec spent a great deal of time documenting
contiguous_split
for me and he suggested a path to get this done that I just followed (thank you!!)Checklist