-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce peak memory use when writing compressed ORC files. #12963
Reduce peak memory use when writing compressed ORC files. #12963
Conversation
…reduce-orc-writer-mem
…reduce-orc-writer-mem
…reduce-orc-writer-mem
…reduce-orc-writer-mem
…reduce-orc-writer-mem
…reduce-orc-writer-mem
This does increase memory usage when compression is not used. |
…into reduce-orc-writer-mem
…reduce-orc-writer-mem
} | ||
if (!t) { strm_desc[stripe_id][stream_id].stream_size = dst_ptr - strm0.data_ptrs[cid]; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to set the stream size, its been computed on the host
// Allow extra space for alignment | ||
stripe_size += strm.lengths[strm_type] + uncomp_block_align - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potential bug fix for extreme corner cases where alignment can push the writing of encoded data into the next stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential bug as in we haven't seen this before in practice? Do we have a test for it? Should we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have this test, but it's not trivial to come up with the failing input.
Maybe a decimal column + ZSTD, since we use the exact size (and it doesn't have to be a multiple of 4, unlike floats)? I'll look into this, just not for this PR :)
…reduce-orc-writer-mem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new logic is much simpler. Only some non-blocking suggestions.
Thanks for the work!
auto const& ck = chunks[col_idx][rg_idx]; | ||
auto& strm = col_streams[rg_idx]; | ||
// per-stripe, per-stream owning buffers | ||
std::vector<std::vector<rmm::device_uvector<uint8_t>>> encoded_data(segmentation.num_stripes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR, but this reminds me that we could extract the owning buffer strong type here (
cudf/cpp/include/cudf/io/datasource.hpp
Line 352 in e37bddb
class owning_buffer : public buffer { |
owning_buffer
. This also helps unify data type by using std::byte
consistently.
Also, owning buffer could be enforced via unique ptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I get this one. rmm::device_uvector
already implies an owning buffer. We know that this type deallocates the memory in the destructor.
In the datasource::buffer we have the distinction between owning and non-owning buffers because we use the abstract buffer type, which does not imply ownership, only the access API to the contained data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rmm::device_uvector already implies an owning buffer
My intention was to use a strong type like:
struct owning_buffer{
...
std::unique_ptr<rmm::device_uvector<uint8_t>> data;
};
to explicitly represent owning buffers in cuIO.
std::vector<StripeInformation> gather_stripes(size_t num_index_streams, | ||
file_segmentation const& segmentation, | ||
encoded_data* enc_data, | ||
hostdevice_2dvector<gpu::StripeStream>* strm_desc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: For non-iterator parameter, try to avoid pointer.
hostdevice_2dvector<gpu::StripeStream>* strm_desc, | |
hostdevice_2dvector<gpu::StripeStream>& strm_desc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used a pointer here to make it obvious at the call site that the parameter will the modified. I personally prefer references for non-optional parameters.
Edit: to clarify - the use of a pointer here was requested in the code review, I initially used a reference.
Co-authored-by: Nghia Truong <[email protected]>
…into reduce-orc-writer-mem
Ran unit tests with cuda-memcheck on this branch, no errors found 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Have a question about the bug fix, but if we want to do something there I would suggest a new issue.
// blockDim {1024,1,1} | ||
__global__ void __launch_bounds__(1024) | ||
// blockDim {compact_streams_block_size,1,1} | ||
__global__ void __launch_bounds__(compact_streams_block_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
// Allow extra space for alignment | ||
stripe_size += strm.lengths[strm_type] + uncomp_block_align - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential bug as in we haven't seen this before in practice? Do we have a test for it? Should we?
/merge |
for (size_t col_idx = 0; col_idx < num_columns; col_idx++) { | ||
for (int strm_type = 0; strm_type < gpu::CI_NUM_STREAMS; ++strm_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why mix post-increment and pre-increment here?
// gathered stripes - per-stripe, per-stream (same as encoded_data.data) | ||
std::vector<std::vector<rmm::device_uvector<uint8_t>>> gathered_stripes(enc_data->data.size()); | ||
for (auto& stripe_data : gathered_stripes) { | ||
std::generate_n(std::back_inserter(stripe_data), enc_data->data[0].size(), [&]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would a stripe_data.reserve(enc_data->data[0].size());
help here?
Description
This PR changes how the buffer for encoded data is allocated in the ORC writer. Instead of a single buffer for the whole table, each stream of each stripe is allocated separately.
Since size of the encoded data is not known in advance, buffers are over sized in most cases (decimal types and dictionary encoded data being the exceptions). Resizing these buffers to the exact encoded data size before compression reduces peak memory usage.
The resizing of the encoded buffers is done in the step where row groups are gathered to make contiguous encoded stripe in memory. This way we don't incur additional copies (compared to previous approach to
gather_stripes
).Other changes:
Removed
compute_offsets
because it is not needed with separate buffers for each stripe/stream.Refactored parts of
encode_columns
to initialize data buffers + stream descriptors one stripe at a time, allowing future separation into per-stripe processing (for e.g. pipelining).Impact: internal benchmarks show average reduction of peak memory use of 14% when SNAPPY compression is enabled, with minimal impact on performance.
cub::DeviceMemcpy::Batched can now be used in the ORC writer.
Checklist