
Reduce peak memory use when writing compressed ORC files. #12963

Merged

Merged 38 commits into rapidsai:branch-23.06 from the reduce-orc-writer-mem branch on Apr 3, 2023

Conversation

@vuule (Contributor) commented Mar 17, 2023

Description

This PR changes how the buffer for encoded data is allocated in the ORC writer. Instead of a single buffer for the whole table, each stream of each stripe is allocated separately.
Since the size of the encoded data is not known in advance, the buffers are oversized in most cases (decimal types and dictionary-encoded data being the exceptions). Resizing these buffers to the exact encoded data size before compression reduces peak memory usage.
The resizing of the encoded buffers is done in the step where row groups are gathered to make each encoded stripe contiguous in memory. This way we don't incur additional copies (compared to the previous approach in gather_stripes).
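Here is a minimal sketch of the allocation scheme (the `allocate_encoded_buffers` helper and `worst_case_sizes` input are illustrative names, not the writer's actual code): one owning device buffer per stripe/stream pair, oversized at allocation time.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstddef>
#include <cstdint>
#include <vector>

using stream_buffers = std::vector<rmm::device_uvector<uint8_t>>;

std::vector<stream_buffers> allocate_encoded_buffers(
  std::vector<std::vector<std::size_t>> const& worst_case_sizes,  // [stripe][stream]
  rmm::cuda_stream_view stream)
{
  std::vector<stream_buffers> encoded_data;
  encoded_data.reserve(worst_case_sizes.size());
  for (auto const& stripe : worst_case_sizes) {
    stream_buffers buffers;
    buffers.reserve(stripe.size());
    // Oversized allocation; the exact encoded size is not known at this point.
    // Each buffer can later be replaced by an exactly sized one during the
    // gather step, piggybacking on the copy that step performs anyway.
    for (auto size : stripe) { buffers.emplace_back(size, stream); }
    encoded_data.push_back(std::move(buffers));
  }
  return encoded_data;
}
```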

Other changes:
  • Removed compute_offsets, which is not needed now that each stripe/stream has a separate buffer.
  • Refactored parts of encode_columns to initialize data buffers and stream descriptors one stripe at a time, allowing future separation into per-stripe processing (e.g., for pipelining).

Impact: internal benchmarks show an average reduction in peak memory use of 14% when SNAPPY compression is enabled, with minimal impact on performance.
cub::DeviceMemcpy::Batched can now be used in the ORC writer.
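As a hedged sketch of the call pattern such a follow-up could use (the `batched_gather` helper and its device-resident pointer/size arrays are illustrative, not the writer's code):

```cpp
#include <cub/device/device_memcpy.cuh>

#include <cstddef>
#include <cstdint>

// Gathers many buffers in a single batched kernel launch. d_srcs, d_dsts, and
// d_sizes are device-resident arrays describing the copies to perform.
void batched_gather(void** d_srcs,
                    void** d_dsts,
                    std::size_t* d_sizes,
                    std::uint32_t num_buffers,
                    cudaStream_t stream)
{
  // First call with null temp storage queries the required scratch size
  std::size_t temp_bytes = 0;
  cub::DeviceMemcpy::Batched(nullptr, temp_bytes, d_srcs, d_dsts, d_sizes, num_buffers, stream);

  void* d_temp = nullptr;
  cudaMallocAsync(&d_temp, temp_bytes, stream);

  // Second call performs all copies in one launch
  cub::DeviceMemcpy::Batched(d_temp, temp_bytes, d_srcs, d_dsts, d_sizes, num_buffers, stream);
  cudaFreeAsync(d_temp, stream);
}
```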

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 17, 2023
@vuule vuule added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Mar 17, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 17, 2023
@vuule (Contributor, Author) commented Mar 17, 2023

This does increase memory usage when compression is not used.
Overview:
With SNAPPY, average peak memory use goes from 1.66GB to 1.43GB. 👍
Without compression, average peak memory use goes from 867MB to 1GB. 👎

@vuule vuule marked this pull request as ready for review March 19, 2023 07:41
@vuule vuule requested a review from a team as a code owner March 19, 2023 07:41
}
if (!t) { strm_desc[stripe_id][stream_id].stream_size = dst_ptr - strm0.data_ptrs[cid]; }
@vuule (Contributor, Author):

No need to set the stream size; it's been computed on the host.

Comment on lines +1112 to +1113
// Allow extra space for alignment
stripe_size += strm.lengths[strm_type] + uncomp_block_align - 1;
@vuule (Contributor, Author):

This is a potential bug fix for extreme corner cases where alignment can push the writing of encoded data into the next stream.

@hyperbolic2346 (Contributor):
Potential bug as in we haven't seen this before in practice? Do we have a test for it? Should we?

@vuule (Contributor, Author) commented Mar 30, 2023:

It would be nice to have this test, but it's not trivial to come up with a failing input.
Maybe a decimal column + ZSTD, since we use the exact size (and it doesn't have to be a multiple of 4, unlike floats)? I'll look into this, just not in this PR :)
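For context, the slack reservation follows the standard align-up arithmetic; a generic sketch (`align_up` is a hypothetical helper, not libcudf's):

```cpp
#include <cstddef>

// Rounds offset up to the next multiple of align (align must be non-zero)
constexpr std::size_t align_up(std::size_t offset, std::size_t align)
{
  return (offset + align - 1) / align * align;
}

// Worst case: the offset is 1 byte past an alignment boundary, so aligning it
// advances by align - 1 bytes. That is exactly the slack reserved above, so an
// aligned write can never spill into the next stream's slot.
static_assert(align_up(1, 256) == 256 && align_up(256, 256) == 256);
```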

@vuule vuule requested a review from ttnghia March 28, 2023 07:03
@PointKernel (Member) left a comment:

The new logic is much simpler. Only some non-blocking suggestions.

Thanks for the work!

cpp/src/io/orc/orc_gpu.hpp (resolved)
cpp/src/io/orc/writer_impl.hpp (resolved)
cpp/src/io/orc/stripe_enc.cu (outdated; resolved)
cpp/src/io/orc/stripe_enc.cu (resolved)
auto const& ck = chunks[col_idx][rg_idx];
auto& strm = col_streams[rg_idx];
// per-stripe, per-stream owning buffers
std::vector<std::vector<rmm::device_uvector<uint8_t>>> encoded_data(segmentation.num_stripes());
@PointKernel (Member):
Not in this PR, but this reminds me that we could extract the owning-buffer strong type used here:

class owning_buffer : public buffer {

and apply it across cuIO code. The above code would then be a 2D vector of owning_buffer. This would also help unify the data type by using std::byte consistently.

Also, owning buffers could be enforced via unique_ptr.

@vuule (Contributor, Author):

I don't think I get this one. rmm::device_uvector already implies an owning buffer; we know that this type deallocates the memory in its destructor.
In datasource::buffer we have the distinction between owning and non-owning buffers because we use the abstract buffer type, which does not imply ownership, only the access API for the contained data.

@PointKernel (Member):

rmm::device_uvector already implies an owning buffer

My intention was to use a strong type like:

struct owning_buffer{
  ...
  std::unique_ptr<rmm::device_uvector<uint8_t>> data;
};

to explicitly represent owning buffers in cuIO.
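A minimal sketch of what that strong type could look like (hypothetical; the real design would live in cuIO and is not part of this PR):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstddef>
#include <memory>

// Wraps the buffer in a named type so ownership is explicit at use sites,
// and uses std::byte for a consistent raw-bytes element type
struct owning_buffer {
  std::unique_ptr<rmm::device_uvector<std::byte>> data;

  owning_buffer(std::size_t size, rmm::cuda_stream_view stream)
    : data{std::make_unique<rmm::device_uvector<std::byte>>(size, stream)}
  {
  }

  std::byte* device_ptr() { return data->data(); }
  std::size_t size() const { return data ? data->size() : 0; }
};
```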

std::vector<StripeInformation> gather_stripes(size_t num_index_streams,
file_segmentation const& segmentation,
encoded_data* enc_data,
hostdevice_2dvector<gpu::StripeStream>* strm_desc,
Contributor:

Nit: for a non-iterator parameter, try to avoid a pointer.

Suggested change
hostdevice_2dvector<gpu::StripeStream>* strm_desc,
hostdevice_2dvector<gpu::StripeStream>& strm_desc,

@vuule (Contributor, Author) commented Mar 29, 2023:

We used a pointer here to make it obvious at the call site that the parameter will be modified. I personally prefer references for non-optional parameters.
Edit: to clarify, the use of a pointer here was requested in code review; I initially used a reference.
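A toy illustration of the trade-off (hypothetical functions, not libcudf code): the pointer form makes the mutation visible at the call site, while the reference form keeps the signature conventional for a required parameter.

```cpp
#include <vector>

void append_by_pointer(std::vector<int>* out) { out->push_back(42); }
void append_by_reference(std::vector<int>& out) { out.push_back(42); }

int main()
{
  std::vector<int> v;
  append_by_pointer(&v);   // the & at the call site signals that v is modified
  append_by_reference(v);  // no syntactic hint at the call site
  return 0;
}
```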

@vuule (Contributor, Author) commented Mar 29, 2023

Ran unit tests with cuda-memcheck on this branch, no errors found 👍

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 29, 2023
@hyperbolic2346 (Contributor) left a comment:

Looks good to me. I have a question about the bug fix, but if we want to do something there I would suggest a new issue.

Before:
// blockDim {1024,1,1}
__global__ void __launch_bounds__(1024)
After:
// blockDim {compact_streams_block_size,1,1}
__global__ void __launch_bounds__(compact_streams_block_size)
Contributor:
🔥
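A minimal CUDA sketch of the pattern applauded above (kernel name and body are illustrative): the same named constant drives both __launch_bounds__ and the launch configuration, so the two cannot silently diverge.

```cpp
constexpr int compact_streams_block_size = 1024;

__global__ void __launch_bounds__(compact_streams_block_size)
compact_kernel(int* data)
{
  data[threadIdx.x] += 1;
}

void launch_compact(int* d_data, cudaStream_t stream)
{
  // One block of compact_streams_block_size threads, sized by the same constant
  compact_kernel<<<1, compact_streams_block_size, 0, stream>>>(d_data);
}
```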


@vuule vuule changed the base branch from branch-23.04 to branch-23.06 March 30, 2023 18:29
@vuule (Contributor, Author) commented Apr 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit 09b114e into rapidsai:branch-23.06 Apr 3, 2023
@vuule vuule deleted the reduce-orc-writer-mem branch April 3, 2023 15:26
Comment on lines +1070 to +1071
for (size_t col_idx = 0; col_idx < num_columns; col_idx++) {
for (int strm_type = 0; strm_type < gpu::CI_NUM_STREAMS; ++strm_type) {
Contributor:

Why mix post-increment and pre-increment here?

// gathered stripes - per-stripe, per-stream (same as encoded_data.data)
std::vector<std::vector<rmm::device_uvector<uint8_t>>> gathered_stripes(enc_data->data.size());
for (auto& stripe_data : gathered_stripes) {
std::generate_n(std::back_inserter(stripe_data), enc_data->data[0].size(), [&]() {
Contributor:

Would a stripe_data.reserve(enc_data->data[0].size()); help here?
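A self-contained sketch of the reserve-before-back_inserter pattern being suggested (std::vector<std::uint8_t> stands in for rmm::device_uvector<uint8_t>):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

using buffer = std::vector<std::uint8_t>;  // stand-in for a device buffer

std::vector<std::vector<buffer>> gather(std::size_t num_stripes, std::size_t num_streams)
{
  std::vector<std::vector<buffer>> gathered(num_stripes);
  for (auto& stripe_data : gathered) {
    // The suggested reserve: back_inserter then never reallocates the outer
    // vector while appending, avoiding moves of the already-built buffers
    stripe_data.reserve(num_streams);
    std::generate_n(std::back_inserter(stripe_data), num_streams, [] { return buffer{}; });
  }
  return gathered;
}
```

That said, moving an rmm::device_uvector is cheap (a pointer swap), so the win would be small.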

Labels

  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • cuIO: cuIO issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change