Fix batch processing for parquet writer #13438

Merged: 9 commits, May 25, 2023
Changes from 6 commits
18 changes: 12 additions & 6 deletions cpp/src/io/parquet/writer_impl.cu
@@ -1833,17 +1833,23 @@ auto convert_table_to_parquet_data(table_input_metadata& table_meta,
   // Initialize data pointers in batch
   uint32_t const num_stats_bfr =
     (stats_granularity != statistics_freq::STATISTICS_NONE) ? num_pages + num_chunks : 0;
-  rmm::device_buffer uncomp_bfr(max_uncomp_bfr_size, stream);
-  rmm::device_buffer comp_bfr(max_comp_bfr_size, stream);
-  rmm::device_buffer col_idx_bfr(column_index_bfr_size, stream);
+  std::vector<rmm::device_buffer> uncomp_bfr;
+  std::vector<rmm::device_buffer> comp_bfr;
+  std::vector<rmm::device_buffer> col_idx_bfr;
+  for (std::size_t i = 0; i < batch_list.size(); ++i) {
+    uncomp_bfr.emplace_back(max_uncomp_bfr_size, stream);
Contributor:

This can lead to large memory overuse when batches are unbalanced. For instance, in testing I had a file whose first batch held 14 row groups and whose second batch held a single row group, with all row groups the same size. In that case this allocates two buffers each large enough for 14 row groups, 28 in total, instead of the 15 actually needed. Could you add a note here to that effect?
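The reviewer's arithmetic can be checked with a small host-side sketch (the batch sizes and helper names are illustrative, not from the writer itself):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Units allocated if every batch buffer is sized to the largest batch,
// which is what allocating max_uncomp_bfr_size per batch amounts to.
std::size_t allocated_at_max(std::vector<std::size_t> const& batches) {
  return *std::max_element(batches.begin(), batches.end()) * batches.size();
}

// Units actually needed: the sum of the per-batch sizes.
std::size_t needed_total(std::vector<std::size_t> const& batches) {
  return std::accumulate(batches.begin(), batches.end(), std::size_t{0});
}
```

With the reviewer's example of batches of 14 and 1 equally sized row groups, `allocated_at_max` yields 28 units while `needed_total` yields 15.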

Contributor Author:

The alternative solution (#13438 (comment)) seems better; with it, we won't overuse memory.
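A minimal sketch of that alternative, assuming each batch's required buffer size is known up front; `make_batch_buffers` is a hypothetical helper, and `std::vector<uint8_t>` stands in for `rmm::device_buffer`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-batch allocation: each batch gets a buffer sized to its
// own requirement rather than to the global maximum, so unbalanced batches
// no longer overallocate.
std::vector<std::vector<std::uint8_t>> make_batch_buffers(
    std::vector<std::size_t> const& batch_sizes) {
  std::vector<std::vector<std::uint8_t>> buffers;
  buffers.reserve(batch_sizes.size());
  for (auto size : batch_sizes) {
    buffers.emplace_back(size);  // one exactly-sized buffer per batch
  }
  return buffers;
}
```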

+    comp_bfr.emplace_back(max_comp_bfr_size, stream);
+    col_idx_bfr.emplace_back(column_index_bfr_size, stream);
+  }

   rmm::device_uvector<gpu::EncPage> pages(num_pages, stream);

   // This contains stats for both the pages and the rowgroups. TODO: make them separate.
   rmm::device_uvector<statistics_chunk> page_stats(num_stats_bfr, stream);
-  auto bfr_i = static_cast<uint8_t*>(col_idx_bfr.data());
   for (auto b = 0, r = 0; b < static_cast<size_type>(batch_list.size()); b++) {
-    auto bfr   = static_cast<uint8_t*>(uncomp_bfr.data());
-    auto bfr_c = static_cast<uint8_t*>(comp_bfr.data());
+    auto bfr_i = static_cast<uint8_t*>(col_idx_bfr[b].data());
+    auto bfr   = static_cast<uint8_t*>(uncomp_bfr[b].data());
+    auto bfr_c = static_cast<uint8_t*>(comp_bfr[b].data());
     for (auto j = 0; j < batch_list[b]; j++, r++) {
       for (auto i = 0; i < num_columns; i++) {
         gpu::EncColumnChunk& ck = chunks[r][i];
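The loop above carves per-chunk regions out of each batch's buffers by advancing raw pointers. A host-side sketch of that offset arithmetic (the helper name and chunk sizes are illustrative, not from the writer):

```cpp
#include <cstddef>
#include <vector>

// Given the byte sizes of consecutive chunks within one batch buffer,
// return each chunk's starting offset: a running cursor advances by each
// chunk's size, so chunk k starts at the sum of the sizes before it.
std::vector<std::size_t> chunk_offsets(std::vector<std::size_t> const& chunk_sizes) {
  std::vector<std::size_t> offsets;
  offsets.reserve(chunk_sizes.size());
  std::size_t cursor = 0;
  for (auto size : chunk_sizes) {
    offsets.push_back(cursor);
    cursor += size;
  }
  return offsets;
}
```

Because each batch now owns its own buffers, the cursor resets at the start of every batch instead of running across the whole table.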