Row-group-level partitioning for Parquet #9849
Closed
Commits (11)
- `27c785c` Add row_group_sizes parameter to to_parquet (calebwin)
- `c8710e9` Initialize page fragments using row group sizes (calebwin)
- `f360fdb` Decide rowgroup boundaries based on row_group_sizes (calebwin)
- `035912c` Add row_group_cols parameter to cudf.to_parquet (calebwin)
- `d3f85de` Allow write_to_dataset to accept row_group_cols (calebwin)
- `e591c28` Fix compilation issues (calebwin)
- `fcfb2bb` Fix Python compile-time bug but failing in CUDA code (calebwin)
- `0654742` Fix row_group_sizes in cudf.to_parquet (calebwin)
- `6a86f1b` Fix issue with row_group_cols in to_parquet (calebwin)
- `c2a5309` Fix issue with integration with Dask cuDF (calebwin)
- `6656167` Add docs for row_group_cols (calebwin)
```diff
@@ -95,16 +95,18 @@ struct map_find_fn {
 template <int block_size>
 __global__ void __launch_bounds__(block_size, 1)
   populate_chunk_hash_maps_kernel(cudf::detail::device_2dspan<EncColumnChunk> chunks,
+                                  cudf::detail::device_2dspan<PageFragment> fragments,
                                   size_type num_rows)
 {
   auto col_idx = blockIdx.y;
   auto block_x = blockIdx.x;
   auto t = threadIdx.x;

-  auto start_row =
-    block_x *
-    max_page_fragment_size;  // This is fragment size. all chunks are multiple of these many rows.
-  size_type end_row = min(start_row + max_page_fragment_size, num_rows);
+  auto start_row = 0;
+  for (auto i = 0; i < block_x; i++) {
+    start_row += fragments[0][i].num_rows;
+  }
+  size_type end_row = min(start_row + fragments[0][block_x].num_rows, num_rows);

   __shared__ EncColumnChunk* s_chunk;
   __shared__ parquet_column_device_view s_col;
@@ -245,14 +247,18 @@ __global__ void __launch_bounds__(block_size, 1)
 template <int block_size>
 __global__ void __launch_bounds__(block_size, 1)
   get_dictionary_indices_kernel(cudf::detail::device_2dspan<EncColumnChunk> chunks,
+                                cudf::detail::device_2dspan<PageFragment> fragments,
                                 size_type num_rows)
 {
   auto col_idx = blockIdx.y;
   auto block_x = blockIdx.x;
   auto t = threadIdx.x;

-  size_type start_row = block_x * max_page_fragment_size;
-  size_type end_row = min(start_row + max_page_fragment_size, num_rows);
+  auto start_row = 0;
+  for (auto i = 0; i < block_x; i++) {
+    start_row += fragments[0][i].num_rows;
+  }
+  size_type end_row = min(start_row + fragments[0][block_x].num_rows, num_rows);

   __shared__ EncColumnChunk s_chunk;
   __shared__ parquet_column_device_view s_col;
@@ -335,16 +341,16 @@ void initialize_chunk_hash_maps(device_span<EncColumnChunk> chunks, rmm::cuda_st
 }

 void populate_chunk_hash_maps(cudf::detail::device_2dspan<EncColumnChunk> chunks,
+                              cudf::detail::device_2dspan<PageFragment> fragments,
                               size_type num_rows,
                               rmm::cuda_stream_view stream)
 {
   constexpr int block_size = 256;
-  auto const grid_x = cudf::detail::grid_1d(num_rows, max_page_fragment_size);
   auto const num_columns = chunks.size().second;
-  dim3 const dim_grid(grid_x.num_blocks, num_columns);
+  dim3 const dim_grid(fragments.size().second, num_columns);

   populate_chunk_hash_maps_kernel<block_size>
-    <<<dim_grid, block_size, 0, stream.value()>>>(chunks, num_rows);
+    <<<dim_grid, block_size, 0, stream.value()>>>(chunks, fragments, num_rows);
 }

 void collect_map_entries(device_span<EncColumnChunk> chunks, rmm::cuda_stream_view stream)
@@ -354,16 +360,16 @@ void collect_map_entries(device_span<EncColumnChunk> chunks, rmm::cuda_stream_vi
 }

 void get_dictionary_indices(cudf::detail::device_2dspan<EncColumnChunk> chunks,
+                            cudf::detail::device_2dspan<PageFragment> fragments,
                             size_type num_rows,
                             rmm::cuda_stream_view stream)
 {
   constexpr int block_size = 256;
-  auto const grid_x = cudf::detail::grid_1d(num_rows, max_page_fragment_size);
   auto const num_columns = chunks.size().second;
-  dim3 const dim_grid(grid_x.num_blocks, num_columns);
+  dim3 const dim_grid(fragments.size().second, num_columns);

   get_dictionary_indices_kernel<block_size>
-    <<<dim_grid, block_size, 0, stream.value()>>>(chunks, num_rows);
+    <<<dim_grid, block_size, 0, stream.value()>>>(chunks, fragments, num_rows);
 }
 }  // namespace gpu
 }  // namespace parquet
```

Review comment on lines +257 to +260: Same as previous comment.
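The new per-block indexing above can be modeled host-side: block `block_x` owns one fragment, whose start row is the sum of all preceding fragments' `num_rows`. This is a minimal sketch for illustration only; the `PageFragment` stand-in below is reduced to a single field, unlike the real struct.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Minimal stand-in for the real PageFragment (which has more fields).
struct PageFragment {
  int num_rows;
};

// Host-side model of the per-block row-range computation in
// populate_chunk_hash_maps_kernel / get_dictionary_indices_kernel:
// block `block_x` covers rows [start_row, end_row) of the table,
// where start_row is the running sum of preceding fragment sizes.
std::pair<int, int> block_row_range(std::vector<PageFragment> const& fragments,
                                    int block_x,
                                    int num_rows)
{
  int start_row = 0;
  for (int i = 0; i < block_x; i++) { start_row += fragments[i].num_rows; }
  int end_row = std::min(start_row + fragments[block_x].num_rows, num_rows);
  return {start_row, end_row};
}
```

For fragments of sizes {4, 3, 5}, block 1 covers rows [4, 7) and block 2 covers [7, 12), clamped to `num_rows`.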
```diff
@@ -112,7 +112,7 @@ template <int block_size>
 __global__ void __launch_bounds__(block_size)
   gpuInitPageFragments(device_2dspan<PageFragment> frag,
                        device_span<parquet_column_device_view const> col_desc,
-                       uint32_t fragment_size,
+                       int32_t fragment_size,
                        uint32_t max_num_rows)
 {
   __shared__ __align__(16) frag_init_state_s state_g;
@@ -130,7 +130,11 @@ __global__ void __launch_bounds__(block_size)
   if (!t) {
     // frag.num_rows = fragment_size except for the last page fragment which can be smaller.
     // num_rows is fixed but fragment size could be larger if the data is strings or nested.
-    s->frag.num_rows = min(fragment_size, max_num_rows - min(start_row, max_num_rows));
+    if (fragment_size != -1) {
+      s->frag.num_rows = min(fragment_size, max_num_rows - min(start_row, max_num_rows));
+    } else {
+      s->frag.num_rows = frag[blockIdx.x][blockIdx.y].num_rows;
+    }
     s->frag.num_dict_vals = 0;
     s->frag.fragment_data_size = 0;
     s->frag.dict_data_size = 0;
@@ -1938,13 +1942,13 @@ dremel_data get_dremel_data(column_view h_col,
  *
  * @param[in,out] frag Fragment array [column_id][fragment_id]
  * @param[in] col_desc Column description array [column_id]
- * @param[in] num_fragments Number of fragments per column
+ * @param[in] num_fragments Number of fragments per column, -1 if fragment sizes already specified
  * @param[in] num_columns Number of columns
  * @param[in] stream CUDA stream to use, default 0
  */
 void InitPageFragments(device_2dspan<PageFragment> frag,
                        device_span<parquet_column_device_view const> col_desc,
-                       uint32_t fragment_size,
+                       int32_t fragment_size,
                        uint32_t num_rows,
                        rmm::cuda_stream_view stream)
 {
```

Review comment on the `fragment_size` parameter change: hyperbolic2346 marked this conversation as resolved.
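The `-1` sentinel introduced above can be modeled on the host. This is a hedged sketch with simplified names, not the kernel itself: a non-negative `fragment_size` gives uniform fragments (the trailing one clamped to the remaining rows), while `-1` means per-fragment row counts were already written into `frag` by the row-group partitioning code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Host model of the num_rows decision in gpuInitPageFragments:
// fragment_size == -1 signals that a per-fragment row count was
// precomputed (row-group-level partitioning); otherwise fragments
// are uniform except possibly the last one.
int32_t fragment_num_rows(int32_t fragment_size,
                          uint32_t start_row,
                          uint32_t max_num_rows,
                          int32_t precomputed_num_rows)
{
  if (fragment_size != -1) {
    // Clamp to the rows remaining after start_row (saturating at zero).
    uint32_t remaining = max_num_rows - std::min(start_row, max_num_rows);
    return static_cast<int32_t>(
      std::min(static_cast<uint32_t>(fragment_size), remaining));
  }
  return precomputed_num_rows;
}
```

For example, with `fragment_size = 5000` and 12000 total rows, the fragment starting at row 10000 gets only 2000 rows, while `fragment_size = -1` returns whatever count was precomputed.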
Comment:
Perhaps these offsets should be pre-computed with a scan and then passed into the kernel? I'm not sure how many row groups we expect. The difference between 10 and 1M would indicate whether this should be a host or device computation.
If we shouldn't use a scan and pass in the precomputed offsets, then this could use `std::accumulate`. It might look something like this snippet (untested). (Note: `page` might not be the right name for the function argument; I am just guessing from `device_2dspan<PageFragment>`.)
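The reviewer's snippet was not captured in this page. As a rough illustration of the idea only (not the reviewer's actual code; the names `fragment_rows` and `fragment_start_rows` are invented here), the offsets could be produced once up front with an exclusive scan, or for a single block with `std::accumulate`:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Precompute all fragment start rows with one exclusive scan:
// starts[i] = sum of fragment_rows[0..i). (C++17 std::exclusive_scan.)
std::vector<int> fragment_start_rows(std::vector<int> const& fragment_rows)
{
  std::vector<int> starts(fragment_rows.size(), 0);
  std::exclusive_scan(fragment_rows.begin(), fragment_rows.end(), starts.begin(), 0);
  return starts;
}

// Equivalent one-off computation for a single block index, using
// std::accumulate as the comment suggests.
int start_row_for_block(std::vector<int> const& fragment_rows, int block_x)
{
  return std::accumulate(fragment_rows.begin(), fragment_rows.begin() + block_x, 0);
}
```

Either way, the kernel would then index into a precomputed offsets array instead of each thread block re-summing the sizes of all preceding fragments.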
Comment:
Do you mean like this: `cudf/cpp/src/io/parquet/chunk_dict.cu`, line 107 at `f44a50b`?
Comment:
@devavret Looks about right! I'm just trying to avoid a loop on each thread when we could use a single-pass scan ahead of time. I see you've worked on this in #9810. That logic should be used here. Does #9810 need to be merged first?
Comment:
Actually, that's what I was wondering just now. #9810 is close to completion, and if it is merged first there will be many merge conflicts with this PR. I'm fine with merging #9810 later, or with taking over this one if it remains unmerged due to merge conflicts.
cc @quasiben @vuule
Comment:
@devavret do you have an ETA for addressing the current comments on #9810?
Comment:
My 2 cents: let's aim to merge this first so Caleb has a chance to get this PR as close to the finish line as possible. If #9810 already addresses some comments here, maybe those pieces can be applied to this PR (which also reduces merge conflicts).
I'm not sure what the best approach is; I'm inclined to leave the decision up to @devavret and @calebwin.
Comment:
Thanks @bdice @hyperbolic2346 @devavret @vuule for the reviews and comments. I just ran into a subtle CUDA bug in this PR while writing a benchmark. I looked through #9810, and it looks like there are changes in the CUDA code that may handle edge cases I didn't consider here.
So I'm going to go ahead and try to merge #9810 into this and make the appropriate changes. I'll see if that resolves the issue I came across when benchmarking, and then address the other reviews here.
Should I convert this PR to a draft in the meanwhile?