Row-group-level partitioning for Parquet #9849
Conversation
cc @rjzamora
Yes, in fact that's exactly how I did it in this partitioning PR (#9810): cudf/cpp/src/io/parquet/chunk_dict.cu, line 303 in 200d1b0.
Nice feature @calebwin! I have some comments below. I'm not deeply familiar with this code so let me know if anything I suggested seems off base.
auto start_row = 0;
for (auto i = 0; i < block_x; i++) {
  start_row += fragments[0][i].num_rows;
}
Perhaps these offsets should be pre-computed with a scan and then passed into the kernel? I'm not sure how many row groups we expect. The difference between 10 and 1M would indicate whether this should be a host or device computation.
If we shouldn't use a scan and pass in the precomputed offsets, then this could use std::accumulate. It might look something like this snippet (untested):
auto row_counter = thrust::make_transform_iterator(
  fragments[0].begin(), [] __device__(auto const& page) { return page.num_rows; });
auto start_row = std::accumulate(row_counter, row_counter + block_x, 0);
(Note: page might not be the right name for the function argument, I am just guessing from device_2dspan<PageFragment>.)
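For the scan option mentioned above, a minimal host-side sketch (untested; the helper and variable names here are hypothetical, not from this PR) could look like:

#include <numeric>
#include <vector>

// Hypothetical helper: compute each row group's starting row on the host with an
// exclusive scan over the per-row-group row counts, then pass the offsets to the
// kernel (e.g. via a device vector) instead of re-summing them in every block.
std::vector<int> compute_start_rows(std::vector<int> const& row_group_num_rows)
{
  std::vector<int> start_rows(row_group_num_rows.size());
  std::exclusive_scan(row_group_num_rows.begin(), row_group_num_rows.end(), start_rows.begin(), 0);
  return start_rows;
}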
Do you mean like this? cudf/cpp/src/io/parquet/chunk_dict.cu, line 107 in f44a50b:
size_type start_row = frag.start_row;
My 2cts:
Let's aim to merge this first so Caleb has a chance to get the PR as close to the finish line as possible. If 9810 already addresses some comments here, maybe those pieces can be applied to this PR (which also reduces merge conflicts).
I'm not sure what the best approach is; I'm inclined to leave the decision up to @devavret and @calebwin.
Thanks @bdice @hyperbolic2346 @devavret @vuule for reviews and comments. I just ran into a subtle CUDA bug in this PR when I was in the middle of writing a benchmark. I looked through #9810 and it looks like there are changes in the CUDA code that may handle edge cases I didn't consider here.
So I'm going to go ahead and try to merge #9810 into this and make appropriate changes. I will see if that resolves the issue I came across when benchmarking. I will then try to address other reviews here.
Should I convert this PR to draft in the meantime?
auto start_row = 0;
for (auto i = 0; i < block_x; i++) {
  start_row += fragments[0][i].num_rows;
}
Same as previous comment.
@@ -20,6 +20,7 @@
 */

 #include <io/statistics/column_statistics.cuh>
+#include "io/parquet/parquet_gpu.hpp"
This should probably use angle brackets:
#include <io/parquet/parquet_gpu.hpp>
#include <iostream>
Looks like this was left in from some print debugging? If it is needed, it should go with the other section of stdlib headers (like #include <algorithm> below), rather than with the rmm includes.
cudf::detail::hostdevice_2dvector<gpu::PageFragment> fragments(
  num_columns, num_fragments, stream);

if (row_group_sizes_specified) {
  // auto fragments_span = host_2dspan<gpu::PageFragment>{fragments};
Why is this line commented out?
write_df.to_parquet(
    fil,
    index=preserve_index,
    row_group_cols=row_group_cols,
    **kwargs,
)
I think this section of code can be written to only call to_parquet once. Something roughly like this, which updates **kwargs and optionally keeps the result:
if return_metadata:
    kwargs["metadata_file_path"] = fs.sep.join([subdir, filename])
metadata_result = write_df.to_parquet(
    fil,
    index=preserve_index,
    row_group_cols=row_group_cols,
    **kwargs,
)
if return_metadata:
    metadata.append(metadata_result)
for a, b in zip(col_names, df.columns):
    assert a == b
Is this equivalent?
assert col_names == df.columns
Aside from being shorter, it is preferable to compare all the column names at once because it produces a nicer error message if it fails.
num_rows, row_groups, col_names = cudf.io.read_parquet_metadata(fname)

assert num_rows == len(df.index)
assert row_groups == len(row_group_sizes)
This line verifies the number of row groups, but I don't think this test is checking the number of rows in each row group. That seems important to test here.
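One way to check the per-group sizes (a sketch, not code from this PR; it assumes the fname and row_group_sizes variables from the test above and that the groups are written in the given order) would be:

import pyarrow.parquet as pq

# Verify each row group contains the expected number of rows,
# not just that the total number of row groups matches.
md = pq.ParquetFile(fname).metadata
assert md.num_row_groups == len(row_group_sizes)
for i, expected_rows in enumerate(row_group_sizes):
    assert md.row_group(i).num_rows == expected_rows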
Column names by which to partition the dataset across row groups in the
resulting Parquet file
Columns are partitioned in the order they are given
Sentences in docstrings should end in a period. Line breaks should be avoided except to wrap at the column limit.
Column names by which to partition the dataset across row groups in the
resulting Parquet file. Columns are partitioned in the order they are
given.
    tmpdir, partition_on=partition_on, row_group_cols=row_group_cols
)
ddf_read = dask_cudf.read_parquet(tmpdir)
assert_eq(len(ddf), len(ddf_read))
assert_eq is intended for more complicated assertions about dataframes being equivalent. Comparing lengths should be done with a plain assert. However, it is probably a good idea to make sure the dataframe written/read is equivalent to the source dataframe in memory:
assert len(ddf) == len(ddf_read)
assert_eq(ddf, ddf_read)
Thanks for working on this. Some comments. Interested to see this progress.
if (fragment_size != -1) {
  s->frag.num_rows = min(fragment_size, max_num_rows - min(start_row, max_num_rows));
} else {
  s->frag.num_rows = frag[blockIdx.x][blockIdx.y].num_rows;
}
s->frag.num_rows = fragment_size != -1
                     ? min(fragment_size, max_num_rows - min(start_row, max_num_rows))
                     : frag[blockIdx.x][blockIdx.y].num_rows;
if (row_group_sizes_specified) {
  num_fragments = 0;
  for (std::size_t i = 0; i < row_group_sizes.size(); i++) {
    num_fragments += (row_group_sizes[i] + max_page_fragment_size - 1) / max_page_fragment_size;
This should be a std::accumulate, as Bradley showed above.
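For example, something along these lines (an untested sketch; it assumes row_group_sizes holds the per-row-group row counts, as in the surrounding code):

#include <numeric>  // std::accumulate

// Fold the ceil-divided fragment counts instead of using a raw loop.
num_fragments = std::accumulate(
  row_group_sizes.cbegin(), row_group_sizes.cend(), 0,
  [&](auto sum, auto num_rows) {
    return sum + (num_rows + max_page_fragment_size - 1) / max_page_fragment_size;
  });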
I'm going to close this since it's fairly out of date and it's not clear if we still want this as-is. Feel free to reopen if work on this restarts.
This PR introduces a row_group_cols parameter for cudf.to_parquet that groups data by the given columns and writes separate groups to separate row groups (row groups are groups of rows within a Parquet file). This is similar to partition_cols, except that instead of separate groups being written to separate files, separate groups are written to separate row groups within a single file.

You should use row_group_cols when you want to partition data on a column but there are too many groups (or combinations of groups, if you are also partitioning on a secondary column), which would result in too many small files.
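A minimal usage sketch based on the description above (illustrative only; the column names, values, and output paths are made up):

import cudf

df = cudf.DataFrame(
    {"year": [2020, 2020, 2021, 2021], "month": [1, 2, 1, 2], "value": [10, 20, 30, 40]}
)

# partition_cols: each group is written to a separate file under a directory
df.to_parquet("out_partitioned/", partition_cols=["year"])

# row_group_cols (this PR): each group becomes a separate row group in one file
df.to_parquet("out.parquet", row_group_cols=["year"])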
What's left:

Notes to reviewers:
- Since dask_cudf.to_parquet accepts both partition_on and row_group_cols, should row_group_cols be renamed to row_group_on for API consistency?
- Are the populate_chunk_hash_maps and get_dictionary_indices kernels correct?
- row_group_cols - is this a bug we should fix before merging? It seems to be related to [1] and [2].

[1] https://issues.apache.org/jira/browse/ARROW-9136
[2] pandas-dev/pandas#34790.