
Add partitioning support to Parquet chunked writer #10000

Merged

Conversation

@devavret (Contributor) commented Jan 7, 2022:

The chunked writer (class ParquetWriter) now takes a partition_cols argument. On each call to write_table(df), the df is partitioned and the parts are appended to the corresponding files in the dataset directory. This is useful when partitioning is desired but one wants to avoid creating many small files in each sub-directory. For example, instead of repeated calls to write_to_dataset, like so:

write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...

which will yield the following structure:

root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...

One can instead write:

pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()

to get the structure

root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
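
For concreteness, here is a fuller sketch of the new pattern with toy data (the DataFrame contents, the root_dir string, and the import path are illustrative assumptions, not taken from this PR):

import cudf
from cudf.io.parquet import ParquetWriter  # import path assumed for illustration

df1 = cudf.DataFrame({"group": ["a", "b"], "val": [1, 2]})
df2 = cudf.DataFrame({"group": ["a", "b"], "val": [3, 4]})

pw = ParquetWriter("root_dir", partition_cols=["group"])
pw.write_table(df1)  # rows are split by 'group' and appended per partition
pw.write_table(df2)  # appends to the same per-partition files
pw.close()           # finalizes one file per partition value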

Closes #7196
It also contains workaround fixes:
fixes #9216
fixes #7011

TODO:

  • Tests

The exception raised by the writer needs to be the same as that raised by pandas. If user_data is constructed earlier using pyarrow, then the exception is raised early and is different.
@devavret requested a review from benfred, January 11, 2022 11:27
@codecov bot commented Jan 11, 2022:

Codecov Report

Merging #10000 (6552fbe) into branch-22.02 (967a333) will decrease coverage by 0.09%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02   #10000      +/-   ##
================================================
- Coverage         10.49%   10.39%   -0.10%     
================================================
  Files               119      119              
  Lines             20305    20535     +230     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18401     +226     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@devavret requested reviews from shwina and vyasr, January 12, 2022 00:03
Comment on lines 953 to 954
def __del__(self):
    self.close()
@shwina (Contributor) commented Jan 12, 2022:

I'll add the usual comment about __del__ that it's not always guaranteed to be called, even at interpreter shutdown: https://docs.python.org/3/reference/datamodel.html#object.del

There are lots of warnings on Stack Overflow against using __del__ as a destructor. The suggestion is to use __enter__() and __exit__() methods instead, the latter of which would call close().

This changes the API, but guarantees that the close() method will be called:

with ParquetDatasetWriter(...) as pq_writer:
    pq_writer.write_table(df1)
    pq_writer.write_table(df2)
# pq_writer.close() will be called upon exiting the `with` statement

If we don't care that the close() method may not always be called, I think using __del__ is OK, and I've seen it done in other parts of RAPIDS. If it's absolutely essential that all ParquetDatasetWriter objects be close'd, we might want to reconsider.
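
As a minimal sketch, the context-manager protocol being suggested could look like this (method bodies elided; this is not the actual implementation):

class ParquetDatasetWriter:
    def write_table(self, df):
        ...  # partition df and append to the per-partition files

    def close(self):
        ...  # flush buffers and write the file footers

    def __enter__(self):
        # hand the writer back to the `with` block
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs on normal exit and on exceptions, so close() is guaranteed
        self.close()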

@devavret (Contributor, Author) replied:

> If it's absolutely essential that all ParquetDatasetWriter objects be close'd

It is.

Does Cython's __dealloc__ have the same limitation, or is it more like the C++ destructor? I ask because the non-partitioned ParquetWriter uses that.

@shwina (Contributor) commented Jan 12, 2022:

I've never been able to find a reliable reference on whether __dealloc__ (which translates into the type's tp_dealloc method) has the same limitations. My understanding is that it does.

However, now that you mention it, I'm seeing that ParquetWriter is doing something explicitly unsafe, i.e., calling a Python method in its __dealloc__ method. See here for why not to do that. The Cython docs here suggest just using __del__ (which translates into the type's tp_finalize method) in those situations instead...

To be frank, I don't know about situations in which __del__ or __dealloc__ may not be called. We rely on __dealloc__ within RAPIDS to free memory allocated in C++ and it has worked well. In any rare cases where it may not work, we wouldn't run into an error, "just" a memory leak. However, the ParquetWriter case may be different, where we need __dealloc__/__del__ to work as expected for correctness.

edit: fixed a link
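
To make the __dealloc__/__del__ distinction concrete, here is a small self-contained Cython sketch of the two hooks (illustrative only; the real writer wraps a C++ object rather than a raw buffer):

# sketch.pyx -- illustrative only
from libc.stdlib cimport malloc, free

cdef class Writer:
    cdef char* buf

    def __cinit__(self):
        self.buf = <char*> malloc(1024)

    def close(self):
        # Python-level finalization (flushing, footers, ...)
        pass

    def __dealloc__(self):
        # Touch only C-level state here; calling Python methods such as
        # self.close() is unsafe because the Python side of the object
        # may already be partially torn down.
        free(self.buf)

    def __del__(self):
        # Maps to tp_finalize (PEP 442); Python-level cleanup is allowed
        # here, but like any finalizer it is not guaranteed to run.
        self.close()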

Contributor:

My knowledge of the __del__/__dealloc__ methods aligns with what @shwina is saying here. I think that switching this API to a context manager (with …) that uses a try/finally to ensure the file is closed is the only way to guarantee the closure, unless you are willing to require users to explicitly call a close() method.

@devavret (Contributor, Author) replied:

The only user of ParquetWriter known to me is NVTabular, which wraps it in a ThreadedWriter class that has no destructor at all but does have a close(). Their close() calls ParquetWriter's close().

So should we just remove the destructor altogether?
And then should we add __enter__() and __exit__()? How does that fare as Python class design, having two ways to use it?

Contributor:

It's pretty standard. The file object in Python offers a similar interface, where you can explicitly close it with close, or use it within a with statement which will call close for you.

Calling close multiple times should be allowed, where subsequent calls to close() do nothing.
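
A minimal sketch of such an idempotent close(), using a hypothetical _closed flag and _finalize_files() helper (both names are illustrative):

def close(self):
    """Finalize all partition files; subsequent calls are no-ops."""
    if self._closed:        # hypothetical flag, initialized to False in __init__
        return
    self._closed = True
    self._finalize_files()  # hypothetical internal flush/footer step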

Contributor:

> So should we just remove the destructor altogether?

Yes, as long as we document that this class must either be used within a context manager (with statement) or, otherwise, that it's the user's responsibility to call close() explicitly. This can be documented with examples, as part of the class docstring.
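
Such a docstring example might show both usage patterns side by side (a sketch; the names follow the earlier snippets):

# Explicit close: the caller is responsible for finalizing the files.
writer = ParquetDatasetWriter(root_path, partition_cols=["group"])
try:
    writer.write_table(df)
finally:
    writer.close()

# Context manager: close() is called automatically on exit.
with ParquetDatasetWriter(root_path, partition_cols=["group"]) as writer:
    writer.write_table(df)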

@devavret (Contributor, Author):
rerun tests

@vyasr (Contributor) left a comment:

I have a couple of small suggestions but nothing major. There's one thing I'd like to understand about the metadata writing, and after that I'm happy to approve and let you decide whether/how to address my other comments.

- Replace part_info generator with numpy roll
- Add examples to docs
@devavret requested reviews from vyasr and shwina, January 13, 2022 21:13
@devavret (Contributor, Author):
@gpucibot merge

Labels
3 - Ready for Review (Ready for review by team), cuIO (cuIO issue), feature request (New feature or request), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
4 participants