
Add partitioning support to Parquet chunked writer #10000

Merged

Conversation

@devavret (Contributor) commented Jan 7, 2022:

The chunked writer (class ParquetWriter) now takes a partition_cols argument. On each call to write_table(df), the df is partitioned and the parts are appended to the corresponding files in the dataset directory. This is useful when partitioning is desired but one wants to avoid creating many small files in each sub-directory. For example, instead of repeated calls to write_to_dataset, like so:

write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...

which will yield the following structure:

root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...

One can instead write:

pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()

to get the structure

root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
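
For concreteness, here is a fuller sketch of the new pattern with toy data (the DataFrame contents, the root_dir string, and the import path are illustrative assumptions, not taken from this PR):

import cudf
from cudf.io.parquet import ParquetWriter  # import path assumed for illustration

df1 = cudf.DataFrame({"group": ["a", "b"], "val": [1, 2]})
df2 = cudf.DataFrame({"group": ["a", "b"], "val": [3, 4]})

pw = ParquetWriter("root_dir", partition_cols=["group"])
pw.write_table(df1)  # rows are split by 'group' and appended per partition
pw.write_table(df2)  # appends to the same per-partition files
pw.close()           # finalizes one file per partition value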

Closes #7196
It also contains workaround fixes:
fixes #9216
fixes #7011

TODO:

  • Tests

The exception raised by the writer needs to be the same as that raised by pandas. If user_data is constructed earlier using pyarrow, then the exception is raised early and is different.
@devavret requested a review from benfred, January 11, 2022 11:27
@codecov bot commented Jan 11, 2022:

Codecov Report

Merging #10000 (6552fbe) into branch-22.02 (967a333) will decrease coverage by 0.09%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02   #10000      +/-   ##
================================================
- Coverage         10.49%   10.39%   -0.10%     
================================================
  Files               119      119              
  Lines             20305    20535     +230     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18401     +226     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@devavret requested reviews from shwina and vyasr, January 12, 2022 00:03
Comment on lines 953 to 954
def __del__(self):
    self.close()
@shwina (Contributor) commented Jan 12, 2022:

I'll add the usual comment about __del__ that it's not always guaranteed to be called, even at interpreter shutdown: https://docs.python.org/3/reference/datamodel.html#object.del

There are lots of warnings on Stack Overflow against using __del__ as a destructor. The suggestion is to use __enter__() and __exit__() methods instead, the latter of which would call close().

This changes the API, but guarantees that the close() method will be called:

with ParquetDatasetWriter(...) as pq_writer:
    pq_writer.write_table(df1)
    pq_writer.write_table(df2)
# pq_writer.close() will be called upon exiting the `with` statement

If we don't care that the close() method may not always be called, I think using __del__ is OK, and I've seen it done in other parts of RAPIDS. If it's absolutely essential that all ParquetDatasetWriter objects be close'd, we might want to reconsider.
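
As a minimal sketch, the context-manager protocol being suggested could look like this (method bodies elided; this is not the actual implementation):

class ParquetDatasetWriter:
    def write_table(self, df):
        ...  # partition df and append to the per-partition files

    def close(self):
        ...  # flush buffers and write the file footers

    def __enter__(self):
        # hand the writer back to the `with` block
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs on normal exit and on exceptions, so close() is guaranteed
        self.close()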

@devavret (Contributor, Author) replied:

> If it's absolutely essential that all ParquetDatasetWriter objects be close'd

It is.

Does Cython's __dealloc__ have the same limitation, or is it more like the C++ destructor? I ask because the non-partitioned ParquetWriter uses that.

@shwina (Contributor) commented Jan 12, 2022:

I've never been able to find a reliable reference on whether __dealloc__ (which translates into the type's tp_dealloc method) has the same limitations. My understanding is that it does.

However, now that you mention it, I'm seeing that ParquetWriter is doing something explicitly unsafe, i.e., calling a Python method in its __dealloc__ method. See here for why not to do that. The Cython docs here suggest just using __del__ (which translates into the type's tp_finalize method) in those situations instead...

To be frank, I don't know about situations in which __del__ or __dealloc__ may not be called. We rely on __dealloc__ within RAPIDS to free memory allocated in C++ and it has worked well. In any rare cases where it may not work, we wouldn't run into an error, "just" a memory leak. However, the ParquetWriter case may be different, where we need __dealloc__/__del__ to work as expected for correctness.

edit: fixed a link
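
To make the __dealloc__/__del__ distinction concrete, here is a small self-contained Cython sketch of the two hooks (illustrative only; the real writer wraps a C++ object rather than a raw buffer):

# sketch.pyx -- illustrative only
from libc.stdlib cimport malloc, free

cdef class Writer:
    cdef char* buf

    def __cinit__(self):
        self.buf = <char*> malloc(1024)

    def close(self):
        # Python-level finalization (flushing, footers, ...)
        pass

    def __dealloc__(self):
        # Touch only C-level state here; calling Python methods such as
        # self.close() is unsafe because the Python side of the object
        # may already be partially torn down.
        free(self.buf)

    def __del__(self):
        # Maps to tp_finalize (PEP 442); Python-level cleanup is allowed
        # here, but like any finalizer it is not guaranteed to run.
        self.close()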

Contributor:

My knowledge of the __del__/__dealloc__ methods aligns with what @shwina is saying here. I think that switching this API to a context manager (with …) that uses a try/finally to ensure the file is closed is the only way to guarantee the closure, unless you are willing to require users to explicitly call a close() method.

@devavret (Contributor, Author) replied:

The only user of ParquetWriter known to me is NVTabular, which wraps it in a ThreadedWriter class that has no destructor at all but does have a close(). Their close() calls ParquetWriter's close().

So should we just remove the destructor altogether?
And then should we add __enter__() and __exit__()? How does that fare as Python class design, having two ways to use it?

Contributor:

It's pretty standard. The file object in Python offers a similar interface, where you can explicitly close it with close, or use it within a with statement which will call close for you.

Calling close multiple times should be allowed, where subsequent calls to close() do nothing.
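
A minimal sketch of such an idempotent close(), using a hypothetical _closed flag and _finalize_files() helper (both names are illustrative):

def close(self):
    """Finalize all partition files; subsequent calls are no-ops."""
    if self._closed:        # hypothetical flag, initialized to False in __init__
        return
    self._closed = True
    self._finalize_files()  # hypothetical internal flush/footer step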

Contributor:

> So should we just remove the destructor altogether?

Yes, as long as we document that this class must either be used within a context manager (with statement) or, otherwise, that it's the user's responsibility to call close() explicitly. This can be documented with examples, as part of the class docstring.
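
Such a docstring example might show both usage patterns side by side (a sketch; the names follow the earlier snippets):

# Explicit close: the caller is responsible for finalizing the files.
writer = ParquetDatasetWriter(root_path, partition_cols=["group"])
try:
    writer.write_table(df)
finally:
    writer.close()

# Context manager: close() is called automatically on exit.
with ParquetDatasetWriter(root_path, partition_cols=["group"]) as writer:
    writer.write_table(df)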

@devavret (Contributor, Author):
rerun tests

@vyasr (Contributor) left a comment:

I have a couple of small suggestions but nothing major. There's one thing I'd like to understand about the metadata writing, and after that I'm happy to approve and let you decide whether/how to address my other comments.

- Replace part_info generator with numpy roll
- Add examples to docs
@devavret requested reviews from vyasr and shwina, January 13, 2022 21:13
@devavret (Contributor, Author):
@gpucibot merge

Labels
3 - Ready for Review (Ready for review by team), cuIO (cuIO issue), feature request (New feature or request), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
4 participants