Add partitioning support to Parquet chunked writer #10000
Conversation
…en a previous write(table) call failed
The exception raised by the writer needs to be the same as the one raised by pandas. If user_data is constructed earlier using pyarrow, then the exception is raised earlier and is different.
This reverts commit d0de9a9.
Codecov Report
@@            Coverage Diff             @@
##           branch-22.02   #10000      +/-   ##
================================================
- Coverage         10.49%   10.39%    -0.10%
================================================
  Files               119      119
  Lines             20305    20535     +230
================================================
+ Hits               2130     2134       +4
- Misses            18175    18401     +226
Continue to review full report at Codecov.
python/cudf/cudf/io/parquet.py (Outdated)
def __del__(self):
    self.close()
I'll add the usual comment about __del__, that it's not always guaranteed to be called, even at interpreter shutdown: https://docs.python.org/3/reference/datamodel.html#object.__del__
There are lots of warnings on SO not to use __del__ as a destructor. The suggestion is to use the __enter__() and __exit__() methods instead, the latter of which will call close().
This changes the API, but guarantees that the close() method will be called:
with ParquetDatasetWriter(...) as pq_writer:
    pq_writer.write_table(df1)
    pq_writer.write_table(df2)
# pq_writer.close() will be called upon exiting the `with` statement
If we don't care that the close() method may not always be called, I think using __del__ is OK, and I've seen it done in other parts of RAPIDS. If it's absolutely essential that all ParquetDatasetWriter objects be close'd, we might want to reconsider.
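For reference, here is a minimal, self-contained sketch of the pattern being suggested, i.e. having __exit__() delegate to an idempotent close(). The class and attribute names are invented for illustration and this is not cudf's actual implementation:

# Sketch only; not cudf's ParquetDatasetWriter. Names here are invented.
class _SketchWriter:
    def __init__(self, path):
        self.path = path
        self.closed = False

    def write_table(self, df):
        if self.closed:
            raise ValueError("cannot write to a closed writer")
        # ... buffer/serialize `df` here ...

    def close(self):
        # Idempotent: safe to call more than once.
        if not self.closed:
            # ... flush buffers and write footer/metadata here ...
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the `with` block exits, even if the body raised.
        self.close()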
"If it's absolutely essential that all ParquetDatasetWriter objects be close'd"

It is.

Does Cython's __dealloc__ have the same limitation, or is it more like the C++ destructor? I ask because the non-partitioned ParquetWriter uses that.
I've never been able to find a reliable reference on whether __dealloc__ (which translates into the type's tp_dealloc method) has the same limitations. My understanding is that it does.
However, now that you mention it, I'm seeing that ParquetWriter is doing something explicitly unsafe, i.e., calling a Python method in its __dealloc__ method. See here for why not to do that. The Cython docs here suggest just using __del__ (which translates into the type's tp_finalize method) in those situations instead...
To be frank, I don't know about situations in which __del__ or __dealloc__ may not be called. We rely on __dealloc__ within RAPIDS to free memory allocated in C++ and it has worked well. In any rare cases where it may not work, we wouldn't run into an error, "just" a memory leak. However, the ParquetWriter case may be different, where we need __dealloc__/__del__ to work as expected for correctness.
edit: fixed a link
My knowledge of the del/dealloc methods aligns with what @shwina is saying here. I think that switching this API to a context manager (with …) that uses a try/finally to ensure the file is closed is the only way to guarantee the closure, unless you are willing to put it on the user as a requirement to explicitly call a "close" method.
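Roughly, the guarantee the with statement would provide is the try/finally written out below. This is only a sketch of the equivalence; the writer is assumed to expose write_table() and close() methods as discussed above:

# Sketch of what `with writer: ...` guarantees, written out by hand.
def write_all(writer, tables):
    try:
        for table in tables:
            writer.write_table(table)
    finally:
        writer.close()  # runs whether or not a write raised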
The only user of ParquetWriter known to me is nvtabular, and they wrap it in another ThreadedWriter class which does not have a destructor at all, but does have a close(). Their close() calls ParquetWriter's close().
So should we just remove the destructor altogether?
And then should we add __enter__() and __exit__()? How does that fare as Python class design, having two ways to use it?
It's pretty standard. The file object in Python offers a similar interface, where you can explicitly close it with close(), or use it within a with statement which will call close() for you.
Calling close() multiple times should be allowed, where subsequent calls to close() do nothing.
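As a concrete illustration with the built-in file object (the filename here is arbitrary):

# Built-in file objects already allow repeated close(); later calls are no-ops.
f = open("example.txt", "w")
f.write("some data")
f.close()
f.close()          # no error; closing an already-closed file does nothing
print(f.closed)    # True

# The `with` form calls close() for you:
with open("example.txt", "a") as f:
    f.write("more data")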
"So should we just remove the destructor altogether?"

Yes, as long as we document that this class must either be used within a context manager (with statement), or otherwise that it's the user's responsibility to call close() explicitly. This can be documented with examples, as part of the class docstring.
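A sketch of what such a docstring could look like; the wording and constructor arguments are assumptions, not the final cudf documentation:

class ParquetDatasetWriter:
    """Sketch of a possible docstring; not the actual cudf text.

    The writer must be closed to finalize the dataset. Either use it as a
    context manager:

        with ParquetDatasetWriter("dataset/", partition_cols=["year"]) as pw:
            pw.write_table(df)

    or call close() explicitly:

        pw = ParquetDatasetWriter("dataset/", partition_cols=["year"])
        pw.write_table(df)
        pw.close()
    """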
rerun tests
I have a couple of small suggestions but nothing major. There's one thing I'd like to understand about the metadata writing, and after that I'm happy to approve and let you decide whether/how to address my other comments.
- Replace part_info generator with numpy roll
- Add examples to docs
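For context on the first item, a hedged sketch of how numpy.roll can turn sorted partition-group start offsets into (start, size) pairs; the variable names and values are invented, not taken from cudf's code:

import numpy as np

# Start row of each partition group in a sorted table, plus the total row count.
offsets = np.array([0, 3, 7, 12])   # illustrative values
num_rows = 15

ends = np.roll(offsets, -1)   # shift the starts left; the last element wraps around
ends[-1] = num_rows           # replace the wrapped element with the table length
sizes = ends - offsets

part_info = list(zip(offsets.tolist(), sizes.tolist()))
print(part_info)              # [(0, 3), (3, 4), (7, 5), (12, 3)]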
@gpucibot merge
Chunked writer (class ParquetWriter) now takes an argument partition_cols. For each call to write_table(df), the df is partitioned and the parts are appended to the same corresponding file in the dataset directory. This can be used when partitioning is desired but one wants to avoid making many small files in each subdirectory: instead of repeated calls to write_to_dataset, which add a new file to every partition subdirectory on each call, one can write through the chunked writer and keep a single file per partition subdirectory, as sketched below.
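A hedged sketch of the two approaches (paths, column names, and exact signatures are assumptions, not the PR's original snippets):

# Illustrative sketch; paths, columns, and exact signatures are assumptions.
import cudf
from cudf.io.parquet import ParquetDatasetWriter, write_to_dataset

df1 = cudf.DataFrame({"year": [2020, 2021], "val": [1, 2]})
df2 = cudf.DataFrame({"year": [2020, 2021], "val": [3, 4]})

# Repeated write_to_dataset calls add a new small file to every partition
# subdirectory on each call, e.g. dataset/year=2020/ ends up with two files.
write_to_dataset(df1, "dataset", partition_cols=["year"])
write_to_dataset(df2, "dataset", partition_cols=["year"])

# The chunked writer appends each write_table call to the same per-partition
# file, so dataset_chunked/year=2020/ keeps a single file.
with ParquetDatasetWriter("dataset_chunked", partition_cols=["year"]) as pw:
    pw.write_table(df1)
    pw.write_table(df2)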
Closes #7196
Also includes workarounds that fix:
fixes #9216
fixes #7011
TODO: