Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add partitioning support to Parquet chunked writer #10000
Add partitioning support to Parquet chunked writer #10000
Changes from 80 commits
7ca6570
80e03a4
21dc54b
d947abd
360bf87
942dd58
d454507
b2b44a6
0b6d33f
2beed73
4e21e99
e0d1f33
54de724
aa45827
06b2643
fffb41e
ecd3aa5
950f505
019ac25
1d55d1a
387c2ac
9a77f5e
dc157e1
79639ee
8c2927d
200d1b0
d0de9a9
d90245f
5a17a4c
2e1c359
f44a50b
be8c19a
b9b5c15
85ce42f
1e79453
be83945
e537314
91580b4
c53fea7
0dfcb89
085442c
099093d
2ffec7e
975207e
5dae74d
120f32e
39ec9a0
43d35cb
7d7444a
e66d855
2442be8
48354e9
72b1e81
e84fc1c
96139d4
2ec7d29
f6d4175
833bc88
15557b7
879294c
2bcf62c
cf13e81
7e57eec
48fb111
0513d54
eab23d0
5b00973
b47d38f
9d61a77
ddb95a3
912301f
bb30a09
18c3524
cd698ed
b717373
7278f4f
4b0efe5
ef69e62
fc4a6df
e1d608a
64aae8d
70612e6
6552fbe
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the usual comment about
__del__
that it's not always guaranteed to be called, even at interpreter shutdown: https://docs.python.org/3/reference/datamodel.html#object.delThere are lots of warnings in SO not to use
__del__
as a destructor. The suggestion is to use__enter__()
and__exit__()
methods instead, the latter of which will callclose()
.This changes the API, but guarantees that the
close()
method will be called:If we don't care that the
close()
method may not always be called, I think using__del__
is OK, and I've seen it done in other parts of RAPIDS. If it's absolutely essential that allParquetDatasetWriter
objects beclose
'd, we might want to reconsider.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is.
Does cython's
__dealloc__
have the same limitation or is it more like the c++ destructor? I ask this because the non-partitionedParquetWriter
uses that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never been able to find a reliable reference on whether
__dealloc__
(which translates into the type's tp_dealloc method) has the same limitations. My understanding is that it does.However, now that you mention it, I'm seeing that
ParquetWriter
is doing something explicitly unsafe, i.e., calling a Python method in its__dealloc__
method. See here for why not to do that. The Cython docs here suggest just using__del__
(which translates into the type's tp_finalize method) in those situations instead...To be frank, I don't know about situations in which
__del__
or__dealloc__
may not be called. We rely on__dealloc__
within RAPIDS to free memory allocated in C++ and it has worked well. In any rare cases where it may not work, we wouldn't run into an error, "just" a memory leak. However, theParquetWriter
case may be different, where we need__dealloc__
/__del__
to work as expected for correctness.edit: fixed a link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My knowledge of the del/dealloc methods aligns with what @shwina is saying here. I think that switching this API to a context manager (
with …
) that uses a try/finally to ensure the file is closed is the only way to guarantee the closure, unless you are willing to put it on the user as a requirement to explicitly call a “close” method.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only user of
ParquetWriter
known to me is nvtabular and they wrap it in anotherThreadedWriter
class which does not have a destructor at all, but does have aclose()
. Theirclose()
callsParquetWriter
'sclose()
.So should we just remove the destructor altogether?
And then should we add
__enter__()
and__exit__()
? How does that fare as python class design? Having two ways to use it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty standard. The
file
object in Python offers a similar interface, where you can explicitly close it withclose
, or use it within awith
statement which will callclose
for you.Calling
close
multiple times should be allowed, where subsequent calls toclose()
do nothing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as long as we document that this class must either be used within a context manager (
with
statement), or otherwise, that it's the user's responsibility to callclose()
explicitly . This can be documented with examples, as part of the class docstring.