ARROW-13763: [Python] Close files in ParquetFile & ParquetDatasetPiece #13821
Thanks for the PR!
(haven't yet looked at the tests)
One additional complication I was thinking of: get_reader / get_native_file (and thus ParquetReader / ParquetFile) can also accept an open file object, in which case it gets wrapped in a PythonFile object.
But if the user passes a file object, we should maybe not close it? (Such a PythonFile object will pass a close() call through to the underlying file object.)
Using a small example to illustrate:
from pyarrow.parquet import ParquetFile

file_handle = open("test.parquet", "rb")
with ParquetFile(file_handle) as pf:
    table = pf.read()
# should this pass?
assert file_handle.closed is False
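The concern can be reproduced with the standard library alone. In this minimal sketch, `PassThroughWrapper` is a hypothetical stand-in for any wrapper that forwards close() to the object it wraps; it is not pyarrow's PythonFile, and the byte payload is a placeholder rather than real Parquet data.

```python
import io


class PassThroughWrapper:
    """Hypothetical wrapper that forwards close() to the wrapped file object."""

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def read(self, size=-1):
        return self._fileobj.read(size)

    def close(self):
        # Forwarding close() means the caller's handle is closed as well.
        self._fileobj.close()


user_handle = io.BytesIO(b"placeholder bytes")
wrapper = PassThroughWrapper(user_handle)
data = wrapper.read()
wrapper.close()

# The handle the user opened is now closed, although the user never closed it.
handle_closed_by_wrapper = user_handle.closed
```

If the reader always closed its wrapper on exit, the assertion in the example above would fail for the same reason.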
So this would imply:
p = pq.ParquetFile(buf)
table = p.read()
assert buf.closed is False
p.close()  # In the situation of a passed-in open file object, this call would do nothing, right?
assert buf.closed is False
Along with a future test for when ParquetDataset accepts file-like objects.
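One way to get this behaviour is to record whether the reader opened the file itself and close it only in that case. The sketch below uses only the standard library; `OwningReader` and its `_owns_file` flag are hypothetical names chosen for illustration and do not describe how pyarrow implements it.

```python
import io
import os
import tempfile


class OwningReader:
    """Hypothetical reader that closes its source only if it opened it."""

    def __init__(self, source):
        if isinstance(source, (str, os.PathLike)):
            # Path-like input: the reader opens the file, so it owns it.
            self._file = open(source, "rb")
            self._owns_file = True
        else:
            # Open file object: the caller keeps ownership.
            self._file = source
            self._owns_file = False

    def read(self):
        return self._file.read()

    @property
    def closed(self):
        # Mirrors the status of the underlying file.
        return self._file.closed

    def close(self):
        if self._owns_file:
            self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


# Case 1: a caller-supplied file object survives the context exit.
buf = io.BytesIO(b"payload")
with OwningReader(buf) as reader:
    reader.read()
buf_closed_after_exit = buf.closed

# Case 2: a path opened by the reader is closed on context exit.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"payload")
    with OwningReader(path) as path_reader:
        content = path_reader.read()
    path_reader_closed = path_reader.closed
```

With this design, close() on a reader built from a caller's file object is a no-op, which matches the two assertions above.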
Thanks for the updates! A few comments on the tests (note they are technically fine, just some comments to ensure they are more consistent with how other tests are written)
[skip ci] Co-authored-by: Antoine Pitrou <[email protected]>
+1, thanks @milesgranger. Let's wait for CI.
Has that issue been opened already?
It seems ARROW-16421 might cross over enough with the intended issue. See recent comments there on whether explicit closing ought to be done in C++.
The Windows failure is happening on master and other PRs as well.
Benchmark runs are scheduled for baseline = b0422e5 and contender = 951663a. 951663a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
apache#13821)
Will fix [ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763)
A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer.
Adds context manager to `pq.ParquetFile` to close input file, and ensure reads within `pq.ParquetDataset` and `pq.read_table` are closed.
```python
# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed  # parquet context exit didn't close it
    assert not p.closed  # references the input file status
assert f.closed  # normal context exit close
assert p.closed  # ...

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed
```
Authored-by: Miles Granger <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Will fix ARROW-13763
A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer.
Adds context manager to pq.ParquetFile to close input file, and ensure reads within pq.ParquetDataset and pq.read_table are closed.