Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13763: [Python] Close files in ParquetFile & ParquetDatasetPiece #13821

Merged
merged 12 commits into from
Aug 17, 2022
Merged

ARROW-13763: [Python] Close files in ParquetFile & ParquetDatasetPiece #13821

merged 12 commits into from
Aug 17, 2022

Conversation

milesgranger
Copy link
Contributor

@milesgranger milesgranger commented Aug 9, 2022

Will fix ARROW-13763

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer.

Adds context manager to pq.ParquetFile to close input file, and ensure reads within pq.ParquetDataset and pq.read_table are closed.

# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed      # parquet context exit didn't close it
    assert not p.closed      # references the input file status
assert f.closed              # normal context exit close
assert p.closed              # ...

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed

@github-actions
Copy link

github-actions bot commented Aug 9, 2022

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

(haven't yet looked at the tests)

One additional complication I was thinking of: get_reader / get_native_file (and thus ParquetReader / ParquetFile) can actually also accept an open file object, in which case this gets wrapped in a PythonFile object.
But in the case the user passes an file object, we should maybe not close it? (while such a PythonFile object will pass through a close() call to the underlying file object)

Using a small example to illustrate:

file_handle = open("test.parquet", "rb")
with ParquetFile(file_handle) as pf:
    table = pf.read()

# should this pass?
assert file_handle.closed is False

python/pyarrow/_parquet.pyx Outdated Show resolved Hide resolved
python/pyarrow/parquet/core.py Show resolved Hide resolved
python/pyarrow/parquet/core.py Outdated Show resolved Hide resolved
@milesgranger
Copy link
Contributor Author

But in the case the user passes an file object, we should maybe not close it? (while such a PythonFile object will pass through a close() call to the underlying file object)

So this would imply calling ParquetFile.close would also not close the passed file? So my second example would be this?

p = pq.ParquetFile(buf)
table = p.read()
assert buf.closed is False

p.close()  # In the situation of a passed in open file object, this call would mean nothing, right?
assert buf.closed is False

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! A few comments on the tests (note they are technically fine, just some comments to ensure they are more consistent with how other tests are written)

python/pyarrow/tests/parquet/test_parquet_file.py Outdated Show resolved Hide resolved
python/pyarrow/tests/parquet/test_parquet_file.py Outdated Show resolved Hide resolved
python/pyarrow/tests/parquet/test_parquet_file.py Outdated Show resolved Hide resolved
python/pyarrow/_parquet.pyx Outdated Show resolved Hide resolved
python/pyarrow/_parquet.pyx Outdated Show resolved Hide resolved
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @milesgranger . Let's wait for CI.

@pitrou
Copy link
Member

pitrou commented Aug 17, 2022

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer.

Has that issue been opened already?

@milesgranger
Copy link
Contributor Author

It seems ARROW-16421 might cross over enough with the intended issue. See recent comments there on whether explicit closing ought to be done in C++.

@jorisvandenbossche
Copy link
Member

The windows failure is happening on master and other PRs as well.

@pitrou pitrou merged commit 951663a into apache:master Aug 17, 2022
@milesgranger milesgranger deleted the ARROW-13763_parquet-explicitly-close-files branch August 17, 2022 13:48
@ursabot
Copy link

ursabot commented Aug 17, 2022

Benchmark runs are scheduled for baseline = b0422e5 and contender = 951663a. 951663a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️3.95% ⬆️0.85%] test-mac-arm
[Failed ⬇️3.01% ⬆️0.27%] ursa-i9-9960x
[Finished ⬇️1.57% ⬆️0.92%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 951663a4 ec2-t3-xlarge-us-east-2
[Finished] 951663a4 test-mac-arm
[Failed] 951663a4 ursa-i9-9960x
[Finished] 951663a4 ursa-thinkcentre-m75q
[Finished] b0422e5f ec2-t3-xlarge-us-east-2
[Failed] b0422e5f test-mac-arm
[Failed] b0422e5f ursa-i9-9960x
[Finished] b0422e5f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Aug 17, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
apache#13821)

Will fix [ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763)

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer. 

Adds context manager to `pq.ParquetFile` to close input file, and ensure reads within `pq.ParquetDataset` and `pq.read_table` are closed.

```python

# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed      # parquet context exit didn't close it
    assert not p.closed      # references the input file status
assert f.closed              # normal context exit close
assert p.closed              # ...

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed
```

Authored-by: Miles Granger <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants