Closing files after pyarrow.parquet read #10965

Closed
kimotorc opened this issue Aug 20, 2021 · 2 comments

@kimotorc

When I read in a parquet file using pyarrow.parquet.read_table, the files opened for read don't seem to close.

Is there a way to specifically close these opened files?

I looked at pyarrow.parquet.ParquetDataset as well and there doesn't seem to be a way to force the closure of the files opened for read.

Here's my use case:
I have a custom fsspec filesystem that I created to interface with an S3-like API. When open is called, the filesystem downloads the remote file locally and returns a custom file handle like this one.

It relies on __exit__ or close being called to clean up the local file, which never happens after reading with pyarrow.parquet.
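For readers without access to the linked handle, here is a minimal sketch of the kind of object described above. The class name `LocalCopyFile` and the helper `fake_download` are hypothetical stand-ins, not part of fsspec or pyarrow; the point is only that the local copy is removed when `close()` or `__exit__()` runs, and not otherwise.

```python
import os
import tempfile


class LocalCopyFile:
    """Hypothetical handle around a locally downloaded copy of a remote file."""

    def __init__(self, local_path):
        self._local_path = local_path
        self._file = open(local_path, "rb")

    def read(self, size=-1):
        return self._file.read(size)

    def seek(self, offset, whence=0):
        return self._file.seek(offset, whence)

    def tell(self):
        return self._file.tell()

    @property
    def closed(self):
        return self._file.closed

    def close(self):
        # Cleanup happens only here; if nobody calls close(), the copy stays on disk.
        if not self._file.closed:
            self._file.close()
            os.remove(self._local_path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


def fake_download(payload):
    """Stand-in for the real download step: writes payload to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".parquet")
    with os.fdopen(fd, "wb") as out:
        out.write(payload)
    return path
```

If a reader opens such a handle and simply drops the reference, neither method runs and the temporary file is left behind.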

@jorisvandenbossche
Member

ParquetDataset uses ParquetDatasetPiece to read the file(s), which in turn uses pq.ParquetFile to open and read the actual Parquet file. And indeed, it seems we open the file there but never close it. ParquetDatasetPiece also has an open_file_func that uses the filesystem's open method, but here too it is called as a plain function (not in a with block that would close the file afterwards), and we never explicitly close the file.

For pq.read_table, which version of pyarrow are you using? (Or can you try with use_legacy_dataset=False, to make sure you are testing the newer implementation, which doesn't use pq.ParquetFile?)

I see the issue when using pq.ParquetFile directly (checking with lsof that the file is still open), but I don't immediately see it with pq.read_table or pq.ParquetDataset in a simple example.
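Until the library closes files itself, one possible caller-side workaround is to open the file in a with block and hand the open file object to the reader. The sketch below assumes `fs` is an fsspec-style filesystem whose `open(path, "rb")` returns a context manager; the helper name `read_table_closing` and its `reader` parameter are hypothetical, with `reader` defaulting to `pq.read_table`, which accepts file-like objects.

```python
def read_table_closing(fs, path, reader=None, **kwargs):
    """Read a Parquet file and guarantee the handle is closed afterwards.

    `reader` is a hypothetical hook for substituting another reader;
    by default pyarrow.parquet.read_table is used.
    """
    if reader is None:
        import pyarrow.parquet as pq  # imported lazily, only when needed

        reader = pq.read_table
    # The with block calls __exit__ (and therefore close) on the handle,
    # even if the reader raises.
    with fs.open(path, "rb") as f:
        return reader(f, **kwargs)
```

This only helps when the caller controls the open call; it does not change what ParquetDataset does internally.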

To further discuss this, would you like to open a JIRA? (https://issues.apache.org/jira/projects/ARROW/issues/, which we use for tracking bug reports / feature requests)

@kimotorc
Author

Thanks, I've filed a JIRA (ARROW-13763) with some minimal code attached to demonstrate how the files are closed.

I've tried with both use_legacy_dataset=False and True. In both cases the files opened for read are never explicitly closed (neither close() nor __exit__() is called).

However, lsof shows that they don't stay open. I'm fairly sure Python's garbage collector is cleaning up the opened files.
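One way to check for explicit closes, as opposed to handles that merely disappear from lsof, is to wrap the file object and record the calls. This is a rough sketch, not the code attached to the JIRA; `CloseTracker` is a hypothetical name.

```python
import io


class CloseTracker:
    """Wraps a file object and records explicit close() / __exit__() calls."""

    def __init__(self, raw):
        self._raw = raw
        self.calls = []

    def __getattr__(self, name):
        # Delegate read, seek, tell, closed, etc. to the wrapped file.
        return getattr(self._raw, name)

    def close(self):
        self.calls.append("close")
        self._raw.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.calls.append("__exit__")
        self._raw.close()
```

After a read, an empty `calls` list means nothing closed the handle explicitly, even if the underlying descriptor was later released by garbage collection.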

As a workaround, I've added a __del__() method to my filesystem's file object to catch these cases, but I've read that relying on __del__ is discouraged.
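A commonly suggested alternative to `__del__` is `weakref.finalize` from the standard library, which registers a callback that runs at most once, either when it is called explicitly or when the object is garbage collected. The sketch below is an assumption about how that could look for a handle like the one described in this issue; `FinalizedLocalCopy` and `_cleanup` are hypothetical names.

```python
import os
import tempfile
import weakref


def _cleanup(file_obj, local_path):
    # Must not reference the owning object, or it could never be collected.
    file_obj.close()
    if os.path.exists(local_path):
        os.remove(local_path)


class FinalizedLocalCopy:
    """Hypothetical handle whose local copy is removed on close or on collection."""

    def __init__(self, local_path):
        self._file = open(local_path, "rb")
        self._finalizer = weakref.finalize(self, _cleanup, self._file, local_path)

    def read(self, size=-1):
        return self._file.read(size)

    def close(self):
        # Calling the finalizer runs _cleanup once; later calls do nothing.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

Explicit `close()` remains the preferred path; the finalizer is only a safety net for callers that never close the handle.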
