It appears that files opened for read using pyarrow.parquet.read_table (and therefore pyarrow.parquet.ParquetDataset) are not explicitly closed.
This seems to be the case for both use_legacy_dataset=True and use_legacy_dataset=False. The files don't remain open at the OS level (verified using lsof), but they do seem to rely on the Python garbage collector to be closed.
My use case is that I'd like to use a custom fsspec filesystem that interfaces with an S3-like API. It handles the remote download of the Parquet file and passes pyarrow a handle to a temporary file downloaded locally. The filesystem then waits for an explicit close() or __exit__() call on that handle before cleaning up the temp file.
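To make the use case concrete, here is a minimal stdlib-only sketch of such a handle. The names TempFileHandle and download_to_temp are hypothetical and stand in for the custom filesystem; they are not part of fsspec or pyarrow. The point is only that the temp file is removed when close() runs, so cleanup timing depends on whoever holds the handle calling close() explicitly.

```python
import io
import os
import tempfile


class TempFileHandle(io.FileIO):
    """Hypothetical handle: a local temp copy that is deleted on close()."""

    def __init__(self, path):
        super().__init__(path, "rb")
        self._path = path

    def close(self):
        # Close the descriptor first, then remove the temporary copy once.
        already_closed = self.closed
        super().close()
        if not already_closed and os.path.exists(self._path):
            os.remove(self._path)


def download_to_temp(payload: bytes) -> str:
    """Stand-in for the remote download: write bytes to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".parquet")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return path


# Placeholder bytes only; this is not a valid Parquet file.
path = download_to_temp(b"PAR1 placeholder bytes")
handle = TempFileHandle(path)
data = handle.read()

exists_before_close = os.path.exists(path)
handle.close()  # the explicit call the filesystem is waiting for
exists_after_close = os.path.exists(path)
```

If the consumer never calls close(), the temp file lingers until the handle is garbage collected, which is the behaviour described above.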
Antoine Pitrou / @pitrou:
Thanks for the report. It seems that, when a file or directory path is given (as opposed to an open file object), Arrow should explicitly close all files it opens by itself.
Some of this may be in the C++ dataset layer, some of this in the Python Parquet wrapper.
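As a rough illustration of that ownership rule (not Arrow's actual implementation), the sketch below uses a hypothetical OwningReader class: it closes the file only when it opened the file itself from a path, and leaves a caller-supplied file object open.

```python
import io
import os
import tempfile


class OwningReader:
    """Hypothetical reader: closes the source only if it opened it itself."""

    def __init__(self, source):
        if isinstance(source, (str, os.PathLike)):
            # A path was given, so this reader opens and owns the file.
            self._file = open(source, "rb")
            self._owns_file = True
        else:
            # An open file object was given; the caller keeps ownership.
            self._file = source
            self._owns_file = False

    def read(self):
        return self._file.read()

    def close(self):
        if self._owns_file and not self._file.closed:
            self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"example")

# Path given: the reader closes the file it opened.
with OwningReader(tmp_path) as reader:
    owned_file = reader._file
    payload = reader.read()

# File object given: the reader leaves it open for the caller.
caller_file = io.BytesIO(b"example")
with OwningReader(caller_file) as reader:
    reader.read()

os.remove(tmp_path)
```

After both blocks run, owned_file is closed while caller_file is still open, which is the split between "opened by itself" and "open file object" mentioned above.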
In pyarrow.parquet.ParquetFile, we indeed don't close the file, and there is no close method to do so. The Parquet reader seems to get a RandomAccessFile handle, created as a ReadableFile that opens the file (by creating an OSFile). The C++ ReadableFile also doesn't seem to have a public method to close it. There is a private DoClose; should that be made public so that higher layers can ensure the ReadableFile is closed after use?
Environment: fsspec 2021.4.0
Reporter: Richard Kimoto
Assignee: Miles Granger / @milesgranger
Related issues:
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-13763. Please see the migration documentation for further details.