When I read in a Parquet file using pyarrow.parquet.read_table, the files opened for reading don't seem to be closed.
Is there a way to explicitly close these opened files?
I looked at pyarrow.parquet.ParquetDataset as well, and there doesn't seem to be a way to force the closure of the files it opens for reading.
Here's my use case:
I have a custom fsspec filesystem that I've created to interface with an S3-like API. When open is called, the filesystem downloads the remote file locally and returns a custom file handle like this one.
The handle relies on __exit__ or close being called to clean up the local file, which never happens after reading with pyarrow.parquet.
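A minimal stdlib-only sketch of the kind of file handle described above (hypothetical class name, not the actual filesystem code): it caches a local copy on "download" and removes it only when close or __exit__ is actually called.

```python
import os
import tempfile


class LocalCachingFile:
    """Hypothetical sketch: wraps a locally cached copy of a remote file
    and deletes the copy when the handle is closed."""

    def __init__(self, data: bytes):
        # Simulate the download: write the remote bytes to a local temp file.
        fd, self.local_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        self._fh = open(self.local_path, "rb")

    def read(self, *args, **kwargs):
        return self._fh.read(*args, **kwargs)

    def close(self):
        # Clean-up only runs if the caller actually closes the handle.
        if not self._fh.closed:
            self._fh.close()
        if os.path.exists(self.local_path):
            os.remove(self.local_path)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

If a reader never calls close() or uses the handle in a with block, the local copy is never removed, which is exactly the leak reported here.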
The ParquetDataset uses ParquetDatasetPiece to read the file(s), and that in turn uses pq.ParquetFile to open and read the actual Parquet file. And indeed, it seems we open but never close the file there. ParquetDatasetPiece also has an open_file_func that uses the filesystem's open method, but here too it is called as a plain function (not in a with context that would close the file again), and we never explicitly close the file.
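The leak pattern described above — an opener called as a plain function with no matching close — can be sketched with stdlib file handles (hypothetical function names, not pyarrow's actual code):

```python
import os
import tempfile


def leaky_read(path, open_file_func=open):
    # Mimics the reported behavior: the opener is called as a plain
    # function and the resulting handle is never explicitly closed.
    f = open_file_func(path)
    return f.read()


def safe_read(path, open_file_func=open):
    # Using the handle as a context manager guarantees close() runs.
    with open_file_func(path) as f:
        return f.read()


# Keep references to opened handles so we can observe whether
# close() was ever called on them.
opened = []


def tracking_open(path):
    f = open(path)
    opened.append(f)
    return f
```

With the tracking opener, leaky_read leaves the last handle's closed flag False until the garbage collector eventually reclaims it, while safe_read closes it deterministically.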
For pq.read_table, which version of pyarrow are you using? (Or can you try with use_legacy_dataset=False to check with the newer implementation that doesn't use pq.ParquetFile?)
I see the issue when directly using pq.ParquetFile (checking with lsof that the file is still open), but I don't directly see it with pq.read_table or pq.ParquetDataset using a simple example.
Thanks, I've filed a JIRA (ARROW-13763) with some minimal code attached to demonstrate how the files are never closed.
I've tried with both use_legacy_dataset=False and True. In both cases the files opened for reading are never explicitly closed (via close() or __exit__()).
However, they don't show as open in lsof; I'm fairly sure the Python garbage collector is cleaning up the opened files.
Currently I have a workaround: adding a __del__() method to my filesystem's file object to catch these. But I've read that the use of __del__ is discouraged.
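As an alternative to __del__, the stdlib's weakref.finalize gives the same "clean up when the object is collected" safety net without __del__'s pitfalls (it runs at most once, and also at interpreter exit). A hedged sketch, with hypothetical names:

```python
import os
import tempfile
import weakref


class CleanupFile:
    """Hypothetical workaround sketch: ensure the local copy is removed
    even if the consumer never calls close()."""

    def __init__(self, data: bytes):
        fd, self.local_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        self._fh = open(self.local_path, "rb")
        # Registered callback must not reference self, or the object
        # would be kept alive; pass the handle and path explicitly.
        self._finalizer = weakref.finalize(
            self, self._cleanup, self._fh, self.local_path
        )

    @staticmethod
    def _cleanup(fh, path):
        if not fh.closed:
            fh.close()
        if os.path.exists(path):
            os.remove(path)

    def read(self):
        return self._fh.read()

    def close(self):
        # Calling the finalizer early doubles as the explicit close;
        # it is a no-op if already run.
        self._finalizer()
```

If pyarrow never closes the handle, the finalizer still fires when the garbage collector reclaims the object, removing the cached local file.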