-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix writing to local filesystem #301
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the need for this, but I'm reluctant to support this. You could pre-create the directories outside of PyIceberg if possible.
pyiceberg/io/pyarrow.py
Outdated
@@ -288,6 +288,8 @@ def create(self, overwrite: bool = False) -> OutputStream: | |||
try: | |||
if not overwrite and self.exists() is True: | |||
raise FileExistsError(f"Cannot create file, already exists: {self.location}") | |||
# Parent directories must be created first in certain file systems, such as the LocalFileSystem. | |||
self._filesystem.create_dir(os.path.dirname(self._path), recursive=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is typically something that we try to avoid. Iceberg is designed to work with object stores, and those don't have a notion of directories. One recommendation is even to disallow moves and listing of directories. One thing is also creating a directory. I'm not sure what the behavior is for the Arrow S3 implementation. Since some of the implementations still make a call(s):
- For example, a list operation to check if the directory is there
- For example, they touch a small file under the prefix to indicate that the path should be created.
Another concern is that we currently do this in Arrow, we also would need to do this for other implementations to avoid discrepancies. The concept of the FileIO is that you easily can swap them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko thanks for the review.
I agree with the above. The Arrow FileIO implementation might not be the best place to implement this behavior. So far both of the supported FS implementations (ARROW_FILE_IO
and FSSPEC_FILE_IO
) are failing to write to the local file system.
I want to make writes work for the local file system.
Looking at the Java side, there is a LocalOutputFile
implementation which implements the behavior for creating parent directories.
Maybe we can implement a new FileIO implementation and make that the preferred implementation for the file://
scheme.
iceberg-python/pyiceberg/io/__init__.py
Line 278 in 4cf1f35
"file": [ARROW_FILE_IO], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kevinjqliu That sounds like a reasonable alternative to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a bit more about it, maybe we should still use the ArrowFileSystem in this new FileIO to make sure that we can read the table into an Arrow pa.Table
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko I updated the PR for both PyArrow and FsSpec implementations
752b63f
to
1c480a5
Compare
5ea2ceb
to
5c77796
Compare
5c77796
to
57d68f2
Compare
@@ -173,6 +173,7 @@ def _adlfs(properties: Properties) -> AbstractFileSystem: | |||
|
|||
|
|||
SCHEME_TO_FS = { | |||
"": _file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrow defaults scheme to file
when no scheme is present.
iceberg-python/pyiceberg/io/pyarrow.py
Line 322 in 8464d71
return "file", uri.netloc, os.path.abspath(location) |
we essentially do the same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kevinjqliu for working on this, elegant solution! LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @kevinjqliu
Issue #299
This PR changes the behavior of both
PyArrow
andFsSpec
file systems. When writing to the local file system, parent directories will be created first before writing to the file. Previously, aFileNotFoundError
error is thrown when the parent directories are missing.Testing
Added tests to
tests/io/test_fsspec.py
andtests/io/test_pyarrow.py
Added
test_append_table
test to write to the local filesystem, to illustrate the issue. It fails without the changes in this PR.Also added the
catalog_sqlite_fsspec
catalog which tests writing to the local filesystem usingFSSPEC_FILE_IO