Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix writing to local filesystem #301

Merged
merged 5 commits into from
Jan 29, 2024

Conversation

kevinjqliu
Copy link
Contributor

@kevinjqliu kevinjqliu commented Jan 24, 2024

Issue #299

This PR changes the behavior of both PyArrow and FsSpec file systems. When writing to the local file system, parent directories will be created first before writing to the file. Previously, a FileNotFoundError error is thrown when the parent directories are missing.

Testing

Added tests to tests/io/test_fsspec.py and tests/io/test_pyarrow.py
Added test_append_table test to write to the local filesystem, to illustrate the issue. It fails without the changes in this PR.
Also added the catalog_sqlite_fsspec catalog which tests writing to the local filesystem using FSSPEC_FILE_IO

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the need for this, but I'm reluctant to support this. You could pre-create the directories outside of PyIceberg if possible.

@@ -288,6 +288,8 @@ def create(self, overwrite: bool = False) -> OutputStream:
try:
if not overwrite and self.exists() is True:
raise FileExistsError(f"Cannot create file, already exists: {self.location}")
# Parent directories must be created first in certain file systems, such as the LocalFileSystem.
self._filesystem.create_dir(os.path.dirname(self._path), recursive=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is typically something that we try to avoid. Iceberg is designed to work with object stores, and those don't have a notion of directories. One recommendation is even to disallow moves and listing of directories. One thing is also creating a directory. I'm not sure what the behavior is for the Arrow S3 implementation. Since some of the implementations still make a call(s):

  • For example, a list operation to check if the directory is there
  • For example, they touch a small file under the prefix to indicate that the path should be created.

Another concern is that we currently do this in Arrow, we also would need to do this for other implementations to avoid discrepancies. The concept of the FileIO is that you easily can swap them out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko thanks for the review.

I agree with the above. The Arrow FileIO implementation might not be the best place to implement this behavior. So far both of the supported FS implementations (ARROW_FILE_IO and FSSPEC_FILE_IO) are failing to write to the local file system.

I want to make writes work for the local file system.

Looking at the Java side, there is a LocalOutputFile implementation which implements the behavior for creating parent directories.

Maybe we can implement a new FileIO implementation and make that the preferred implementation for the file:// scheme.

"file": [ARROW_FILE_IO],

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kevinjqliu That sounds like a reasonable alternative to me 👍

Copy link
Contributor

@Fokko Fokko Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more about it, maybe we should still use the ArrowFileSystem in this new FileIO to make sure that we can read the table into an Arrow pa.Table.

Copy link
Contributor Author

@kevinjqliu kevinjqliu Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko I updated the PR for both PyArrow and FsSpec implementations

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/fix-write-local-fs branch from 752b63f to 1c480a5 Compare January 26, 2024 01:20
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/fix-write-local-fs branch from 5ea2ceb to 5c77796 Compare January 26, 2024 02:12
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/fix-write-local-fs branch from 5c77796 to 57d68f2 Compare January 26, 2024 02:23
@@ -173,6 +173,7 @@ def _adlfs(properties: Properties) -> AbstractFileSystem:


SCHEME_TO_FS = {
"": _file,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrow defaults scheme to file when no scheme is present.

return "file", uri.netloc, os.path.abspath(location)

we essentially do the same here

@kevinjqliu kevinjqliu requested a review from Fokko January 26, 2024 02:35
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kevinjqliu for working on this, elegant solution! LGTM

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @kevinjqliu

@Fokko Fokko merged commit f66e365 into apache:main Jan 29, 2024
6 checks passed
@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Jan 29, 2024
@kevinjqliu kevinjqliu deleted the kevinjqliu/fix-write-local-fs branch January 30, 2024 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants