Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_parquet with path not ending in a slash writes to a file instead of a directory since v36 #9684

Closed
progval opened this issue Mar 18, 2024 · 3 comments · Fixed by #13079
Closed
Assignees
Labels
bug Something isn't working

Comments

@progval
Copy link
Contributor

progval commented Mar 18, 2024

Describe the bug

Dataframe::to_parquet takes a path as string, and used to create a directory and write files to it, even if it did not end with /. Starting with v36, it writes to a single file if the path does not end with /.

To Reproduce

(using the Python library)

from pathlib import Path
import shutil

import datafusion
import pyarrow.dataset

target_path = Path("/tmp/") / "dataset"

# clean up any existing data at /tmp/dataset
if target_path.is_file():
    target_path.unlink()
elif target_path.is_dir():
    shutil.rmtree(target_path)
assert not target_path.exists()

ctx = datafusion.SessionContext()

ctx.sql("SELECT 1").write_parquet(str(target_path))

assert target_path.exists()

if target_path.is_file():
    print(f"{target_path} is a file")
else:
    print(
        f"{target_path} is a directory with entries:",
        ", ".join(map(str, target_path.iterdir()))
    )

# clean up /tmp/dataset
if target_path.is_file():
    target_path.unlink()
elif target_path.is_dir():
    shutil.rmtree(target_path)
assert not target_path.exists()

v36 behavior:

/tmp/dataset is a file

Expected behavior

v35 behavior:

/tmp/dataset is a directory with entries: /tmp/dataset/u6mIvvxwwFbjo17Q_0.parquet

Additional context

It is common to use the pathlib module in Python when working with paths, and it always strips the trailing /.

@progval progval added the bug Something isn't working label Mar 18, 2024
@alamb
Copy link
Contributor

alamb commented Mar 19, 2024

Thank you for the report @progval

I believe this was an intentional change (cc @devinjdangelo) in order to distinguish writing files and parititioned datasets

Given I can see that the default behavior may not be ideal, perhaps we can add a configuration setting that controls how non-existent paths that don't end with / are handled

@devinjdangelo
Copy link
Contributor

Given I can see that the default behavior may not be ideal, perhaps we can add a configuration setting that controls how non-existent paths that don't end with / are handled

We previously had single_file_output which was a statement level config that determined if the path should be treated as a file or directory. We intentionally removed this in #9041 in favor of inference based on the path ending in '/'.

We could add a session level config to control how a path is interpreted as @alamb suggests. Perhaps we could also improve the inference logic by additionally checking for the presence of a valid file extension before concluding a path is a file. E.g.:

  1. tmp/dataset/ -> is a folder since it ends in /
  2. tmp/dataset -> is still a folder since it does not end in / but has no valid file extension
  3. tmp/file.parquet -> is a file since it does not end in / and has a valid file extension .parquet
  4. tmp/file.parquet/ -> is a folder since it ends in /

@dhegberg dhegberg removed their assignment Oct 17, 2024
@dhegberg
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants