[Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 #43574
Comments
Can you share the schema of the file here?
I suspect your Parquet file has a "source_id" column with type string; see the reproduction below:
import os
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Setup
os.mkdir("dataset_root")
os.mkdir("dataset_root/source_id=9319")
tbl = pa.table(
pd.DataFrame(
{"source_id": ["9319", "9319", "9319"], "x": np.random.randint(0, 10, 3)}
)
)
pq.write_table(tbl, "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# This reproduces the issue
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
# dataset = ParquetDataset(
# ^^^^^^^^^^^^^^^
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
# self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
# return _filesystem_dataset(source, **kwargs)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
# return factory.finish(schema)
# ^^^^^^^^^^^^^^^^^^^^^^
# File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
# File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
# File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>
I've also confirmed this bug on the local filesystem as well as via cloud storage. And a good workaround is to pass
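One workaround that avoids the merge error (a sketch, assuming the goal is simply to disable hive partition inference for the single-file read; partitioning defaults to "hive" in pq.read_table) is passing partitioning=None:
import pyarrow.parquet as pq
# With partition inference disabled, only the columns physically stored in the
# file are read, so there is no path-derived "source_id" field to merge.
tbl = pq.read_table(
    "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
    partitioning=None,
)
print(tbl.schema)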
FWIW we have other files with alphanumerics in that field as well.
Thanks. Some thoughts:
You have a few workarounds:
Is there a reason why (1) might not work for you?
I'll chime in here, since I believe this is a behavior change that was unintentionally introduced in pyarrow 17.0 via #39438, specifically the two lines under discussion here: https://github.com/apache/arrow/pull/39438/files#r1469251517

Previously, specifying a single file would always just return the contents of the file; now it will include any hive partition metadata columns that happen to be detected in the path (see the sketch after this comment). In particular, this is frustrating because it behaves differently for files on the local filesystem (the file is a

Other bits that made this unexpected and/or could be improved if it's considered desirable to keep this change:
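A minimal sketch of that single-file behavior change (assuming a local file whose path contains a hive-style directory and which does not itself contain the partition column):
import os
import pyarrow as pa
import pyarrow.parquet as pq
os.makedirs("demo_root/source_id=9319", exist_ok=True)
# The file itself contains only an "x" column.
pq.write_table(pa.table({"x": [1, 2, 3]}), "demo_root/source_id=9319/data.parquet")
tbl = pq.read_table("demo_root/source_id=9319/data.parquet")
# pyarrow 16.1: ['x']
# pyarrow 17.0: ['x', 'source_id'] -- the partition column is detected from the path
print(tbl.column_names)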
Describe the bug, including details regarding any error messages, version, and platform.
In pyarrow 17.0.0, accessing a parquet file with parquet.read_table throws an incompatible-types exception:
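A sketch of what that call looks like (the bucket and object prefix below are hypothetical placeholders):
import pyarrow.parquet as pq
# Hypothetical GCS path; note the hive-style "source_id=..." directory segment.
path = "gs://my-bucket/li191r/source_id=9319/li191r_9319_2023-01-02.parquet"
tbl = pq.read_table(path)
# pyarrow 17.0.0 raises:
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible
# types: string vs dictionary<values=int32, indices=int32, ordered=0>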
But accessing via dataset works:
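A sketch of that dataset-based read (same hypothetical path); it presumably succeeds because pyarrow.dataset.dataset() defaults to partitioning=None, so no source_id field is inferred from the path:
import pyarrow.dataset as ds
dset = ds.dataset(
    "gs://my-bucket/li191r/source_id=9319/li191r_9319_2023-01-02.parquet",
    format="parquet",
)
tbl = dset.to_table()  # only the columns physically stored in the file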
When I revert to pyarrow 16.1.0, both methods work.
I've tried using the fs implementation to list the bucket in 17.0.0 and that works fine; I have no idea what is wrong here:
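A sketch of such a listing, assuming pyarrow.fs.GcsFileSystem and the same hypothetical bucket:
from pyarrow import fs
gcs = fs.GcsFileSystem()
# Recursively list objects under the prefix; this works fine in 17.0.0.
for info in gcs.get_file_info(fs.FileSelector("my-bucket/li191r", recursive=True)):
    print(info.path, info.size)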
If I download the file locally and open it, it works. This same error also occurs in pandas > 2.0.0 with pandas.read_parquet()
Component(s)
Python