[Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 #43574

Open
brokenjacobs opened this issue Aug 5, 2024 · 6 comments

@brokenjacobs

brokenjacobs commented Aug 5, 2024

Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow 17.0.0

When accessing a parquet file using parquet.read_table an incompatible types exception is thrown:

>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

But accessing via dataset works:

>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]
>>>

When I revert to pyarrow 16.1.0 both methods work:

>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]

I've tried using the fs implementation to list the bucket in 17.0.0, and that works fine; I have no idea what is wrong here:

>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>, <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>,  (and so on) ]

If I download the file locally and open it, it works. The same error also occurs in pandas > 2.0.0 with pandas.read_parquet().

Component(s)

Python

@brokenjacobs brokenjacobs changed the title [Python] Accessing parquet files with read_parquet in google cloud storage fails, but works with dataset [Python] Accessing parquet files with read_parquet in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 Aug 6, 2024
@brokenjacobs brokenjacobs changed the title [Python] Accessing parquet files with read_parquet in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 [Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 Sep 4, 2024
@amoeba
Member

amoeba commented Sep 5, 2024

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.

@amoeba
Member

amoeba commented Sep 5, 2024

I suspect your Parquet file has a "source_id" column with type string; see the reproduction below:

import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Setup
os.mkdir("dataset_root")
os.mkdir("dataset_root/source_id=9319")
tbl = pa.table(
    pd.DataFrame(
        {"source_id": ["9319", "9319", "9319"], "x": np.random.randint(0, 10, 3)}
    )
)
pq.write_table(tbl, "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")

# This reproduces the issue
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
#     dataset = ParquetDataset(
#               ^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
#     self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
#     return _filesystem_dataset(source, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
#     return factory.finish(schema)
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
#   File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

@brokenjacobs
Author

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.

source_id: string
site_id: string
readout_time: timestamp[ms, tz=UTC]
voltage: float
kafka_key: string
kakfa_ts_type: uint8
kafka_ts: timestamp[ms]
kafka_partition: uint8
kafka_offset: uint64
kafka_topic: string
ds: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1502

I've also confirmed this bug on the local filesystem as well as via cloud storage. A good workaround is to pass partitioning=None to the read_table call.
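
For reference, a minimal sketch of that workaround (using the same redacted GCS path as above; the path is illustrative, not a real bucket):

import pyarrow.parquet as pq

# Disabling partition discovery makes read_table use the source_id column
# stored in the file instead of inferring it from the path.
t = pq.read_table(
    'gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning=None,
)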

@brokenjacobs
Author

FWIW we have other files with alphanumerics in that field as well.

@amoeba
Member

amoeba commented Sep 5, 2024

Thanks. Some thoughts:

  • read_table errors in your original code where ds.dataset does not because (1) read_table defaults to Hive partitioning and ds.dataset doesn't, and (2) your file contains a source_id field while its file path also includes source_id=X as a path component. With partitioned datasets, partition fields are usually omitted from the files themselves, and I'm not sure what the behavior should be if the user leaves them in. The current behavior seems to be that the reader ignores the field in the file and trusts the partition field value in the file path.
  • ds.dataset succeeds because it's defaulting to Directory partitioning, so it totally ignores the Hive partition scheme in your file path. You can make the ds.dataset call fail by specifying Hive partitioning (though with a slightly different error); see the sketch below.
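
A minimal sketch of that second point, against the repro file written in my earlier comment (the exact error message may differ from the read_table one):

import pyarrow.dataset as ds

# Explicitly requesting Hive partitioning on the single-file path: the
# partition field inferred from "source_id=9319" clashes with the string
# source_id column stored in the file, so this should also raise a
# schema-merge error.
ds.dataset(
    "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
    partitioning="hive",
)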

You have a few workarounds:

  1. Remove the source_id field from your Parquet files. This is what I would do.
  2. Manually specify a schema:
    schm = pa.schema([pa.field("source_id", pa.string())])
    pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", schema=schm)
  3. Manually specify partitioning=None

Is there a reason why (1) might not work for you?

@bkurtz

bkurtz commented Oct 30, 2024

I'll chime in here, since I believe this is a behavior change that was unintentionally introduced in pyarrow 17.0 via #39438, specifically the two lines under discussion here: https://github.com/apache/arrow/pull/39438/files#r1469251517

Previously, specifying a single file would always just return the contents of the file; now it will include any Hive partition metadata columns that happen to be detected in the path.

In particular, this is frustrating because it behaves differently for files on the local filesystem (the file is a BufferedReader rather than a path at that point, so Hive metadata in the leading portion of the path is not interpreted) vs. on cloud storage such as S3 or GCS.

Other bits that made this unexpected and/or could be improved if it's considered desirable to keep this change:

  • the behavior change was not noted in the release notes (presumably because it was unintentional); I had to consult the source code to find the offending change
  • pandas uses read_table as its entrypoint into pyarrow, so pandas.read_parquet hits the same error (see the sketch below)
  • the docs for read_table don't make it particularly clear that calling with partitioning=None is an option
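
Since the pyarrow engine in pandas forwards extra keyword arguments to read_table, the partitioning=None workaround should also be reachable from pandas. A minimal sketch, assuming kwargs are forwarded as documented (the GCS path is the redacted one from the report):

import pandas as pd

# pandas' pyarrow engine passes extra kwargs through to
# pyarrow.parquet.read_table, so partitioning=None should disable the
# Hive partition discovery that triggers the schema-merge error.
df = pd.read_parquet(
    "gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet",
    engine="pyarrow",
    partitioning=None,
)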
