[Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 #43574

Open
brokenjacobs opened this issue Aug 5, 2024 · 6 comments

@brokenjacobs

brokenjacobs commented Aug 5, 2024

Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow 17.0.0

When accessing a parquet file using parquet.read_table an incompatible types exception is thrown:

>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

But accessing via dataset works:

>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]
>>>

When I revert to pyarrow 16.1.0 both methods work:

>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]

I've tried using the fs implementation to list the bucket in 17.0.0, and that works fine; I have no idea what is wrong here:

>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>, <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>,  (and so on) ]

If I download the file locally and open it, it works. The same error also occurs in pandas > 2.0.0 with pandas.read_parquet().

Component(s)

Python

@brokenjacobs brokenjacobs changed the title [Python] Accessing parquet files with read_parquet in google cloud storage fails, but works with dataset [Python] Accessing parquet files with read_parquet in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 Aug 6, 2024
@brokenjacobs brokenjacobs changed the title [Python] Accessing parquet files with read_parquet in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 [Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0 Sep 4, 2024
@amoeba
Member

amoeba commented Sep 5, 2024

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.

@amoeba
Member

amoeba commented Sep 5, 2024

I suspect your Parquet file has a "source_id" column with type string; see the reproduction below:

import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Setup
os.mkdir("dataset_root")
os.mkdir("dataset_root/source_id=9319")
tbl = pa.table(
    pd.DataFrame(
        {"source_id": ["9319", "9319", "9319"], "x": np.random.randint(0, 10, 3)}
    )
)
pq.write_table(tbl, "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")

# This reproduces the issue
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
#     dataset = ParquetDataset(
#               ^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
#     self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
#     return _filesystem_dataset(source, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
#     return factory.finish(schema)
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
#   File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

@brokenjacobs
Author

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.

source_id: string
site_id: string
readout_time: timestamp[ms, tz=UTC]
voltage: float
kafka_key: string
kakfa_ts_type: uint8
kafka_ts: timestamp[ms]
kafka_partition: uint8
kafka_offset: uint64
kafka_topic: string
ds: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1502

I've also confirmed this bug on the local filesystem as well as via cloud storage. A good workaround is to pass partitioning=None to the read_table call.
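
For reference, a minimal sketch of that workaround (using the same redacted GCS path as above; the path is illustrative, not a real bucket):

import pyarrow.parquet as pq

# Disabling partition discovery makes read_table use the source_id column
# stored in the file instead of inferring it from the path.
t = pq.read_table(
    'gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning=None,
)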

@brokenjacobs
Author

FWIW we have other files with alphanumerics in that field as well.

@amoeba
Member

amoeba commented Sep 5, 2024

Thanks. Some thoughts:

  • read_table errors in your original code where ds.dataset does not because (1) read_table defaults to Hive partitioning and ds.dataset doesn't, and (2) your file contains a source_id field while its file path also includes source_id=X as a path component. With partitioned datasets, partition fields are usually omitted from the files themselves, and I'm not sure what the behavior should be if the user leaves them in. The current behavior seems to be that the reader ignores the field in the file and trusts the partition field value in the file path.
  • ds.dataset succeeds because it's defaulting to Directory partitioning, so it totally ignores the Hive partition scheme in your file path. You can make the ds.dataset call fail by specifying Hive partitioning (though with a slightly different error); see the sketch below.
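
A minimal sketch of that second point, against the repro file written in my earlier comment (the exact error message may differ from the read_table one):

import pyarrow.dataset as ds

# Explicitly requesting Hive partitioning on the single-file path: the
# partition field inferred from "source_id=9319" clashes with the string
# source_id column stored in the file, so this should also raise a
# schema-merge error.
ds.dataset(
    "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
    partitioning="hive",
)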

You have a few workarounds:

  1. Remove the source_id field from your Parquet files. This is what I would do.
  2. Manually specify a schema:
    schm = pa.schema([pa.field("source_id", pa.string())])
    pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", schema=schm)
  3. Manually specify partitioning=None

Is there a reason why (1) might not work for you?

@bkurtz

bkurtz commented Oct 30, 2024

I'll chime in here, since I believe this is a behavior change that was unintentionally introduced in pyarrow 17.0 via #39438, specifically the two lines under discussion here: https://github.com/apache/arrow/pull/39438/files#r1469251517

Previously, specifying a single file would always just return the contents of the file; now it will include any Hive partition metadata columns that happen to be detected in the path.

In particular, this is frustrating because it behaves differently for files on the local filesystem (the file is a BufferedReader rather than a path at that point, so Hive metadata in the leading portion of the path is not interpreted) vs. on cloud storage such as S3 or GCS.

Other bits that made this unexpected and/or could be improved if it's considered desirable to keep this change:

  • the behavior change was not noted in the release notes (presumably because it was unintentional); I had to consult the source code to find the offending change
  • pandas uses read_table as its entrypoint into pyarrow, so pandas.read_parquet hits the same error (see the sketch below)
  • the docs for read_table don't make it particularly clear that calling with partitioning=None is an option
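
Since the pyarrow engine in pandas forwards extra keyword arguments to read_table, the partitioning=None workaround should also be reachable from pandas. A minimal sketch, assuming kwargs are forwarded as documented (the GCS path is the redacted one from the report):

import pandas as pd

# pandas' pyarrow engine passes extra kwargs through to
# pyarrow.parquet.read_table, so partitioning=None should disable the
# Hive partition discovery that triggers the schema-merge error.
df = pd.read_parquet(
    "gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet",
    engine="pyarrow",
    partitioning=None,
)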
