array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

Maxsparrow · 2023-11-02T15:22:41Z

Describe the bug

When using the array_agg function in a query on a DataFusion table generated by a pyarrow Dataset on parquet files, the .to_pandas() and .to_polars() functions error with the below stacktrace:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[29], line 1
----> 1 results.to_pandas()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/table.pxi:4057, in pyarrow.lib.Table.from_batches()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
double_field: list<item: double>
string_field: list<item: string>
vs
double_field: list<item: double> not null
string_field: list<item: string> not null

To Reproduce

Using latest pyarrow and datafusion (other versions not tested):
pyarrow==14.0.0
datafusion==32.0.0

import pandas as pd
df = pd.DataFrame({'double_field': [1.5, 2.5, 3.5], 'string_field': ['a', 'b', 'c']})
df.to_parquet("test_array_agg.parquet")

from pyarrow import dataset
test_dataset = dataset.dataset("test_array_agg.parquet")

from datafusion import SessionContext
ctx = SessionContext()
ctx.register_dataset("test_table", test_dataset)
tbl = ctx.table("test_table")

query = """
SELECT
    array_agg("double_field" ORDER BY "string_field") as "double_field",
    array_agg("string_field" ORDER BY "string_field") as "string_field"
FROM test_table
"""
results = ctx.sql(query)

# Result schema is list<item: double>
print(results.schema())

# First pyarrow.RecordBatch schema is list<item: double> not null
print(results.collect()[0].schema)

# ERRORS with ArrowInvalid
results.to_pandas()

Expected behavior

The .to_pandas() method should return a DataFrame with arrays in each cell.

Additional context

The query works and returns the expected result if you wrap array_agg in additional calls to make_array and array_extract:

SELECT
    array_extract(make_array(array_agg("double_field") ORDER BY "string_field"), 1) as "double_field",
    array_extract(make_array(array_agg("string_field") ORDER BY "string_field"), 1) as "string_field"
FROM test_table

Returns from .to_pandas() with:

In this case both results.schema() and results.collect()[0].schema are list<item: double>/list<item: string>

The text was updated successfully, but these errors were encountered:

alamb · 2023-11-03T16:22:35Z

fyi @jayzhan211

alamb · 2023-11-03T16:23:42Z

Thank you for the report @Maxsparrow -- I have added this to our general epic for array implementation #6980

Maxsparrow added the bug Something isn't working label Nov 2, 2023

alamb mentioned this issue Nov 3, 2023

[Epic] General ticket for the concept of the practical implementation of ARRAY #6980

Closed

27 tasks

jayzhan211 mentioned this issue Nov 5, 2023

Fix ArrayAgg schema mismatch issue #8055

Merged

alamb closed this as completed in #8055 Nov 9, 2023

jayzhan211 mentioned this issue Jun 23, 2024

Add input_nullable for UDAF args StateField and Accumulator #11063

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

Maxsparrow commented Nov 2, 2023

alamb commented Nov 3, 2023

alamb commented Nov 3, 2023

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

Comments

Maxsparrow commented Nov 2, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Nov 3, 2023

alamb commented Nov 3, 2023