Describe the bug
When using the array_agg function in a query on a DataFusion table backed by a pyarrow Dataset over parquet files, the .to_pandas() and .to_polars() methods fail with the stack trace below:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[29], line 1
----> 1 results.to_pandas()
File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/table.pxi:4057, in pyarrow.lib.Table.from_batches()
File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: Schema at index 0 was different:
double_field: list<item: double>
string_field: list<item: string>
vs
double_field: list<item: double> not null
string_field: list<item: string> not null
To Reproduce
Using latest pyarrow and datafusion (other versions not tested):
pyarrow==14.0.0
datafusion==32.0.0
import pandas as pd
df = pd.DataFrame({'double_field': [1.5, 2.5, 3.5], 'string_field': ['a', 'b', 'c']})
df.to_parquet("test_array_agg.parquet")
from pyarrow import dataset
test_dataset = dataset.dataset("test_array_agg.parquet")
from datafusion import SessionContext
ctx = SessionContext()
ctx.register_dataset("test_table", test_dataset)
tbl = ctx.table("test_table")
query = """
SELECT
    array_agg("double_field" ORDER BY "string_field") as "double_field",
    array_agg("string_field" ORDER BY "string_field") as "string_field"
FROM test_table
"""
results = ctx.sql(query)
# Result schema is list<item: double>
print(results.schema())
# First pyarrow.RecordBatch schema is list<item: double> not null
print(results.collect()[0].schema)
# ERRORS with ArrowInvalid
results.to_pandas()
Expected behavior
The .to_pandas() method should return a DataFrame with arrays in each cell.
Additional context
The query works and returns the expected result if you wrap array_agg in additional calls to make_array and array_extract:
Returns from .to_pandas() with:
In this case both results.schema() and results.collect()[0].schema are list<item: double> / list<item: string>.