Parquet write failure (from record batches) when data is nested two levels deep #1744
Comments
ahmedriza changed the title from "Parquet write failure when data is nested two levels deep" to "Parquet write failure (from record batches) when data is nested two levels deep" on May 24, 2022
This looks very similar to #1651, which fixed the read side; there is likely a similar issue on the write side. Thank you for the report, I'll take a look tomorrow.
Cool @tustvold. I do recall the reader-side error as well before version 14. Thanks a lot.
tustvold added commits to tustvold/arrow-rs that referenced this issue on May 26, 2022
For anyone following along, there is a PR proposing to fix this: #1746
tustvold added a commit that referenced this issue on May 27, 2022: Support writing arbitrarily nested arrow arrays (#1744)
Describe the bug
Let me introduce the schema of the data in an easily readable format (the Apache Spark pretty-print format):
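A sketch of that shape (the struct field names, `price` and `points`, are illustrative; only the three top-level columns and the `list<struct<list>>` shape matter):

```
root
 |-- id: string (nullable = true)
 |-- prices: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- points: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |-- bids: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- points: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
```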
and some sample data:
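For illustration, two rows in that shape, with `bids` null throughout (the values are made up):

```
+---+--------------------+----+
| id|              prices|bids|
+---+--------------------+----+
| t1| [{1.0, [1.1, 1.2]}]|null|
| t2|      [{2.0, [2.1]}]|null|
+---+--------------------+----+
```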
As we can see, what we have here are three columns: a UTF-8 column called `id`, and two columns called `prices` and `bids` that share the same nested schema, i.e. `list<struct<list>>`. I have deliberately left the `bids` column empty to show the bug.

The bug is that when we read Parquet with the above schema, with the `bids` column null for all rows, from Rust code into record batches and then write those record batches back to Parquet, the Parquet write fails. The failure happens at https://github.com/apache/arrow-rs/blob/master/parquet/src/file/writer.rs#L324 and is due to the fact that the `bids` column is null.

To Reproduce
Let's create the sample data with the schema depicted above, using the following Python code:
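A minimal sketch of such a script, assuming the illustrative field names from the schema above; the output path is arbitrary:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# list<struct<price: double, points: list<double>>> -- nested two levels deep
nested = pa.list_(
    pa.struct([
        ("price", pa.float64()),
        ("points", pa.list_(pa.float64())),
    ])
)

schema = pa.schema([
    ("id", pa.utf8()),
    ("prices", nested),
    ("bids", nested),
])

table = pa.table(
    {
        "id": ["t1", "t2"],
        "prices": [
            [{"price": 1.0, "points": [1.1, 1.2]}],
            [{"price": 2.0, "points": [2.1]}],
        ],
        # bids is deliberately null in every row
        "bids": [None, None],
    },
    schema=schema,
)

pq.write_table(table, "/tmp/test.parquet")
print(pq.read_table("/tmp/test.parquet"))
```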
When we run this code, we can see that a valid Parquet file is indeed produced; reading it back shows the expected table, with `bids` null in every row.
Let's now try to read the same Parquet from Rust and write it back to another Parquet file:
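A minimal sketch of that round trip, assuming the parquet crate's arrow API as of the version 14 era; the paths and batch size are arbitrary:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the file produced by the Python snippet into record batches
    let file = File::open("/tmp/test.parquet")?;
    let file_reader = SerializedFileReader::new(file)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));

    let schema = arrow_reader.get_schema()?;
    let batches: Vec<_> = arrow_reader
        .get_record_reader(1024)?
        .collect::<Result<_, _>>()?;

    // Write the batches back out; this is where the failure occurs
    let out = File::create("/tmp/test-out.parquet")?;
    let mut writer = ArrowWriter::try_new(out, Arc::new(schema), None)?;
    for batch in &batches {
        writer.write(batch)?;
    }
    writer.close()?;
    Ok(())
}
```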
This reads the Parquet file fine, but fails when writing the record batches back out as Parquet. I can see that this is due to the fact that the `bids` column is null.

Expected behavior
We should expect the record batches to be written correctly to Parquet even if a column is null for all rows.
Additional context
The issue arises due to the presence of the second level of nesting, i.e. the list nested inside the struct elements; in the illustrative schema sketched above, this part:
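```
 |    |    |-- points: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
```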
If we remove this second level of nesting, then the null `bids` column does get written. However, we expect this to work even in the presence of a second, third, etc. level of nesting, as it does with `pyarrow`.