-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cannot write a column of type DataType::List
containing a DataType::Struct
to parquet with parallel writing
#8851
Comments
I think this project is the correct place to start. We can distill your example down and see if we can get an arrow-rs only reproducer @tustvold do you have any thoughts? |
It could be something in the arrow upgrade from 47 to 48, but nothing immediately springs out to me. Edit: the DF parallel parquet writer looks to have landed around then, I wonder if it is the cause. |
We can try the following setting and see if that fixes this bug to verify if it is the cause.
Edit: I was able to reproduce the error on latest main with the above setting to true, and I can confirm that setting it to false fixes the error. |
thank you all for the quick response! For now setting the variable is a good workaround for my usecase 👍🏻. |
DataType::List
containing a DataType::Struct
to parquetDataType::List
containing a DataType::Struct
to parquet with parallel writing
@devinjdangelo set the feature Let's leave this issue open until the code is sorted but I think it is much less severe now |
Describe the bug
When trying to write a column that is of type List that contains a Struct, the parquet writer throws an error
Error: ParquetError(General("Incorrect number of rows, expected 4 != 0 rows"))
. This seems to be a regression as this works fine in datafusion v32.0.0 but not in v33 or v34. It also works usingwrite_json
instead ofwrite_parquet
Example dataframe:
To Reproduce
dependencies (working):
dependencies (broken):
example.json
main.rs
Expected behavior
The parquet writer supports writing this kind of datatype as in v32
Additional context
Maybe related to: apache/arrow-rs#1744
I found this issue trying to debug a different one that came up while trying to upgrade from v32 to v34. If the struct contains a timestamp the error instead becomes a
Error: Internal("Unable to send array to writer!")
with a source errorinternal error: entered unreachable code: cannot downcast Int64 to byte array
.An example of such a df:
I tried to debug this issue myself looking into the
arrow-rs
implementation however I didn't manage to find the relevant commit that could have changed this behavior. Also I wasn't sure if I should open the bug in this project or in thearrow-rs
project so I hope this is ok 😃.The text was updated successfully, but these errors were encountered: