Add missing call to try_push_valid for nested avro deserialization
#1248
Hi @shaeqahmed. Thank you for the PR! I am trying to understand why this is needed. My understanding is that I added #1250 in an attempt to argue that it works as is. |
Codecov Report
@@            Coverage Diff             @@
##             main    #1248      +/-   ##
==========================================
- Coverage   83.15%   83.13%   -0.02%
==========================================
  Files         359      359
  Lines       38063    38069       +6
==========================================
- Hits        31650    31649       -1
- Misses       6413     6420       +7
The current logic for structs is to lazily initialize the validity array: on the first appearance of a null, it is padded with trues followed by a single false. After this, whenever we deserialize an item that is the null variant of a union, we push_null(). However, if the struct is not null, meaning the struct is non-empty, we don't push a matching valid entry onto the validity. I believe a similar call is needed for MutableStructArray, and that's what I tried to add in this PR. I have a nested avro file that triggers this issue with the current logic and fails to deserialize because the validity is too short. I will try to share it as an example, in addition to trying to create a minimal reproducible example for you. Please let me know if this makes sense. |
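To make the mismatch concrete, here is a minimal sketch of the lazy-validity logic described above. It is a toy model using only the standard library, not arrow2's actual code: `ToyStructBuilder`, `push_value_buggy`, `push_value_fixed`, `lens` and `run` are hypothetical names invented for illustration, and a single `Vec<i64>` stands in for the struct's child arrays.

```rust
/// Toy stand-in for a struct builder; NOT arrow2's MutableStructArray.
#[derive(Default)]
struct ToyStructBuilder {
    /// One child column; its length plays the role of `values[0].len`.
    values: Vec<i64>,
    /// Validity mask, allocated lazily on the first null.
    validity: Option<Vec<bool>>,
}

impl ToyStructBuilder {
    /// Append a null slot. On the first null the mask is created and
    /// back-filled with `true` for every slot pushed before it.
    fn push_null(&mut self) {
        let earlier = self.values.len();
        self.values.push(0); // placeholder value for the null slot
        self.validity
            .get_or_insert_with(|| vec![true; earlier])
            .push(false);
    }

    /// Buggy variant: appends the value but never touches the mask.
    fn push_value_buggy(&mut self, x: i64) {
        self.values.push(x);
    }

    /// Fixed variant: also records `true` once the mask exists.
    fn push_value_fixed(&mut self, x: i64) {
        self.values.push(x);
        if let Some(validity) = self.validity.as_mut() {
            validity.push(true);
        }
    }

    /// Returns (values length, validity length if the mask is allocated).
    fn lens(&self) -> (usize, Option<usize>) {
        (self.values.len(), self.validity.as_ref().map(|v| v.len()))
    }
}

/// Pushes the sequence value, null, value and reports the final lengths.
fn run(fixed: bool) -> (usize, Option<usize>) {
    let mut builder = ToyStructBuilder::default();
    for item in [Some(1), None, Some(3)] {
        match item {
            None => builder.push_null(),
            Some(x) if fixed => builder.push_value_fixed(x),
            Some(x) => builder.push_value_buggy(x),
        }
    }
    builder.lens()
}
```

In this toy model, `run(false)` ends with three values but only two validity entries, which is the kind of length mismatch described above, while `run(true)` keeps both lengths at three.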
Update: this PR is incorrect, as a naive call to |
You are right - this is a bug and the fix is this PR 👍 Thanks for the explanation. If all elements are null, we still need to append the "true" to the StructArray. In arrow:
You have a very good understanding of Arrow and this crate :) I think this PR would just benefit from a test to confirm our hypothesis. |
Awesome, that makes sense. I'm working on the updated PR revision and test cases locally, and will put something up tonight. 👍 With this change, I can now read my deeply nested nullable structs into arrow2, but it looks like the arrow2/parquet2 parquet writer is writing corrupt files. I see that the code has some explicitly unimplemented branches, such as reading nulls from parquet, and the docs do mention the lack of proper support for deeply nested parquet types (related issue: #1222). I think we should 1) better document what is and is not currently supported in arrow2's parquet functionality, and 2) fix the above-mentioned bug so that we at least don't write incorrect files. I've already opened an issue to track this, #1249, so let's continue the discussion there. I'd appreciate it if you could take a look. Thanks for the help. |
The validity length gets out of sync with len (values[0].len) because there is no call to append a valid entry after successfully deserializing a struct item. Add this functionality for MutableStructArray, similar to how it is handled for MutableListArray.