-
Notifications
You must be signed in to change notification settings - Fork 917
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Handle arbitrarily different data in null list column rows when check…
…ing for equivalency. (#8666) The column equivalency checking code was not handling a particular corner case properly. Fundamentally, there is no requirement that the offsets or child data for null rows in two list columns to be the same. An example: ``` List<int32_t>: Length : 7 Offsets : 0, 3, 6, 8, 11, 14, 16, 19 Null count: 7 0010100 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7 List<int32_t>: Length : 7 Offsets : 0, 0, 0, 2, 2, 5, 5, 5 Null count: 7 0010100 3, 3, 5, 5, 5 ``` At first glance, these columns do not seem equivalent. However, the only two non-null rows (2 and 4) are identical: `[[3, 3], [5, 5, 5]]` The comparison code was expecting row positions to always be the same inside of child rows, but that does not have to be the case. For example, in the first column, the child row indices that we care about are `[6, 7, 11, 12, 13]`, whereas in the second column they are `[0, 1, 2, 3, 4]` The fix for this is to fundamentally change how the comparison code works so that instead of simply iterating from `0` to `size` for each column, we instead provide an explicit list of column indices that should be compared. The various compare functors now take additional `lhs_row_indices` and `rhs_row_indices` columns to reflect this. For flat hierarchies, this input is always just `[0, 1, 2, 3... size]`. However, every time we encounter a list column in the hierarchy, the rows that need to be considered for both columns can be completely and arbitrarily changed. I'm leaving this as a draft as there is a discussion point in the column _property_ comparisons that I think is worth having. Similar to the data values, one of the things the column property comparison wanted to do was simply compare `lhs.size()` to `rhs.size()`. But as we can see for the leaf columns in the above case, they are totally different. However, when we are only checking for _equivalency_ what matters is that the number of rows we are going to be comparing is the same. Similarly, the null counts cannot be compared directly. Just the null count of the rows we are explicitly comparing. As far as I can tell, this is the only way to do it, but I'm not sure it's 100% semantically in the spirit of what the column _properties_ are, since we are really checking the properties of a subset of the overall column. I left a couple of comments in the property comparator code labelled `// DISCUSSION` Note: I haven't added tests yet. Authors: - https://github.com/nvdbaranec Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - MithunR (https://github.com/mythrocks) - Jake Hemstad (https://github.com/jrhemstad) - Nghia Truong (https://github.com/ttnghia) URL: #8666
- Loading branch information
1 parent
0e59d05
commit f1fa694
Showing
26 changed files
with
1,126 additions
and
533 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.