Handle arbitrarily different data in null list column rows when checking for equivalency. #8666
…itrarily different data (offsets, values) in null rows.
Thanks @nvdbaranec. I cherry-picked your commits into my PR #8588 to test. The previously failing
UPDATE: it turned out to be a bug in the test in #8588
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #8666 +/- ##
===============================================
Coverage ? 10.53%
===============================================
Files ? 116
Lines ? 18916
Branches ? 0
===============================================
Hits ? 1993
Misses ? 16923
Partials ? 0
…o have the various expect_columns_* functions throw instead of print upon failure, allowing for use of EXPECT_THROW(...). Add tests. Couple of small fixes.
Does the Arrow spec allow for null lists to have non-zero lengths?
I believe so. https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
// if the row is valid, check that the length of the list is the same. do this
// for both the equivalency and exact equality checks.
if (lhs_valids[lhs_index] && ((lhs_offsets[lhs_index + 1] - lhs_offsets[lhs_index]) !=
I wonder if we can do something simpler, like pulling all contents of the original columns using `gather`, with the gather map being the simple sequence `[0, 1, 2, 3, ..., size - 1]`. Then we just use the existing code to compare the resulting columns (which have zero size for the nulls after gathering).
Of course, we only use `gather` if the column has nulls. I think the total overhead here is very small. And since we are running this in tests, not in production, we don't have to worry much about the performance penalty.
This is a bit of a large change, and one thing I don't particularly like about it is that it uses a function (`gather()`) which will in turn be using this function to verify itself.
namespace test {

namespace {

// expand all non-null rows in a list column into a column of child row indices.
std::unique_ptr<column> generate_child_row_indices(lists_column_view const& c,
                                                   column_view const& row_indices)
Do we need an assertion for `row_indices` not having nulls? Overkill for `cudf::test` code?
P.S. This was an informative read.
I think it's overkill here. `row_indices` is never something handed to us by the external user. It's purely internal.
auto const rhs_index = rhs_indices[index];

// check for validity match
if (lhs_valids[lhs_index] != rhs_valids[rhs_index]) { return true; }
I might have lost the plot here: Should we not consider the `offset` of `lhs` and `rhs` at this point, because they might be sliced columns?
`lhs_valids` and `rhs_valids` are iterators that take care of the offset inline.
Took a while to review. Generally +1. A couple of minor nitpicks. One spot where I wonder if we've handled sliced columns.
rerun tests
… void. Change the print_all_differences parameter to be an enum with 3 values: FIRST_ERROR, ALL_ERRORS and QUIET.
I've changed things so that the comparison functions now return a true/false value and the
rerun tests
@gpucibot merge
Uses scalar-vector-based scatter API to provide support for copy_if_else involving scalar columns. Other changes: - removes some dead code - refactoring into overloaded functions Closes #8361, depends on #8630, #8666 Authors: - Gera Shegalov (https://github.com/gerashegalov) Approvers: - https://github.com/nvdbaranec - MithunR (https://github.com/mythrocks) URL: #8588
rapidsai/cudf#8666 modified `cudf::test` APIs to accept a verbosity enum as a parameter to control output, which is backwards incompatible with the previously boolean parameter. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Mark Harris (https://github.com/harrism) - Paul Taylor (https://github.com/trxcllnt) URL: #433
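A minimal sketch of what a boolean-to-enum change of this kind looks like. This is not the actual `cudf::test` API: the enum name `verbosity` and the functions `columns_equal` and `difference_report` are hypothetical, and host `std::vector`s stand in for columns. Only the three level names (FIRST_ERROR, ALL_ERRORS, QUIET) and the true/false return value come from the commit notes in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical enum name; the three levels mirror the ones named in the PR.
enum class verbosity { FIRST_ERROR, ALL_ERRORS, QUIET };

// Hypothetical comparison helper: returns true when the inputs match and
// writes differences to `out` according to `level`.
bool columns_equal(std::vector<int> const& lhs,
                   std::vector<int> const& rhs,
                   verbosity level,
                   std::ostream& out)
{
  if (lhs.size() != rhs.size()) {
    if (level != verbosity::QUIET) {
      out << "size mismatch: " << lhs.size() << " != " << rhs.size() << '\n';
    }
    return false;
  }
  bool equal = true;
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    if (lhs[i] == rhs[i]) { continue; }
    equal = false;
    if (level == verbosity::QUIET) { break; }  // result is known, nothing to print
    out << "lhs[" << i << "] = " << lhs[i] << ", rhs[" << i << "] = " << rhs[i] << '\n';
    if (level == verbosity::FIRST_ERROR) { break; }  // stop after the first report
  }
  return equal;
}

// Convenience wrapper that captures the report as a string.
std::string difference_report(std::vector<int> const& lhs,
                              std::vector<int> const& rhs,
                              verbosity level)
{
  std::ostringstream out;
  columns_equal(lhs, rhs, level, out);
  return out.str();
}
```

A call site that used to pass `true` or `false` for the old boolean parameter has to name one of the three levels instead, which is why a change like this breaks existing callers.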
The column equivalency checking code was not handling a particular corner case properly. Fundamentally, there is no requirement that the offsets or child data for null rows in two list columns be the same. An example:
At first glance, these columns do not seem equivalent. However, the only two non-null rows (2 and 4) are identical:
[[3, 3], [5, 5, 5]]
The comparison code was expecting row positions to always be the same inside of child rows, but that does not have to be the case. For example, in the first column, the child row indices that we care about are `[6, 7, 11, 12, 13]`, whereas in the second column they are `[0, 1, 2, 3, 4]`.
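To make those index lists concrete, here is a host-side sketch of expanding the non-null rows of a list column into child row indices. It is only an illustration, not the PR's `generate_child_row_indices`: the struct `host_list_column`, the function `child_row_indices`, and both offset layouts are invented here so that they reproduce the indices quoted above. They are not the columns from the original example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a list column: row i spans child rows
// [offsets[i], offsets[i + 1]) and is null when valids[i] is false.
struct host_list_column {
  std::vector<int32_t> offsets;  // number of rows + 1 entries
  std::vector<bool> valids;      // one entry per row
};

// Expand every non-null row into the indices of the child rows it covers.
std::vector<int32_t> child_row_indices(host_list_column const& c)
{
  assert(c.offsets.size() == c.valids.size() + 1);
  std::vector<int32_t> out;
  for (std::size_t row = 0; row < c.valids.size(); ++row) {
    if (!c.valids[row]) { continue; }  // whatever a null row spans is ignored
    for (int32_t i = c.offsets[row]; i < c.offsets[row + 1]; ++i) {
      out.push_back(i);
    }
  }
  return out;
}

// Invented layout 1: null rows 0, 1 and 3 still span child rows.
host_list_column make_padded_column()
{
  return {{0, 3, 6, 8, 11, 14}, {false, false, true, false, true}};
}

// Invented layout 2: null rows are empty, so the child data is compact.
host_list_column make_compact_column()
{
  return {{0, 0, 0, 2, 2, 5}, {false, false, true, false, true}};
}
```

With these layouts, rows 2 and 4 are the only valid rows in both columns, yet they land on different child positions: 6, 7, 11, 12, 13 in the padded layout and 0 through 4 in the compact one.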
The fix for this is to fundamentally change how the comparison code works so that instead of simply iterating from `0` to `size` for each column, we instead provide an explicit list of row indices that should be compared. The various compare functors now take additional `lhs_row_indices` and `rhs_row_indices` columns to reflect this.
For flat hierarchies, this input is always just `[0, 1, 2, 3, ..., size - 1]`. However, every time we encounter a list column in the hierarchy, the rows that need to be considered for both columns can be completely and arbitrarily changed.
I'm leaving this as a draft as there is a discussion point in the column property comparisons that I think is worth having. Similar to the data values, one of the things the column property comparison wanted to do was simply compare `lhs.size()` to `rhs.size()`. But as we can see for the leaf columns in the above case, they are totally different. However, when we are only checking for equivalency, what matters is that the number of rows we are going to be comparing is the same. Similarly, the null counts cannot be compared directly; only the null counts of the rows we are explicitly comparing can. As far as I can tell, this is the only way to do it, but I'm not sure it's 100% semantically in the spirit of what the column properties are, since we are really checking the properties of a subset of the overall column.
I left a couple of comments in the property comparator code labelled `// DISCUSSION`.
Note: I haven't added tests yet.
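The index-driven comparison and the subset-based property checks described above can be sketched on the host as follows. This is an assumption-laden illustration rather than the PR's code: `host_column`, `null_count_in`, `rows_equivalent`, and the two sample child columns are all hypothetical, and plain `std::vector`s replace device columns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a leaf column.
struct host_column {
  std::vector<int32_t> data;
  std::vector<bool> valids;
};

// Null count restricted to the rows that are actually being compared.
std::size_t null_count_in(host_column const& c, std::vector<int32_t> const& row_indices)
{
  std::size_t nulls = 0;
  for (int32_t row : row_indices) {
    if (!c.valids[row]) { ++nulls; }
  }
  return nulls;
}

// Compare lhs[lhs_row_indices[i]] against rhs[rhs_row_indices[i]] for every i,
// instead of walking both columns from 0 to size.
bool rows_equivalent(host_column const& lhs, std::vector<int32_t> const& lhs_row_indices,
                     host_column const& rhs, std::vector<int32_t> const& rhs_row_indices)
{
  assert(lhs.data.size() == lhs.valids.size() && rhs.data.size() == rhs.valids.size());
  // "Property" checks on the compared subset, not on the whole columns.
  if (lhs_row_indices.size() != rhs_row_indices.size()) { return false; }
  if (null_count_in(lhs, lhs_row_indices) != null_count_in(rhs, rhs_row_indices)) { return false; }
  for (std::size_t i = 0; i < lhs_row_indices.size(); ++i) {
    int32_t const l = lhs_row_indices[i];
    int32_t const r = rhs_row_indices[i];
    if (lhs.valids[l] != rhs.valids[r]) { return false; }                 // validity must match
    if (lhs.valids[l] && lhs.data[l] != rhs.data[r]) { return false; }  // values only matter when valid
  }
  return true;
}

// Invented child data: the values under null parent rows are arbitrary filler.
host_column padded_child()
{
  return {{9, 9, 9, 8, 8, 8, 3, 3, 7, 7, 7, 5, 5, 5}, std::vector<bool>(14, true)};
}
host_column compact_child() { return {{3, 3, 5, 5, 5}, std::vector<bool>(5, true)}; }
```

Comparing `padded_child()` at rows `{6, 7, 11, 12, 13}` against `compact_child()` at rows `{0, 1, 2, 3, 4}` succeeds even though the two columns have sizes 14 and 5, which is the situation the discussion point is about: the size and null-count checks apply to the selected rows only.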