Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] expect_columns_equal() vs expect_columns_equivalent() #5867

Closed
mythrocks opened this issue Aug 6, 2020 · 7 comments
Closed

[QST] expect_columns_equal() vs expect_columns_equivalent() #5867

mythrocks opened this issue Aug 6, 2020 · 7 comments
Labels
libcudf Affects libcudf (C++/CUDA) code. question Further information is requested

Comments

@mythrocks
Copy link
Contributor

I seek guidance on which test scenarios we should be using cudf::test:expect_columns_equal() vs cudf::test::expect_columns_equivalent(). I ran into this while writing structs_column_test, as part of #5807. I was advised to put up an issue for discussion here.

As discussed in #5700, when constructing a structs column, the parent column's null-mask will be ANDed with the children's, but the data streams are not modified.
The values in the resulting child column may be checked for correctness with an expected_column (via the appropriate column wrapper).
For primitive column types, both equivalence and equality checks succeed.
For list columns, they fail. :/

It turns out that cudf::test::column_comparator_impl<list_view> compares list columns by recursively comparing its data/offset contents. Constructing a list containing nulls via list_column_wrapper does not produce the same contents as a column whose null-mask has been twiddled.

Should one modify column_comparator_impl<list_view> to check for equivalence differently from equality?
Also, how would the caller of expect_columns_equal() discern that they should instead be calling expect_columns_equivalent()?

I think @nvdbaranec ran into a similar (or "equivalent" :]) issue, when testing list<string>.

@mythrocks mythrocks added question Further information is requested Needs Triage Need team to review and classify labels Aug 6, 2020
@jrhemstad
Copy link
Contributor

"equal" means two columns are exactly, bitwise equal.

"equivalent" means two columns are functionality equivalent.

Example, given two columns a and b:

  • If a.nullable() == true and a.null_count() == 0, while b.nullable() == false (which implies b.null_count() == ) then:
    • a and b are equivalent if all of their elements are equivalent, but they are not equal because they differ in nullability

This was the original intention of the difference between expect_column_equal and expect_column_equivalent.

Equivalence can also have special meaning for floating point values where elements can be considered equivalent within some units of least precision.

@nvdbaranec
Copy link
Contributor

My opinion here is : "equals" is either semantically incorrect or people are using it with the wrong expectations. 99.9% of the time what you care about is "if I were to pass either of these columns along to further computation I would get the same results" and I'm pretty sure that's how people are using it.

But the current definition of equals is much more strict than this, which causes unexpected side-effects and retroactive breakage. The clear example is how it is ok for empty strings columns to either have children or not. Depending on the whims of the function you called, you might get back a column which will pass expect_columns_equal() today but fail it tomorrow.

The concrete instance of this I'd point out is the change I have in flight to empty_like(). With the new change:

  • if you call empty_like() on a strings column that has values in it, you will get back an empty strings column with children
  • if you call make_empty_column(STRING) you will get back an empty strings column with no children

This is valid by the standards of cudf. But when I made this change, a few random tests started failing because they were using equals() when what they needed was equivalent(). The end result is - 6 or 7 random checks in semi_join_tests.cpp mysteriously use expect_columns_equivalent() when all the surrounding ones use expect_columns_equal() https://github.com/rapidsai/cudf/pull/5816/files#diff-f2512c5550f8bf5f410a0c523f856393 If I'm an end-user that seems random and unexplained.

So I guess I would advocate one of two things:

  • Change expect_columns_equal() to expect_columns_bitwise_equal() or something equally strongly worded. And change expect_columns_equivalent() to expect_columns_equal()

or

  • Make all tests use expect_columns_equivalent() by default and only use expect_columns_equal() when they truly want absolute equality across the board.

@jrhemstad
Copy link
Contributor

jrhemstad commented Aug 6, 2020

"if I were to pass either of these columns along to further computation I would get the same results"

That describes equivalence, not equality.

I'm pretty sure that's how people are using it.

This is likely because expect_columns_equivalent came along long after expect_columns_equal, so for a long time expect_columns_equal was the only thing.

Change expect_columns_equal() to expect_columns_bitwise_equal() or something equally strongly worded. And change expect_columns_equivalent() to expect_columns_equal()

Equivalence vs. equality is a well-established concept in mathematics, programming languages, and even C++ (e.g., a and b are equal if a==b, a and b are equivalent if !(a < b) && !(b < a)). I think we should just stick with the standard terminology.

Make all tests use expect_columns_equivalent() by default and only use expect_columns_equal() when they truly want absolute equality across the board.

I'd definitely support this. I agree that most places should probably use expect_columns_equivalent. The only reason they are not is because many tests were written before expect_columns_equivalent existed.

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 7, 2020
@mythrocks
Copy link
Contributor Author

I'd definitely support this. I agree that most places should probably use expect_columns_equivalent. The only reason they are not is because many tests were written before expect_columns_equivalent existed.

Thank you for the advice, @jrhemstad, @nvdbaranec. I will switch the struct tests to use expect_columns_equivalent(). (It's not much, but it's a start.)
I'll tackle the equivalence checks for list as soon as it is viable, unless someone beats me to it.

@nvdbaranec
Copy link
Contributor

nvdbaranec commented Aug 9, 2020

To throw another log on the fire here : I just realized that expect_columns_equal() already returns true in a case that is just "equivalent". Specifically : sliced columns. The following all passes:

cudf::test::fixed_width_column_wrapper<int> a{1, 2, 3, 4, 5, 6};
auto sliced = cudf::split(a, {2});
    
cudf::test::fixed_width_column_wrapper<int> expected0{1, 2};
cudf::test::expect_columns_equal(sliced[0], expected0);

cudf::test::fixed_width_column_wrapper<int> expected1{3, 4, 5, 6};    
cudf::test::expect_columns_equal(sliced[1], expected1);

But these columns aren't really truly equal. The sliced/split columns have an offset > 0 and their data buffers are different (this difference is more exaggerated in columns that contain offsets such as strings and lists). expected0 and expected1 do not have an offset and their data is different. Equivalent, absolutely, but not the same.

@github-actions
Copy link

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@jrhemstad
Copy link
Contributor

I'm going to mark this as closed as the question of equal vs equivalent has been answered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libcudf Affects libcudf (C++/CUDA) code. question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants