-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Column equality testing fixes #10011
Column equality testing fixes #10011
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10011 +/- ##
===============================================
Coverage ? 10.46%
===============================================
Files ? 122
Lines ? 20523
Branches ? 0
===============================================
Hits ? 2147
Misses ? 18376
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add some tests in test_testing.py
to cover these changes.
python/cudf/cudf/testing/testing.py
Outdated
) | ||
) | ||
) | ||
and cp.allclose( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, cupy.allclose
has a parameter equal_nan
that may simplify the is_nan
check above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see a way of using equal_nan
but I agree that the logic was a little hard to follow in general, so I redid it here. Let me know if you think this is better.
python/cudf/cudf/testing/testing.py
Outdated
elif not ( | ||
(is_string_dtype(left) and is_numeric_dtype(right)) | ||
or (is_numeric_dtype(left) and is_string_dtype(right)) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I follow these checks. Where the handler if the input falls in these dtypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this seems odd. Are we just generally looking to avoid checking this for mismatched dtypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If one column is string and the other is numeric, we can assume the columns are not equal. (1 != '1')
. These lines check to make sure exactly one of the columns is string and the other is numeric, in which case we avoid the entire try/except block and therefore avoid any opportunity for columns_equal
to be set to True
. We should end up on line 243 from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only an issue for string types, or do we need to worry about other types as well? I see categoricals are handled above, what about list/struct dtypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into what happens here and it would appear that when list
or struct
are involved we fall into the left.equals(right)
check at the very end and end up with a TypeError
, except in the case that we're comparing list
to list
or struct
to struct
. That should probably not happen.
I see the point I think you are making though: there's a certain set of dtypes (beyond just string) where, even if check_dtype=False
, we know up front that it's a 100% mismatch between non-null elements simply because a struct can't compare as equal to a "not struct". I think these dtypes are:
- String
- List
- Struct
- Interval
- Decimal
I suppose the only edge case here would be that we should arguably return True
even for that subset of dtypes if the columns are fully null.
If this seems like the correct logic I am happy to go back and wire it up as such here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that all sounds good. For the fully null case, I think whether or not we return True should be determined by check_dtypes
, but if someone has a different opinion I'm open to a different result.
python/cudf/cudf/testing/testing.py
Outdated
if not columns_equal: | ||
msg1 = f"{left.values_host}" | ||
msg2 = f"{right.values_host}" | ||
ldata = [val for val in left.to_pandas(nullable=True)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side question - I've wondered how reliable nullable=True
is given pandas support for nullable float dtype is still non-public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like they're being fairly forward with these as of 1.2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions for improvement here and one question.
python/cudf/cudf/testing/testing.py
Outdated
elif not ( | ||
(is_string_dtype(left) and is_numeric_dtype(right)) | ||
or (is_numeric_dtype(left) and is_string_dtype(right)) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only an issue for string types, or do we need to worry about other types as well? I see categoricals are handled above, what about list/struct dtypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last suggestion, otherwise good on my end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all good. Just minor stuff below.
Co-authored-by: Michael Wang <[email protected]>
@gpucibot merge |
Fixes a bug where empty columns were not comparing correctly as well as a few edge cases with strings
Partially addresses #8513