Column equality testing fixes #10011

brandon-b-miller
Contributor

@brandon-b-miller brandon-b-miller commented Jan 11, 2022

Fixes a bug where empty columns did not compare correctly, as well as a few edge cases involving strings.

Partially addresses #8513

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 11, 2022
@codecov

codecov bot commented Jan 11, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.04@8d2a9cc).
The diff coverage is n/a.

❗ Current head ed7c630 differs from pull request most recent head a44b97f. Consider uploading reports for the commit a44b97f to get more accurate results

@@               Coverage Diff               @@
##             branch-22.04   #10011   +/-   ##
===============================================
  Coverage                ?   10.46%           
===============================================
  Files                   ?      122           
  Lines                   ?    20523           
  Branches                ?        0           
===============================================
  Hits                    ?     2147           
  Misses                  ?    18376           
  Partials                ?        0           

@brandon-b-miller brandon-b-miller added bug Something isn't working non-breaking Non-breaking change 3 - Ready for Review Ready for review by team labels Jan 11, 2022
@brandon-b-miller brandon-b-miller marked this pull request as ready for review January 11, 2022 16:46
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner January 11, 2022 16:46
Contributor

@isVoid isVoid left a comment

Maybe we should add some tests in test_testing.py to cover these changes.

)
)
)
and cp.allclose(
Contributor

FYI, cupy.allclose has a parameter equal_nan that may simplify the is_nan check above.
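
For context, `equal_nan` changes how NaN positions are treated in an allclose-style comparison. The sketch below shows that behaviour in pure Python. `allclose_sketch` is a hypothetical stand-in written for illustration (plain lists, no broadcasting), not CuPy's implementation; it assumes the tolerance rule documented for `numpy.allclose`, which `cupy.allclose` is expected to mirror.

```python
import math


def allclose_sketch(left, right, rtol=1e-05, atol=1e-08, equal_nan=False):
    """Illustrative allclose-style check on two equal-length lists of floats."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if math.isnan(a) or math.isnan(b):
            # A NaN position only passes when equal_nan is set and both are NaN.
            if not (equal_nan and math.isnan(a) and math.isnan(b)):
                return False
            continue
        # Tolerance rule: |a - b| <= atol + rtol * |b|
        if abs(a - b) > atol + rtol * abs(b):
            return False
    return True


left = [1.0, float("nan"), 3.0]
right = [1.0, float("nan"), 3.0 + 1e-9]

default_result = allclose_sketch(left, right)                    # NaN positions fail
nan_aware_result = allclose_sketch(left, right, equal_nan=True)  # NaN positions pass
```

With `equal_nan=True`, a separate NaN mask comparison is only unnecessary if matching NaN positions are meant to count as equal, which is the assumption made here.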

Contributor Author

I didn't see a way of using equal_nan but I agree that the logic was a little hard to follow in general, so I redid it here. Let me know if you think this is better.

Comment on lines 212 to 215
elif not (
    (is_string_dtype(left) and is_numeric_dtype(right))
    or (is_numeric_dtype(left) and is_string_dtype(right))
):
Contributor

I'm not sure I follow these checks. Where is the handler for inputs that fall into these dtypes?

Contributor

@vyasr vyasr Jan 20, 2022

Yeah this seems odd. Are we just generally looking to avoid checking this for mismatched dtypes?

Contributor Author

If one column is string and the other is numeric, we can assume the columns are not equal (1 != '1'). These lines check whether exactly one of the columns is string and the other is numeric. In that case we skip the entire try/except block, so columns_equal never has an opportunity to be set to True. We should end up on line 243 from there.
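
The guard described above can be sketched in isolation. Everything here is hypothetical and for illustration only: the kind labels stand in for `is_string_dtype` / `is_numeric_dtype`, and `permissive_values_equal` is an assumed cast-tolerant comparison that shows why skipping it matters. It is not the code in `testing.py`.

```python
# Hypothetical dtype "kinds"; the PR itself uses is_string_dtype / is_numeric_dtype.
STRING, NUMERIC = "string", "numeric"


def is_string_numeric_mismatch(left_kind, right_kind):
    """True when exactly one side is string and the other is numeric."""
    return {left_kind, right_kind} == {STRING, NUMERIC}


def permissive_values_equal(left, right):
    """Assumed stand-in for a cast-tolerant value comparison (1 vs '1' passes)."""
    return [str(v) for v in left] == [str(v) for v in right]


def columns_equal_sketch(left, right, left_kind, right_kind):
    columns_equal = False
    # Mirrors `elif not (...)`: compare values only when the kinds are compatible.
    if not is_string_numeric_mismatch(left_kind, right_kind):
        columns_equal = permissive_values_equal(left, right)
    return columns_equal


mixed = columns_equal_sketch([1, 2], ["1", "2"], NUMERIC, STRING)  # guard skips comparison
same = columns_equal_sketch([1, 2], [1, 2], NUMERIC, NUMERIC)      # values are compared
```

Because `columns_equal` starts as False and the comparison is skipped for the mixed case, the string-versus-numeric pair is reported as unequal even though the permissive comparison alone would have accepted it.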

Contributor

Is this only an issue for string types, or do we need to worry about other types as well? I see categoricals are handled above, what about list/struct dtypes?

Contributor Author

I looked into what happens here and it would appear that when list or struct are involved we fall into the left.equals(right) check at the very end and end up with a TypeError, except in the case that we're comparing list to list or struct to struct. That should probably not happen.

I see the point I think you are making though: there's a certain set of dtypes (beyond just string) where, even if check_dtype=False, we know up front that it's a 100% mismatch between non-null elements simply because a struct can't compare as equal to a "not struct". I think these dtypes are:

  • String
  • List
  • Struct
  • Interval
  • Decimal

I suppose the only edge case here would be that we should arguably return True even for that subset of dtypes if the columns are fully null.

If this seems like the correct logic I am happy to go back and wire it up as such here.
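
The proposal above could be wired up roughly as follows. This is a sketch under stated assumptions, not the logic merged in this PR: the kind labels, `SELF_ONLY_KINDS`, and `known_mismatch` are hypothetical names, and the all-null branch encodes the "arguably return True" edge case mentioned above.

```python
# Dtype categories listed in the comment above; the string labels are hypothetical.
SELF_ONLY_KINDS = frozenset({"string", "list", "struct", "interval", "decimal"})


def known_mismatch(left_kind, right_kind, left_all_null=False, right_all_null=False):
    """Return True when two columns can be declared unequal without comparing values."""
    if left_kind == right_kind:
        # list vs list, struct vs struct, etc. still need a value comparison.
        return False
    if left_all_null and right_all_null:
        # Edge case from the discussion: there are no non-null elements to disagree.
        return False
    # A struct can never equal a "not struct", and likewise for the other kinds.
    return left_kind in SELF_ONLY_KINDS or right_kind in SELF_ONLY_KINDS


struct_vs_int = known_mismatch("struct", "int")
list_vs_list = known_mismatch("list", "list")
int_vs_float = known_mismatch("int", "float")
all_null_pair = known_mismatch("string", "int", left_all_null=True, right_all_null=True)
```

A check like this would run before any value comparison, so mismatched list or struct inputs are rejected up front instead of reaching `left.equals(right)` and raising a TypeError.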

Contributor

Yep, that all sounds good. For the fully null case, I think whether we return True should be determined by check_dtype, but I'm open to a different result if someone has a different opinion.

if not columns_equal:
    msg1 = f"{left.values_host}"
    msg2 = f"{right.values_host}"
    ldata = [val for val in left.to_pandas(nullable=True)]
Contributor

Side question: I've wondered how reliable nullable=True is, given that pandas support for the nullable float dtype is still non-public.

Contributor Author

It seems like they're being fairly forward with these as of 1.2.0

@shwina shwina changed the base branch from branch-22.02 to branch-22.04 January 20, 2022 21:24
@brandon-b-miller
Contributor Author

everything look ok here? cc @vyasr @isVoid

Contributor

@vyasr vyasr left a comment

Some suggestions for improvement here and one question.

Contributor

@vyasr vyasr left a comment

One last suggestion, otherwise good on my end

Contributor

@isVoid isVoid left a comment

It's all good. Just minor stuff below.

@brandon-b-miller brandon-b-miller added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Feb 8, 2022
@brandon-b-miller
Contributor Author

@gpucibot merge

Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
3 participants