Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

Closed
esnvidia opened this issue Jun 14, 2021 · 9 comments
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@esnvidia
Copy link

Describe the bug
cudf.testing.assert_series_equal gives assertion error when comparing two series that have cp.nan filled in

Steps/Code to reproduce bug

import cupy as cp
import cudf

data= cp.array([[1.5, 0], [0.6, 1], [2.5, 0]])
expect = cudf.Series([1.5, cp.nan, 2.5])

isnull = data[:, 1] > 0.5  # get null mask
data = cudf.Series(data[:,0])

# apply null mask
data.iloc[isnull] = cp.nan

data
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

cudf.testing.assert_series_equal(data, expect, 'Error in Series equal') -->

AssertionError: ColumnBase are different
values are different (0.0 %)
[left]:  [1.5 nan 2.5]
[right]: [1.5 2.5]

Notice that in [right] the NaN is missing. Also notice in the section below, that the cudf has two types of NaN: <NA> and NaN. despite both assigned cp.nan

Expected behavior
No error. Feel free to try the pandas alternative:

pd.testing.assert_series_equal(data.to_pandas(),  expect.to_pandas(), 'Error in Series equal')  --> no assertionerror

data.to_pandas()
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

expect.to_pandas()
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

data
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

expect
Out:
0     1.5
1    <NA>
2     2.5
dtype: float64

Environment overview (please complete the following information)
Rapids 0.19 install via conda

@esnvidia esnvidia added Needs Triage Need team to review and classify bug Something isn't working labels Jun 14, 2021
@quasiben quasiben added the Python Affects Python cuDF API. label Jun 14, 2021
@quasiben
Copy link
Member

@brandon-b-miller is this something which could be resolved by #8442 ?

@brandon-b-miller
Copy link
Contributor

Not sure - I suspect no, but am digging in now.

@brandon-b-miller
Copy link
Contributor

slightly simpler repro

import cupy as cp
import cudf

data = cudf.Series(cp.array([0.0, 0.0, 0.0]))
expect = cudf.Series([0.0, cp.nan, 0.0])

data.iloc[1] = cp.nan

cudf.testing.assert_series_equal(data, expect, 'Error in Series equal')

@brandon-b-miller
Copy link
Contributor

This is a combination of two things:

  1. when we construct from a cupy array with nans in it, we coerce those to nulls. This behavior is debatable.
  2. cudf.testing.assert_series_equal uses to_array() to show the array differences, which is possibly incorrectly filtering the nulls. What to do with nulls during to_array() is also debatable. In any event though, it can at least be changed to show the actual reason the arrays are different (one has a null and the other has a nan) rather than erroneously erroring the way it currently does.

@brandon-b-miller brandon-b-miller self-assigned this Jun 14, 2021
@brandon-b-miller brandon-b-miller removed the Needs Triage Need team to review and classify label Jun 14, 2021
@esnvidia
Copy link
Author

esnvidia commented Jun 15, 2021

Here's my (likely slower) workaround:

from typing import Optional

def assert_series_equal(df1: cudf.Series, df2: cudf.Series, msg: Optional[str]=None):
    if not msg:
        msg = ''
    if (len(df1) != len(df2)) or ((df1.index == df2.index).sum() != len(df1)):
        # check lengths equal and indices align
        raise AssertionError('df1 and df2 indices do not match', msg)
    elif (df1.isna() == df2.isna()).sum() != len(df1):
        # check all nulls equal and all non-nulls equal a non-null.
        raise AssertionError('df1 and df2 NaNs do not match', msg)
    elif (df1.loc[~df1.isna()] != df2.loc[~df2.isna()]).any():
        # check all non nulls equal
        raise AssertionError('values in df1 do not match df2', msg)

feel free to test with the following:

v = cudf.Series(cp.array([1.5, cp.nan]))
w = cudf.Series([3.14, 0, 2.5])
x = cudf.Series([1.5, 0, 2.5])
y = cudf.Series([1.5, cp.nan, 2.5])
z = cudf.Series(cp.array([1.5, 0.5, 2.5 ]))
z.iloc[1] = cp.nan
assert_series_equal(y,z) and assert_series_equal(z,y) and \
 assert_series_equal(v, z[:2]) and assert_series_equal(w[1:], x[1:])

and similar and reverse cases should be the only ones that work.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapids-bot bot pushed a commit that referenced this issue Feb 8, 2022
Fixes a bug where empty columns were not comparing correctly as well as a few edge cases with strings

Partially addresses #8513

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Michael Wang (https://github.com/isVoid)

URL: #10011
@vyasr
Copy link
Contributor

vyasr commented Aug 1, 2022

@brandon-b-miller of your two points above, item (2) is no longer relevant after #10011.

Regarding item 1, don't a lot of invariants break if we change the Series constructor to default to nan_as_null=False? More generally, isn't the default chosen because we assume based on empirical results that users typically have NaNs in pandas objects because they're representing nulls? The above example works exactly as expected if you construct expect with

expect = cudf.Series([1.5, cp.nan, 2.5], nan_as_null=False)

I don't think we can change the default behavior of nan_as_null without breaking many more users than we would help. CC @shwina in case he has a different opinion, but otherwise I don't think there's anything to be done aside from perhaps finding a better way to document how we ingest nan data in the constructors beyond just the API docs.

@brandon-b-miller
Copy link
Contributor

Yeah, I'd agree. Over the long term I'd rather not have nan_as_null at all since it conflates things that are distinct, but you're right, that's the way it is right now due to pandas defaults. The day that pandas nullable types are standard would probably be the right time to do away with it.

@vyasr
Copy link
Contributor

vyasr commented Aug 2, 2022

Sounds good. We can revisit removing nan_as_null when pandas 2.0 is released (I think proper nullable type support is on the roadmap for 2.0, but if not then it'll be later), but for now we can close this issue then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

4 participants