[FEA] Add binop operators that match Spark's expected NaN behavior #4752

Closed
razajafri opened this issue Mar 31, 2020 · 8 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@razajafri
Contributor

razajafri commented Mar 31, 2020

Describe the bug
Spark expects NaNs to behave differently from how most other languages treat them. Currently, cudf does not behave the way Spark expects.

Steps/Code to reproduce bug

TEST_F(BinaryOperationIntegrationTest, Greater_Scalar_Vector_Nan) {
  auto lhs = cudf::make_numeric_scalar(cudf::data_type(cudf::FLOAT32));
  auto s   = static_cast<cudf::experimental::scalar_type_t<float>*>(lhs.get());
  s->set_value(NAN);
  cudf::test::fixed_width_column_wrapper<float> rhs{{INFINITY, 1.02, 5.0, NAN, -43.2, -INFINITY}};
  using TypeOut = cudf::experimental::bool8;
  cudf::test::fixed_width_column_wrapper<TypeOut> expected{true, true, true, false, true, true};
  auto out = cudf::experimental::binary_operation(
      *lhs, rhs, cudf::experimental::binary_operator::GREATER,
      cudf::data_type(cudf::experimental::type_to_id<TypeOut>()));
  cudf::test::expect_columns_equal(expected, out->view());
}

TEST_F(BinaryOperationIntegrationTest, Equal_Scalar_Vector_Nan) {
  auto lhs = cudf::make_numeric_scalar(cudf::data_type(cudf::FLOAT32));
  auto s   = static_cast<cudf::experimental::scalar_type_t<float>*>(lhs.get());
  s->set_value(NAN);
  cudf::test::fixed_width_column_wrapper<float> rhs{{INFINITY, 1.02, 5.0, NAN, -43.2, -INFINITY}};
  using TypeOut = cudf::experimental::bool8;
  cudf::test::fixed_width_column_wrapper<TypeOut> expected{false, false, false, true, false, false};
  auto out = cudf::experimental::binary_operation(
      *lhs, rhs, cudf::experimental::binary_operator::EQUAL,
      cudf::data_type(cudf::experimental::type_to_id<TypeOut>()));
  cudf::test::expect_columns_equal(expected, out->view());
}

The above tests fail with the following errors:

[ RUN      ] BinaryOperationIntegrationTest.Greater_Scalar_Vector_Nan
../tests/utilities/column_utilities.cu:237: Failure
      Expected: differences.size()
      Which is: 5
To be equal to: size_t{0}
      Which is: 0
first difference: lhs[0] = 1, rhs[0] = 0
[  FAILED  ] BinaryOperationIntegrationTest.Greater_Scalar_Vector_Nan (1 ms)
[ RUN      ] BinaryOperationIntegrationTest.Equal_Scalar_Vector_Nan
../tests/utilities/column_utilities.cu:237: Failure
      Expected: differences.size()
      Which is: 1
To be equal to: size_t{0}
      Which is: 0
first difference: lhs[3] = 1, rhs[3] = 0
[  FAILED  ] BinaryOperationIntegrationTest.Equal_Scalar_Vector_Nan (240 ms)

Here is the Java unit test and its output, showing the full column results:

  @Test
  void testNanFloatsComparisons() {
    try (ColumnVector floats = ColumnVector.fromBoxedFloats(Float.POSITIVE_INFINITY, 1.02f, 5.0f, Float.NaN, -43.2f, Float.NEGATIVE_INFINITY);
        Scalar nan = Scalar.fromFloat(Float.NaN);
        ColumnVector floatsGreaterThanNan = floats.greaterThan(nan);
        ColumnVector floatsLessThan = floats.lessThan(nan);
        ColumnVector floatsEquals = floats.equalTo(nan)) {
      HostColumnVector floatsGreaterThan_h = floatsGreaterThanNan.copyToHost();
      HostColumnVector floatsLessThan_h = floatsLessThan.copyToHost();
      HostColumnVector floatsEqualTo_h = floatsEquals.copyToHost();
      for (int i = 0; i < floatsGreaterThan_h.getRowCount(); i++) {
        System.out.printf("| %b | %b | %b |\n", floatsGreaterThan_h.getBoolean(i), floatsLessThan_h.getBoolean(i), floatsEqualTo_h.getBoolean(i));
      }
    }
  }
+-------+-------+-------+
| gt    |  lt   |  eq   |
+-------+-------+-------+
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
+-------+-------+-------+

Expected behavior
The above tests should pass

Additional context
The above tests are for float32, but similar tests should pass for float64. A sketch of the expected semantics follows.
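
For clarity, here is a minimal host-side sketch of the comparison semantics Spark expects (illustrative only, not libcudf code; the function names are made up): NaN compares equal to NaN and greater than any other value, whereas under IEEE 754 every comparison with NaN other than != is false.

#include <cmath>

// Illustrative only: the NaN ordering Spark expects.
bool spark_eq(float a, float b) {
  if (std::isnan(a) && std::isnan(b)) return true;  // NaN == NaN -> true in Spark
  return a == b;                                    // otherwise plain IEEE 754
}

bool spark_gt(float a, float b) {
  if (std::isnan(a)) return !std::isnan(b);         // NaN is greater than any non-NaN value
  if (std::isnan(b)) return false;                  // x > NaN -> false
  return a > b;                                     // otherwise plain IEEE 754
}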

@razajafri razajafri added bug Something isn't working Needs Triage Need team to review and classify labels Mar 31, 2020
@razajafri razajafri added the Spark Functionality that helps Spark RAPIDS label Mar 31, 2020
@harrism
Member

harrism commented Mar 31, 2020

So that readers don't have to compile and run your tests just to think about this bug, please print the output values you get for each test and paste them here. Please include what Spark expects as well.

@razajafri
Contributor Author

Thanks for looking, @harrism. I have updated the bug with the test output. I will update it further with the entire output in the AM.

@harrism harrism added the libcudf Affects libcudf (C++/CUDA) code. label Mar 31, 2020
@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

It would be a lot easier to understand the issue if the expected/actual behavior were summarized in a table. Can you please fill in the Actual column?

NaN > x

| x         | Expected | Actual |
|-----------|----------|--------|
| INFINITY  | true     | false  |
| 1.02      | true     | false  |
| 5.0       | true     | false  |
| NaN       | false    | false  |
| -43.2     | true     | false  |
| -INFINITY | true     | false  |

NaN == x

| x         | Expected | Actual |
|-----------|----------|--------|
| INFINITY  | false    | false  |
| 1.02      | false    | false  |
| 5.0       | false    | false  |
| NaN       | true     | false  |
| -43.2     | false    | false  |
| -INFINITY | false    | false  |

@jrhemstad jrhemstad changed the title [BUG] Nan comparisons are not what spark expects [BUG] NaN comparisons in binops do not match what Spark expects Mar 31, 2020
@shwina
Contributor

shwina commented Mar 31, 2020

Pandas behaviour in the above cases:

In [8]: x = pd.Series([np.inf, 1.02, 5.0, np.nan, -43.2, -np.inf])

In [9]: np.nan > x
Out[9]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [10]: np.nan == x
Out[10]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [11]: np.nan != x
Out[11]:
0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

This looks to be in line with the IEEE 754 standard.

@jrhemstad jrhemstad added feature request New feature or request and removed Needs Triage Need team to review and classify bug Something isn't working labels Mar 31, 2020
@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

Seeing as Spark's behavior is non-conformant with IEEE 754, I'm going to label this as a feature request rather than a bug. Supporting this behavior in libcudf would require adding new binop operators that implement the non-conformant behavior, like SPARK_MAX/SPARK_MIN.
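
A rough sketch of what such an operator's device-side functor could look like, assuming it would sit alongside the existing binop functors (the name SparkMax and its placement are assumptions, not existing libcudf API):

// Hypothetical Spark-flavored max: NaN is treated as larger than any other
// value, unlike std::fmax, which prefers the non-NaN operand.
struct SparkMax {
  __host__ __device__ float operator()(float x, float y) const {
    if (x != x) return x;   // x is NaN: NaN wins, Spark treats NaN as the largest value
    if (y != y) return y;   // y is NaN: same rule applies
    return x > y ? x : y;   // neither is NaN: ordinary max
  }
};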

@jrhemstad jrhemstad changed the title [BUG] NaN comparisons in binops do not match what Spark expects [FEA] NaN comparisons in binops do not match what Spark expects Mar 31, 2020
@jrhemstad jrhemstad changed the title [FEA] NaN comparisons in binops do not match what Spark expects [FEA] Add binop operators that match Spark's expected NaN behavior Mar 31, 2020
@razajafri
Contributor Author

@jrhemstad I have updated the table.

@razajafri
Contributor Author

@jrhemstad is this still valid, considering we have decided to abide by the NaN behavior set by IEEE 754?

@jrhemstad
Contributor

@jrhemstad is this still valid, considering we have decided to abide by the NaN behavior set by IEEE 754?

Thanks for the reminder.

Per conversation in #4760, this should be implemented via composition of other libcudf features. Closing.
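
For reference, a minimal sketch of what that composition could look like with current libcudf APIs (namespaces have moved since the experimental:: code above, and the helper spark_equal here is hypothetical). Spark's NaN-aware equality of two float columns can be built as (lhs == rhs) OR (is_nan(lhs) AND is_nan(rhs)):

#include <cudf/binaryop.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/unary.hpp>
#include <memory>

// Hypothetical helper composing existing libcudf pieces to get Spark's
// NaN-aware equality, where NaN == NaN evaluates to true.
std::unique_ptr<cudf::column> spark_equal(cudf::column_view lhs, cudf::column_view rhs) {
  auto const bool_type = cudf::data_type{cudf::type_id::BOOL8};
  // Plain IEEE 754 equality: false wherever either side is NaN.
  auto ieee_eq  = cudf::binary_operation(lhs, rhs, cudf::binary_operator::EQUAL, bool_type);
  // True wherever both sides are NaN.
  auto lhs_nan  = cudf::is_nan(lhs);
  auto rhs_nan  = cudf::is_nan(rhs);
  auto both_nan = cudf::binary_operation(lhs_nan->view(), rhs_nan->view(),
                                         cudf::binary_operator::LOGICAL_AND, bool_type);
  // Combine: equal in the IEEE sense, or both NaN.
  return cudf::binary_operation(ieee_eq->view(), both_nan->view(),
                                cudf::binary_operator::LOGICAL_OR, bool_type);
}

A similar composition with GREATER and LOGICAL_OR/LOGICAL_AND covers the ordering cases.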
