[FEA] Add binop operators that match Spark's expected NaN behavior #4752

Closed
razajafri opened this issue Mar 31, 2020 · 8 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@razajafri
Contributor

razajafri commented Mar 31, 2020

Describe the bug
Spark expects NaNs to behave differently from how most other languages treat them. Currently, cudf does not behave the way Spark expects.

Steps/Code to reproduce bug

TEST_F(BinaryOperationIntegrationTest, Greater_Scalar_Vector_Nan) {
  auto lhs = cudf::make_numeric_scalar(cudf::data_type(cudf::FLOAT32));
  auto s   = static_cast<cudf::experimental::scalar_type_t<float>*>(lhs.get());
  s->set_value(NAN);
  cudf::test::fixed_width_column_wrapper<float> rhs{{INFINITY, 1.02, 5.0, NAN, -43.2, -INFINITY}};
  using TypeOut = cudf::experimental::bool8;
  cudf::test::fixed_width_column_wrapper<TypeOut> expected{true, true, true, false, true, true};
  auto out = cudf::experimental::binary_operation(
      *lhs, rhs, cudf::experimental::binary_operator::GREATER,
      cudf::data_type(cudf::experimental::type_to_id<TypeOut>()));
  cudf::test::expect_columns_equal(expected, out->view());
}

TEST_F(BinaryOperationIntegrationTest, Equal_Scalar_Vector_Nan) {
  auto lhs = cudf::make_numeric_scalar(cudf::data_type(cudf::FLOAT32));
  auto s   = static_cast<cudf::experimental::scalar_type_t<float>*>(lhs.get());
  s->set_value(NAN);
  cudf::test::fixed_width_column_wrapper<float> rhs{{INFINITY, 1.02, 5.0, NAN, -43.2, -INFINITY}};
  using TypeOut = cudf::experimental::bool8;
  cudf::test::fixed_width_column_wrapper<TypeOut> expected{false, false, false, true, false, false};
  auto out = cudf::experimental::binary_operation(
      *lhs, rhs, cudf::experimental::binary_operator::EQUAL,
      cudf::data_type(cudf::experimental::type_to_id<TypeOut>()));
  cudf::test::expect_columns_equal(expected, out->view());
}

The above tests fail with the following errors:

[ RUN      ] BinaryOperationIntegrationTest.Greater_Scalar_Vector_Nan
../tests/utilities/column_utilities.cu:237: Failure
      Expected: differences.size()
      Which is: 5
To be equal to: size_t{0}
      Which is: 0
first difference: lhs[0] = 1, rhs[0] = 0
[  FAILED  ] BinaryOperationIntegrationTest.Greater_Scalar_Vector_Nan (1 ms)
[ RUN      ] BinaryOperationIntegrationTest.Equal_Scalar_Vector_Nan
../tests/utilities/column_utilities.cu:237: Failure
      Expected: differences.size()
      Which is: 1
To be equal to: size_t{0}
      Which is: 0
first difference: lhs[3] = 1, rhs[3] = 0
[  FAILED  ] BinaryOperationIntegrationTest.Equal_Scalar_Vector_Nan (240 ms)

Here is the Java unit test and its output, showing the full column results:

  @Test
  void testNanFloatsComparisons() {
    try (ColumnVector floats = ColumnVector.fromBoxedFloats(Float.POSITIVE_INFINITY, 1.02f, 5.0f, Float.NaN, -43.2f, Float.NEGATIVE_INFINITY);
        Scalar nan = Scalar.fromFloat(Float.NaN);
        ColumnVector floatsGreaterThanNan = floats.greaterThan(nan);
        ColumnVector floatsLessThan = floats.lessThan(nan);
        ColumnVector floatsEquals = floats.equalTo(nan)) {
      HostColumnVector floatsGreaterThan_h = floatsGreaterThanNan.copyToHost();
      HostColumnVector floatsLessThan_h = floatsLessThan.copyToHost();
      HostColumnVector floatsEqualTo_h = floatsEquals.copyToHost();
      for (int i = 0; i < floatsGreaterThan_h.getRowCount(); i++) {
        System.out.printf("| %b | %b | %b |\n", floatsGreaterThan_h.getBoolean(i), floatsLessThan_h.getBoolean(i), floatsEqualTo_h.getBoolean(i));
      }
    }
  }
+-------+-------+-------+
| gt    |  lt   |  eq   |
+-------+-------+-------+
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
+-------+-------+-------+

Expected behavior
The above tests should pass

Additional context
The above tests are for float32, but similar tests should pass for float64. A sketch of the expected semantics follows.
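
For clarity, here is a minimal host-side sketch of the comparison semantics Spark expects (illustrative only, not libcudf code; the function names are made up): NaN compares equal to NaN and greater than any other value, whereas under IEEE 754 every comparison with NaN other than != is false.

#include <cmath>

// Illustrative only: the NaN ordering Spark expects.
bool spark_eq(float a, float b) {
  if (std::isnan(a) && std::isnan(b)) return true;  // NaN == NaN -> true in Spark
  return a == b;                                    // otherwise plain IEEE 754
}

bool spark_gt(float a, float b) {
  if (std::isnan(a)) return !std::isnan(b);         // NaN is greater than any non-NaN value
  if (std::isnan(b)) return false;                  // x > NaN -> false
  return a > b;                                     // otherwise plain IEEE 754
}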

@razajafri razajafri added bug Something isn't working Needs Triage Need team to review and classify labels Mar 31, 2020
@razajafri razajafri added the Spark Functionality that helps Spark RAPIDS label Mar 31, 2020
@harrism
Member

harrism commented Mar 31, 2020

So that readers don't have to compile and run your tests just to think about this bug, please print the output values you get for each test and paste them here. Please include what Spark expects as well.

@razajafri
Contributor Author

Thanks for looking, @harrism. I have updated the bug with the test output. I will update it further with the entire output in the AM.

@harrism harrism added the libcudf Affects libcudf (C++/CUDA) code. label Mar 31, 2020
@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

It would be a lot easier to understand the issue if the expected/actual behavior were summarized in a table. Can you please fill in the Actual column?

NaN > x

| x         | Expected | Actual |
|-----------|----------|--------|
| INFINITY  | true     | false  |
| 1.02      | true     | false  |
| 5.0       | true     | false  |
| NaN       | false    | false  |
| -43.2     | true     | false  |
| -INFINITY | true     | false  |

NaN == x

| x         | Expected | Actual |
|-----------|----------|--------|
| INFINITY  | false    | false  |
| 1.02      | false    | false  |
| 5.0       | false    | false  |
| NaN       | true     | false  |
| -43.2     | false    | false  |
| -INFINITY | false    | false  |

@jrhemstad jrhemstad changed the title [BUG] Nan comparisons are not what spark expects [BUG] NaN comparisons in binops do not match what Spark expects Mar 31, 2020
@shwina
Contributor

shwina commented Mar 31, 2020

Pandas behaviour in the above cases:

In [8]: x = pd.Series([np.inf, 1.02, 5.0, np.nan, -43.2, -np.inf])

In [9]: np.nan > x
Out[9]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [10]: np.nan == x
Out[10]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [11]: np.nan != x
Out[11]:
0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

This looks to be in line with the IEEE 754 standard.

@jrhemstad jrhemstad added feature request New feature or request and removed Needs Triage Need team to review and classify bug Something isn't working labels Mar 31, 2020
@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

Seeing as Spark's behavior is non-conformant with IEEE 754, I'm going to label this as a feature request rather than a bug. Supporting this behavior in libcudf would require adding new binop operators that implement the non-conformant behavior, like SPARK_MAX/SPARK_MIN.
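
A rough sketch of what such an operator's device-side functor could look like, assuming it would sit alongside the existing binop functors (the name SparkMax and its placement are assumptions, not existing libcudf API):

// Hypothetical Spark-flavored max: NaN is treated as larger than any other
// value, unlike std::fmax, which prefers the non-NaN operand.
struct SparkMax {
  __host__ __device__ float operator()(float x, float y) const {
    if (x != x) return x;   // x is NaN: NaN wins, Spark treats NaN as the largest value
    if (y != y) return y;   // y is NaN: same rule applies
    return x > y ? x : y;   // neither is NaN: ordinary max
  }
};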

@jrhemstad jrhemstad changed the title [BUG] NaN comparisons in binops do not match what Spark expects [FEA] NaN comparisons in binops do not match what Spark expects Mar 31, 2020
@jrhemstad jrhemstad changed the title [FEA] NaN comparisons in binops do not match what Spark expects [FEA] Add binop operators that match Spark's expected NaN behavior Mar 31, 2020
@razajafri
Contributor Author

@jrhemstad I have updated the table.

@razajafri
Contributor Author

@jrhemstad is this still valid, considering we have decided to abide by the NaN behavior set by IEEE 754?

@jrhemstad
Contributor

@jrhemstad is this still valid, considering we have decided to abide by the NaN behavior set by IEEE 754?

Thanks for the reminder.

Per conversation in #4760, this should be implemented via composition of other libcudf features. Closing.
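
For reference, a minimal sketch of what that composition could look like with current libcudf APIs (namespaces have moved since the experimental:: code above, and the helper spark_equal here is hypothetical). Spark's NaN-aware equality of two float columns can be built as (lhs == rhs) OR (is_nan(lhs) AND is_nan(rhs)):

#include <cudf/binaryop.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/unary.hpp>
#include <memory>

// Hypothetical helper composing existing libcudf pieces to get Spark's
// NaN-aware equality, where NaN == NaN evaluates to true.
std::unique_ptr<cudf::column> spark_equal(cudf::column_view lhs, cudf::column_view rhs) {
  auto const bool_type = cudf::data_type{cudf::type_id::BOOL8};
  // Plain IEEE 754 equality: false wherever either side is NaN.
  auto ieee_eq  = cudf::binary_operation(lhs, rhs, cudf::binary_operator::EQUAL, bool_type);
  // True wherever both sides are NaN.
  auto lhs_nan  = cudf::is_nan(lhs);
  auto rhs_nan  = cudf::is_nan(rhs);
  auto both_nan = cudf::binary_operation(lhs_nan->view(), rhs_nan->view(),
                                         cudf::binary_operator::LOGICAL_AND, bool_type);
  // Combine: equal in the IEEE sense, or both NaN.
  return cudf::binary_operation(ieee_eq->view(), both_nan->view(),
                                cudf::binary_operator::LOGICAL_OR, bool_type);
}

A similar composition with GREATER and LOGICAL_OR/LOGICAL_AND covers the ordering cases.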
