Describe the bug
The datafusion_physical_expr::aggregate::min_max::MinAccumulator and MaxAccumulator appear to use different approaches to calculating the minimum and maximum at different times, which can lead to inconsistent results.
The min/max are extracted from batches using arrow_arith::aggregate::max and arrow_arith::aggregate::min, which state:
"For floating point arrays any NaN values are considered to be greater than any other non-null value"
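For illustration, here is a minimal sketch of that batch-level behavior (assuming arrow_arith is available as a direct dependency; the hypothetical batch_max_demo helper is only for demonstration):
use arrow_array::Float32Array;
use arrow_arith::aggregate;

fn batch_max_demo() {
    // The batch kernel treats NaN as greater than any non-null value,
    // so the max of [0.0, NaN] is NaN regardless of element order.
    let vals = Float32Array::from(vec![0.0_f32, f32::NAN]);
    assert!(aggregate::max(&vals).unwrap().is_nan());
}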
Once these values are extracted from the batch, they are then compared with the current ScalarValue. I had assumed this was using the PartialOrd implementation of ScalarValue, which uses total order (which would still be inconsistent).
However, it appears that f32::min and f32::max are used instead. These methods give us a delightful third interpretation of floating point ordering, which is:
"If one of the arguments is NaN, then the other argument is returned."
This means that min considers NaN to be greater and max considers NaN to be smaller 😵
As a result, it seems the MinAccumulator is consistent (NaN is always greater) but not aligned with total ordering. The MaxAccumulator is inconsistent and can give different results depending on how the data is batched up.
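A quick sketch of the scalar semantics described above; this is just the standard library's f32 behavior, nothing DataFusion-specific:
fn scalar_min_max_demo() {
    // f32::max and f32::min ignore a NaN argument and return the other value,
    // so NaN can never win either comparison, regardless of argument order.
    assert_eq!(f32::max(f32::NAN, 0.0), 0.0);
    assert_eq!(f32::max(0.0, f32::NAN), 0.0);
    assert_eq!(f32::min(f32::NAN, 0.0), 0.0);
    assert_eq!(f32::min(0.0, f32::NAN), 0.0);
}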
To Reproduce
use std::sync::Arc;

use arrow::datatypes::DataType;
use arrow_array::{ArrayRef, Float32Array};
use datafusion::physical_plan::Accumulator;
use datafusion_physical_expr::expressions::MaxAccumulator;

pub fn main() {
    let pos_nan = f32::NAN;
    let zero = 0_f32;
    let vals: ArrayRef = Arc::new(Float32Array::from_iter_values(
        [zero, pos_nan].iter().copied(),
    ));

    // We accumulate the above two rows in two different ways. First, we pass in both as a single batch.
    // This will use `aggregate::max`, which always puts `NaN` as the max.
    let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap();
    accumulator.update_batch(&[vals]).unwrap();
    dbg!(&accumulator.evaluate().unwrap()); // prints NaN

    // Next, we pass the two values in two different batches. This will use `f32::max`, which never
    // puts `NaN` as the max.
    let vals_a: ArrayRef = Arc::new(Float32Array::from_iter_values([pos_nan].iter().copied()));
    let vals_b: ArrayRef = Arc::new(Float32Array::from_iter_values([zero].iter().copied()));
    let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap();
    accumulator.update_batch(&[vals_a]).unwrap();
    accumulator.update_batch(&[vals_b]).unwrap();
    dbg!(&accumulator.evaluate().unwrap()); // prints 0
}
Expected behavior
My preference would be that everything be done in total order. I am using ScalarValue's PartialOrd internally and I would prefer that it be consistent with min & max.
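For reference, a minimal sketch of what a total-order comparison could look like at the scalar level, using the standard library's f32::total_cmp; the total_order_max helper is hypothetical and only an illustration, not the existing accumulator code:
use std::cmp::Ordering;

// Under f32::total_cmp, a (positive) NaN sorts above every other value, so both
// batching strategies in the reproduction above would agree that the max is NaN.
fn total_order_max(a: f32, b: f32) -> f32 {
    match a.total_cmp(&b) {
        Ordering::Less => b,
        _ => a,
    }
}

fn main() {
    assert!(total_order_max(f32::NAN, 0.0).is_nan());
    assert!(total_order_max(0.0, f32::NAN).is_nan());
}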
Additional context
No response
FWIW I think the min/max accumulators are some of the oldest code in DataFusion. I am very much on board with the idea of standardizing on the same ordering (total ordering).