
Inconsistent null handling in min/max #8031

Closed
westonpace opened this issue Nov 2, 2023 · 3 comments · Fixed by #10627
Labels
bug Something isn't working

Comments

@westonpace
Member

Describe the bug

The datafusion_physical_expr::aggregate::min_max::MinAccumulator and MaxAccumulator appear to use different approaches to calculating the minimum and maximum at different points, which can lead to inconsistent results.

The min/max are extracted from batches using arrow_arith::aggregate::max and arrow_arith::aggregate::min which state:

For floating point arrays any NaN values are considered to be greater than any other non-null value

Once these values are extracted from the batch, they are then compared with the current ScalarValue. I had assumed this comparison used the PartialOrd implementation of ScalarValue, which uses total order (which would still be inconsistent).

However, it appears that f32::min and f32::max are used instead. These methods give us a delightful third interpretation of floating point ordering:

If one of the arguments is NaN, then the other argument is returned.

This means that min effectively treats NaN as greater and max effectively treats NaN as smaller 😵

As a result, the MinAccumulator is consistent (NaN is always greater) but not aligned with total ordering, while the MaxAccumulator is inconsistent and can give different results depending on how the data is batched.
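
For illustration, here is a minimal standalone sketch (standard library only, not DataFusion code) contrasting the `f32::max`/`f32::min` semantics with the IEEE 754 total ordering implemented by `f32::total_cmp`:

```rust
use std::cmp::Ordering;

fn main() {
    let nan = f32::NAN;
    let zero = 0.0_f32;

    // `f32::max` and `f32::min` ignore NaN: if one argument is NaN, the other is returned,
    // so NaN can never "win" either comparison.
    assert_eq!(nan.max(zero), zero);
    assert_eq!(nan.min(zero), zero);

    // IEEE 754 total ordering places (positive) NaN above every other value,
    // including positive infinity.
    assert_eq!(nan.total_cmp(&zero), Ordering::Greater);
    assert_eq!(nan.total_cmp(&f32::INFINITY), Ordering::Greater);
}
```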

To Reproduce

```rust
use std::sync::Arc;

use arrow::datatypes::DataType;
use arrow_array::{ArrayRef, Float32Array};
use datafusion::physical_plan::Accumulator;
use datafusion_physical_expr::expressions::MaxAccumulator;

pub fn main() {
    let pos_nan = f32::NAN;
    let zero = 0_f32;

    let vals: ArrayRef = Arc::new(Float32Array::from_iter_values(
        [zero, pos_nan].iter().copied(),
    ));

    // We accumulate the above two rows in two different ways.  First, we pass in both as a single batch
    // This will use `aggregate::max` which always puts `NaN` as the max
    let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap();
    accumulator.update_batch(&[vals]).unwrap();
    dbg!(&accumulator.evaluate().unwrap()); // prints NaN

    // Next we pass the two values in two different batches.  This will use `f32::max`, which never
    // puts `NaN` as the max
    let vals_a: ArrayRef = Arc::new(Float32Array::from_iter_values([pos_nan].iter().copied()));
    let vals_b: ArrayRef = Arc::new(Float32Array::from_iter_values([zero].iter().copied()));

    let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap();
    accumulator.update_batch(&[vals_a]).unwrap();
    accumulator.update_batch(&[vals_b]).unwrap();
    dbg!(&accumulator.evaluate().unwrap()); // prints 0
}
```

Expected behavior

My preference would be that everything be done in total order. I am using ScalarValue's PartialOrd internally and I would prefer that it be consistent with min & max.
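
As a sketch of what consistent behaviour could look like (the helper below is hypothetical, not DataFusion's API), merging values with `f32::total_cmp` yields the same answer no matter how the rows are split across batches:

```rust
use std::cmp::Ordering;

// Hypothetical accumulator step: keep the running maximum under total ordering,
// so NaN is always treated as larger than any other value.
fn update_max(current: Option<f32>, value: f32) -> Option<f32> {
    match current {
        None => Some(value),
        Some(cur) if cur.total_cmp(&value) == Ordering::Less => Some(value),
        Some(cur) => Some(cur),
    }
}

fn main() {
    // Single batch vs. one row per batch: both orderings report NaN as the maximum.
    let single = [0.0_f32, f32::NAN].iter().fold(None, |acc, &v| update_max(acc, v));
    let split = [f32::NAN, 0.0_f32].iter().fold(None, |acc, &v| update_max(acc, v));
    println!("{single:?} {split:?}"); // Some(NaN) Some(NaN)
}
```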

Additional context

No response

@westonpace added the bug (Something isn't working) label on Nov 2, 2023
@westonpace
Member Author

It's possible this would be addressed by #6842 (though I don't fully understand that ticket).

@tustvold
Contributor

tustvold commented Nov 2, 2023

FWIW I would not be opposed to changing the behaviour of the arrow kernels, tbh I thought they had been updated to use total ordering...

@alamb
Contributor

alamb commented Nov 3, 2023

FWIW I think the min/max accumulators are some of the oldest code in DataFusion. I am very much on board with the idea of standardizing on the same behavior (total ordering).
