Describe the bug
The datafusion_physical_expr::aggregate::min_max::MinAccumulator and MaxAccumulator appear to use different approaches to calculating the minimum and maximum at different times, which can lead to inconsistent results.
The min/max are extracted from batches using arrow_arith::aggregate::max and arrow_arith::aggregate::min, which state:
"For floating point arrays any NaN values are considered to be greater than any other non-null value"
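For illustration, here is a minimal sketch of that batch-level behavior (assuming arrow_arith is available as a direct dependency; the hypothetical batch_max_demo helper is only for demonstration):
use arrow_array::Float32Array;
use arrow_arith::aggregate;

fn batch_max_demo() {
    // The batch kernel treats NaN as greater than any non-null value,
    // so the max of [0.0, NaN] is NaN regardless of element order.
    let vals = Float32Array::from(vec![0.0_f32, f32::NAN]);
    assert!(aggregate::max(&vals).unwrap().is_nan());
}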
Once these values are extracted from the batch, they are then compared with the current ScalarValue. I had assumed this was using the PartialOrd implementation of ScalarValue, which uses total order (which would still be inconsistent).
However, it appears that f32::min and f32::max are used instead. These methods give us a delightful third interpretation of floating point ordering, which is:
"If one of the arguments is NaN, then the other argument is returned."
This means that min considers NaN to be greater and max considers NaN to be smaller 😵
As a result, it seems the MinAccumulator is consistent (NaN is always greater) but not aligned with total ordering. The MaxAccumulator is inconsistent and can give different results depending on how the data is batched up.
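A quick sketch of the scalar semantics described above; this is just the standard library's f32 behavior, nothing DataFusion-specific:
fn scalar_min_max_demo() {
    // f32::max and f32::min ignore a NaN argument and return the other value,
    // so NaN can never win either comparison, regardless of argument order.
    assert_eq!(f32::max(f32::NAN, 0.0), 0.0);
    assert_eq!(f32::max(0.0, f32::NAN), 0.0);
    assert_eq!(f32::min(f32::NAN, 0.0), 0.0);
    assert_eq!(f32::min(0.0, f32::NAN), 0.0);
}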
To Reproduce
use std::sync::Arc;

use arrow::datatypes::DataType;
use arrow_array::{ArrayRef, Float32Array};
use datafusion::physical_plan::Accumulator;
use datafusion_physical_expr::expressions::MaxAccumulator;

pub fn main() {
    let pos_nan = f32::NAN;
    let zero = 0_f32;
    let vals: ArrayRef = Arc::new(Float32Array::from_iter_values(
        [zero, pos_nan].iter().copied(),
    ));

    // We accumulate the above two rows in two different ways. First, we pass in both as a single batch.
    // This will use `aggregate::max`, which always puts `NaN` as the max.
    let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap();
    accumulator.update_batch(&[vals]).unwrap();
    dbg!(&accumulator.evaluate().unwrap()); // prints NaN

    // Next, we pass the two values in two different batches. This will use `f32::max`, which never
    // puts `NaN` as the max.
    let vals_a: ArrayRef = Arc::new(Float32Array::from_iter_values([pos_nan].iter().copied()));
    let vals_b: ArrayRef = Arc::new(Float32Array::from_iter_values([zero].iter().copied()));
    let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap();
    accumulator.update_batch(&[vals_a]).unwrap();
    accumulator.update_batch(&[vals_b]).unwrap();
    dbg!(&accumulator.evaluate().unwrap()); // prints 0
}
Expected behavior
My preference would be that everything be done in total order. I am using ScalarValue's PartialOrd internally and I would prefer that it be consistent with min & max.
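For reference, a minimal sketch of what a total-order comparison could look like at the scalar level, using the standard library's f32::total_cmp; the total_order_max helper is hypothetical and only an illustration, not the existing accumulator code:
use std::cmp::Ordering;

// Under f32::total_cmp, a (positive) NaN sorts above every other value, so both
// batching strategies in the reproduction above would agree that the max is NaN.
fn total_order_max(a: f32, b: f32) -> f32 {
    match a.total_cmp(&b) {
        Ordering::Less => b,
        _ => a,
    }
}

fn main() {
    assert!(total_order_max(f32::NAN, 0.0).is_nan());
    assert!(total_order_max(0.0, f32::NAN).is_nan());
}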
Additional context
No response
FWIW I think the min/max accumulators are some of the oldest code in DataFusion. I am very much on board with the idea of standardizing on the same ordering (total ordering).