-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix panic in median "AggregateState is not a scalar aggregate" #4488
Conversation
@@ -436,6 +436,90 @@ async fn median_test( | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
// test case for https://github.com/apache/arrow-datafusion/issues/3105 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of the tests added in this file panic on master
@@ -720,7 +720,7 @@ impl std::hash::Hash for ScalarValue { | |||
/// dictionary array | |||
#[inline] | |||
fn get_dict_value<K: ArrowDictionaryKeyType>( | |||
array: &ArrayRef, | |||
array: &dyn Array, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Co-authored-by: Raphael Taylor-Davies <[email protected]>
@@ -216,6 +216,70 @@ SELECT approx_median(a) FROM median_f64_nan | |||
---- | |||
NaN | |||
|
|||
# median_multi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ported the tests to sqllogictest as much of the rest of the aggregate tests had been ported too
…e#4488) * Fix panic in median "AggregateState is not a scalar aggregate" * Apply suggestions from code review Co-authored-by: Raphael Taylor-Davies <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]> (cherry picked from commit 31bbe6c)
Which issue does this PR close?
closes #3105
Rationale for this change
I think median as currently implemented is unusable except for everything but the simplest of queries (where there is no intermediate repartitioning)
Median was the only aggregate that stored it state using
AggregateState::Array
while all other aggregages useAggregateState::Scalar
It turns out that
AggregateState::Array
doesn't support partial aggregates where partial state has to be marshalled into anArray
and combined. Thus,median
likely never worked for any DataFusion plan that has more than one partition (where the data is repartitioned for parallel aggregation).What changes are included in this PR?
I tried a few ways to implement partial state for
AggregateState::Array
and they all got messy very quickly.Thus, I opted for what I think is the simpler, possibly less performant, approach of using Scalars rather than Arrays.
Median was added in #3009 by @andygrove (and sadly I think I have have suggested the array based approach without fully understanding the implications)
For any real world dataset where the performance different between approaches would be measurable, I believe the the current implementation is going to error anyways.
This approach will also extend nicely to Decimal as well
Are these changes tested?
Yes
Are there any user-facing changes?
Less error
Proposed follow on
If this approach is accepted, I propose removing the AggregateState enum and update the Accumulator trait to simplify it as a follow on PR.
cc @andygrove as he originally wrote this code in #3009
cc @tustvold as I think he is thinking about how to improve the grouping situation overall