Fix panic in median "AggregateState is not a scalar aggregate" #4488

alamb · 2022-12-02T19:40:08Z

Which issue does this PR close?

Rationale for this change

I think median as currently implemented is unusable for everything but the simplest of queries (those with no intermediate repartitioning).

Median was the only aggregate that stored its state using AggregateState::Array, while all other aggregates use AggregateState::Scalar.

It turns out that AggregateState::Array doesn't support partial aggregates where partial state has to be marshalled into an Array and combined. Thus, median likely never worked for any DataFusion plan that has more than one partition (where the data is repartitioned for parallel aggregation).
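To illustrate why a median's partial state has to carry the values themselves, here is a minimal sketch in plain Rust. It does not use any DataFusion types; `median`, `median_via_partial_state`, and `median_of_partition_medians` are hypothetical helper names, and each partition is modelled as a `Vec<f64>`.

```rust
/// Median of a slice of f64 values; returns None for empty input.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Two-phase shape (assumed for illustration): every partition ships all of
/// its values as partial state, and the final phase combines them.
fn median_via_partial_state(partitions: &[Vec<f64>]) -> Option<f64> {
    let merged: Vec<f64> = partitions.iter().flatten().copied().collect();
    median(&merged)
}

/// Incorrect shortcut: reduce each partition to its own median first,
/// then take the median of those partial results.
fn median_of_partition_medians(partitions: &[Vec<f64>]) -> Option<f64> {
    let partials: Vec<f64> = partitions.iter().filter_map(|p| median(p)).collect();
    median(&partials)
}
```

For the partitions `[1, 2, 100]` and `[3, 4, 5, 6, 7]`, the first function returns `Some(4.5)` while the shortcut returns `Some(3.5)`, which is why the partial state cannot be reduced to a single value per partition.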

What changes are included in this PR?

I tried a few ways to implement partial state for AggregateState::Array and they all got messy very quickly.

Thus, I opted for what I think is the simpler, possibly less performant, approach of using Scalars rather than Arrays.
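A minimal sketch of the scalar-based idea, assuming a simplified accumulator over `f64` values. `MedianAccumulator` and its methods are hypothetical names chosen for illustration; this is not DataFusion's `Accumulator` trait or its `ScalarValue` type, only the general shape of keeping individual values as state and merging them across partitions.

```rust
/// Hypothetical accumulator that keeps every input value as an individual
/// scalar, so partial state from several partitions can simply be appended.
#[derive(Debug, Default)]
struct MedianAccumulator {
    values: Vec<f64>,
}

impl MedianAccumulator {
    /// Partial phase: absorb a batch of raw input values.
    fn update_batch(&mut self, batch: &[f64]) {
        self.values.extend_from_slice(batch);
    }

    /// Partial state handed to the final phase: all values seen so far.
    fn state(&self) -> Vec<f64> {
        self.values.clone()
    }

    /// Final phase: merge the partial state produced by another partition.
    fn merge(&mut self, partial: &[f64]) {
        self.values.extend_from_slice(partial);
    }

    /// Compute the median over everything accumulated; None if empty.
    fn evaluate(&self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let mid = sorted.len() / 2;
        if sorted.len() % 2 == 0 {
            Some((sorted[mid - 1] + sorted[mid]) / 2.0)
        } else {
            Some(sorted[mid])
        }
    }
}
```

In this sketch, one accumulator per partition calls `update_batch`, and a final accumulator calls `merge` on each partition's `state()` before `evaluate`. Copying and sorting all values is the "possibly less performant" trade-off mentioned above.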

Median was added in #3009 by @andygrove (and sadly I think I may have suggested the array-based approach without fully understanding the implications).

For any real-world dataset where the performance difference between the approaches would be measurable, I believe the current implementation is going to error anyway.

This approach will also extend nicely to Decimal.

Are these changes tested?

Yes

Are there any user-facing changes?

Fewer errors

Proposed follow on

If this approach is accepted, I propose removing the AggregateState enum and updating the Accumulator trait to simplify it in a follow-on PR.

cc @andygrove as he originally wrote this code in #3009

cc @tustvold as I think he is thinking about how to improve the grouping situation overall

alamb · 2022-12-02T19:41:30Z

datafusion/core/tests/sql/aggregates.rs

@@ -436,6 +436,90 @@ async fn median_test(
    Ok(())
 }

+#[tokio::test]
+// test case for https://github.com/apache/arrow-datafusion/issues/3105


all of the tests added in this file panic on master

tustvold · 2022-12-08T18:46:56Z

datafusion/common/src/scalar.rs

@@ -720,7 +720,7 @@ impl std::hash::Hash for ScalarValue {
 /// dictionary array
 #[inline]
 fn get_dict_value<K: ArrowDictionaryKeyType>(
-    array: &ArrayRef,
+    array: &dyn Array,


datafusion/physical-expr/src/aggregate/median.rs

Co-authored-by: Raphael Taylor-Davies <[email protected]>

alamb · 2022-12-09T11:11:02Z

datafusion/core/tests/sqllogictests/test_files/aggregate.slt

@@ -216,6 +216,70 @@ SELECT approx_median(a) FROM median_f64_nan
 ----
 NaN

+# median_multi


I ported the tests to sqllogictest, as most of the other aggregate tests had already been ported.

datafusion/physical-expr/src/aggregate/median.rs

…e#4488) * Fix panic in median "AggregateState is not a scalar aggregate" * Apply suggestions from code review Co-authored-by: Raphael Taylor-Davies <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]> (cherry picked from commit 31bbe6c)

Fix panic in median "AggregateState is not a scalar aggregate"

056b485

github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Dec 2, 2022

alamb mentioned this pull request Dec 2, 2022

Median aggregation using DataFrame panics: "AggregateState is not a scalar aggregate" #3105

Closed

alamb commented Dec 2, 2022


alamb mentioned this pull request Dec 2, 2022

Minor: Update docstrings and comments to aggregate code #4489

Merged

alamb marked this pull request as ready for review December 2, 2022 21:26

tustvold approved these changes Dec 8, 2022


alamb and others added 2 commits December 9, 2022 05:50

Apply suggestions from code review

5e72719

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Merge remote-tracking branch 'apache/master' into alamb/fix_median

7ca666d

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 9, 2022

alamb commented Dec 9, 2022


alamb merged commit 31bbe6c into apache:master Dec 9, 2022

alamb deleted the alamb/fix_median branch December 9, 2022 12:04

alamb mentioned this pull request Dec 11, 2022

Remove AggregateState wrapper #4582

Merged

jonmmease mentioned this pull request Dec 20, 2022

Add median support vega/vegafusion#192

Merged

jonmmease mentioned this pull request Jan 6, 2023

VegaFusion: DataFusion 15.0 with fixes jonmmease/arrow-datafusion#124

Closed
