Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic in median "AggregateState is not a scalar aggregate" #4488

Merged
merged 3 commits into from
Dec 9, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 2, 2022

Which issue does this PR close?

closes #3105

Rationale for this change

I think median as currently implemented is unusable except for everything but the simplest of queries (where there is no intermediate repartitioning)

Median was the only aggregate that stored it state using AggregateState::Array while all other aggregages use AggregateState::Scalar

It turns out that AggregateState::Array doesn't support partial aggregates where partial state has to be marshalled into an Array and combined. Thus, median likely never worked for any DataFusion plan that has more than one partition (where the data is repartitioned for parallel aggregation).

What changes are included in this PR?

I tried a few ways to implement partial state for AggregateState::Array and they all got messy very quickly.

Thus, I opted for what I think is the simpler, possibly less performant, approach of using Scalars rather than Arrays.

Median was added in #3009 by @andygrove (and sadly I think I have have suggested the array based approach without fully understanding the implications)

For any real world dataset where the performance different between approaches would be measurable, I believe the the current implementation is going to error anyways.

This approach will also extend nicely to Decimal as well

Are these changes tested?

Yes

Are there any user-facing changes?

Less error

Proposed follow on

If this approach is accepted, I propose removing the AggregateState enum and update the Accumulator trait to simplify it as a follow on PR.

cc @andygrove as he originally wrote this code in #3009

cc @tustvold as I think he is thinking about how to improve the grouping situation overall

@@ -436,6 +436,90 @@ async fn median_test(
Ok(())
}

#[tokio::test]
// test case for https://github.com/apache/arrow-datafusion/issues/3105
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of the tests added in this file panic on master

@alamb alamb marked this pull request as ready for review December 2, 2022 21:26
@@ -720,7 +720,7 @@ impl std::hash::Hash for ScalarValue {
/// dictionary array
#[inline]
fn get_dict_value<K: ArrowDictionaryKeyType>(
array: &ArrayRef,
array: &dyn Array,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 9, 2022
@@ -216,6 +216,70 @@ SELECT approx_median(a) FROM median_f64_nan
----
NaN

# median_multi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ported the tests to sqllogictest as much of the rest of the aggregate tests had been ported too

@alamb alamb merged commit 31bbe6c into apache:master Dec 9, 2022
@alamb alamb deleted the alamb/fix_median branch December 9, 2022 12:04
jonmmease pushed a commit to jonmmease/arrow-datafusion that referenced this pull request Dec 20, 2022
…e#4488)

* Fix panic in median "AggregateState is not a scalar aggregate"

* Apply suggestions from code review

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Co-authored-by: Raphael Taylor-Davies <[email protected]>

(cherry picked from commit 31bbe6c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Median aggregation using DataFrame panics: "AggregateState is not a scalar aggregate"
2 participants