Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995

Merged
merged 15 commits into apache:main on Jan 8, 2025

Conversation

kosiew
Contributor

@kosiew kosiew commented Jan 3, 2025

Which issue does this PR close?

Closes #13949.

Rationale for this change

When an aggregation operator spills intermediate (partial) state to disk, it needs a schema that includes both the group-by columns and the partial aggregator state columns (e.g., partial sums and counts). Previously, the code used the original input schema for spilling, which lacks the additional columns that hold aggregator state. As a result, reading back the spilled data caused a column-count mismatch error:

ArrowError(InvalidArgumentError(
  "number of columns(3) must match number of fields(2) in schema"
))

This PR addresses that by introducing a partial aggregation schema that combines group columns and aggregator state columns, ensuring consistency when spilling and later reading the spilled data.
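The mismatch can be reproduced in miniature. The sketch below is a std-only model, not Arrow's or DataFusion's actual code: `validate_batch` and `spilled_columns` are hypothetical names, and the scenario (a two-field input schema against a group column plus two state columns) is an assumed illustration of the column-count rule quoted in the error above.

```rust
/// Toy stand-in for the column-count check applied when a batch is built
/// against a schema. Hypothetical helper, not an Arrow API.
fn validate_batch(schema_fields: &[&str], columns: &[Vec<i64>]) -> Result<(), String> {
    if schema_fields.len() != columns.len() {
        return Err(format!(
            "number of columns({}) must match number of fields({}) in schema",
            columns.len(),
            schema_fields.len()
        ));
    }
    Ok(())
}

/// Assumed spilled state for `GROUP BY a` with one two-state aggregate:
/// one group column followed by two state columns (e.g. a sum and a count).
fn spilled_columns() -> Vec<Vec<i64>> {
    vec![vec![1, 2], vec![10, 20], vec![3, 4]]
}
```

Validating `spilled_columns()` against a two-field schema such as `["a", "b"]` fails with the message above, while a three-field schema listing the group column and both state columns passes.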

What changes are included in this PR?

  1. A new helper function, build_partial_agg_schema(), creates a partial schema by merging:
     • Group-by fields
     • Each aggregator’s internal “state fields”
  2. The aggregate operator is updated to use this partial schema when spilling or merging spilled data rather than the original (input) schema, which fixes the column mismatch error.
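The merge described in item 1 can be sketched as follows. This is a simplified, std-only illustration: the `Field` and `Aggregate` types, the function signature, and the state field names are assumptions made for the example and are not taken from the DataFusion source.

```rust
/// Minimal stand-in for a schema field (a name plus a type label).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

impl Field {
    fn new(name: &str, data_type: &str) -> Self {
        Field { name: name.to_string(), data_type: data_type.to_string() }
    }
}

/// Hypothetical aggregate description: only its intermediate state fields
/// matter for the partial schema.
struct Aggregate {
    state_fields: Vec<Field>,
}

/// Builds `[group_col_1, ..., state_col_1, ...]`: group-by fields first,
/// then every aggregate's state fields in declaration order.
fn build_partial_agg_schema(group_fields: &[Field], aggregates: &[Aggregate]) -> Vec<Field> {
    let mut fields = group_fields.to_vec();
    for agg in aggregates {
        fields.extend(agg.state_fields.iter().cloned());
    }
    fields
}
```

With one group column, a single-state aggregate, and a two-state aggregate, the resulting schema has four fields, which is the width the spilled batches would have in this model.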

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Physical Expressions label Jan 3, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Jan 3, 2025
@kosiew kosiew marked this pull request as ready for review January 3, 2025 06:43
@kosiew kosiew changed the title Refactor spill handling in GroupedHashAggregateStream to use partial … Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream Jan 3, 2025
Contributor

@2010YOUY01 2010YOUY01 left a comment

Thank you. I found the fix easy to follow 😄, and the change makes sense to me.

I have a suggestion to improve test coverage:
Since min/max only has one intermediate aggregate state (partial min/max), we should also test aggregate functions that produce more than one intermediate state, like avg (partial sum and count).
Duplicating the existing test and modifying one of the aggregate functions to avg should be sufficient.
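To illustrate why avg exercises the spill path differently from min/max, here is a small std-only sketch of a two-value intermediate state. `AvgState` and its methods are hypothetical names for this example and do not mirror DataFusion's accumulator API.

```rust
/// Intermediate state for an average: two values (sum and count),
/// whereas min/max keep a single value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AvgState {
    sum: f64,
    count: u64,
}

impl AvgState {
    /// Partial aggregation over one batch of input values.
    fn from_values(values: &[f64]) -> Self {
        AvgState { sum: values.iter().sum(), count: values.len() as u64 }
    }

    /// Merges two partial states, e.g. one of them read back from a spill file.
    fn merge(self, other: AvgState) -> AvgState {
        AvgState { sum: self.sum + other.sum, count: self.count + other.count }
    }

    /// Final value; `None` when no rows were seen.
    fn evaluate(self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

Because both `sum` and `count` must survive a spill and a later merge, a spill schema that reserves only one column per aggregate cannot represent this state.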


let result =
    common::collect(single_aggregate.execute(0, Arc::clone(&task_ctx))?).await?;

Contributor

@2010YOUY01 2010YOUY01 Jan 4, 2025

I suggest adding an assertion here to make sure spilling actually happened for certain test cases, like:

        let metrics = single_aggregate.metrics();
        // ...and assert some metrics inside like 'spill count' is > 0

Contributor Author

Thanks @2010YOUY01 for the review and suggestions.
I have implemented both.

@@ -2743,6 +2754,143 @@ mod tests {
         Ok(())
     }

+    // test for https://github.com/apache/datafusion/issues/13949
+    async fn run_test_with_spill_pool_if_necessary(
Contributor

I suppose it'll be better to move this test to other aggregate tests in datafusion/physical-plan/src/mod.rs

Contributor Author

@kosiew kosiew Jan 7, 2025

Contributor

My bad, yes, I meant aggregates/mod.rs

@@ -522,7 +527,7 @@ impl GroupedHashAggregateStream {
         let spill_state = SpillState {
             spills: vec![],
             spill_expr,
-            spill_schema: Arc::clone(&agg_schema),
+            spill_schema: partial_agg_schema,
Contributor

It seems like the issue was related only to AggregateMode::Single[Partitioned] cases, since for both Final and FinalPartitioned, there is a reassignment right before spilling (the new value is a schema for Partial output which is exactly group_by + state fields). Perhaps we can remove this reassignment now and rely on original spill_schema value set on stream creation (before removing it, we need to ensure that spill schema will be equal to intermediate result schema for any aggregation mode which supports spilling)?

Contributor Author

@kosiew kosiew Jan 6, 2025

Hi @korowa,

> remove this reassignment now

In other words, remove these lines, am I correct?

// Use input batch (Partial mode) schema for spilling because
// the spilled data will be merged and re-evaluated later.
self.spill_state.spill_schema = batch.schema();

Contributor

Yes, this line seems to be redundant now -- I'd expect all aggregation modes to have the same spill schema (which is set by this PR), so it shouldn't depend on stream input anymore.

Contributor Author

Thanks for confirming.
The lines are removed.

/// This helper function constructs such a schema:
/// `[group_col_1, group_col_2, ..., state_col_1, state_col_2, ...]`
/// so that partial aggregation data can be handled consistently.
fn build_partial_agg_schema(
Contributor

Perhaps instead of the new helper we could reuse aggregates::create_schema?

Contributor Author

I checked create_schema: it handles aggregates like MIN and MAX well, but it does not handle AVG, which has multiple intermediate states (partial sum, partial count).

Contributor

If I'm not mistaken, it should handle it when mode = AggregateMode::Partial: in that case it returns the state_fields instead of the result field.

Contributor Author

Aaa..... 🤔
Thanks for the pointer. It does work.
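The behaviour discussed in this thread can be modelled roughly as below. This is a std-only toy, not the real `aggregates::create_schema`: the signature, the `Aggregate` struct, the `spill_schema` helper, and the field names are assumptions for illustration. Only the mode names and the rule "Partial yields state fields, other modes yield the result field" come from the comments above.

```rust
/// Aggregation modes named in the discussion.
#[derive(Debug, Clone, Copy)]
enum AggregateMode {
    Partial,
    Final,
    FinalPartitioned,
    Single,
    SinglePartitioned,
}

/// Hypothetical aggregate: its final result column and its state columns.
struct Aggregate {
    result_field: &'static str,
    state_fields: Vec<&'static str>,
}

/// Toy model of a mode-dependent schema builder: group columns first, then
/// state fields for `Partial` or the single result field for any other mode.
fn create_schema(
    group_fields: &[&'static str],
    aggregates: &[Aggregate],
    mode: AggregateMode,
) -> Vec<&'static str> {
    let mut fields = group_fields.to_vec();
    for agg in aggregates {
        match mode {
            AggregateMode::Partial => fields.extend(agg.state_fields.iter().copied()),
            AggregateMode::Final
            | AggregateMode::FinalPartitioned
            | AggregateMode::Single
            | AggregateMode::SinglePartitioned => fields.push(agg.result_field),
        }
    }
    fields
}

/// In this model the spill schema is always the `Partial` schema,
/// whatever mode the stream itself runs in.
fn spill_schema(group_fields: &[&'static str], aggregates: &[Aggregate]) -> Vec<&'static str> {
    create_schema(group_fields, aggregates, AggregateMode::Partial)
}
```

Under these assumptions an avg aggregate contributes one column to the output schema of a `Single` stream but two columns to its spill schema, which is why the spill schema cannot be derived from the output or input schema alone.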

Contributor

@korowa korowa left a comment

LGTM, thank you @kosiew @2010YOUY01

Going to merge it tomorrow, in case anyone else would like to review it.

@@ -43,6 +38,10 @@ use crate::physical_plan::{
ExecutionPlan, SendableRecordBatchStream,
};
use crate::prelude::SessionContext;
use std::any::Any;
Contributor

minor: this import reordering can be reverted to leave the file unmodified

@korowa korowa merged commit 81b50c4 into apache:main Jan 8, 2025
25 checks passed
@alamb
Contributor

alamb commented Jan 8, 2025

❤️

@Friede80

Friede80 commented Jan 8, 2025

Thanks for the rapid fix, @kosiew!

Successfully merging this pull request may close these issues.

Schema error when spilling with multiple aggregations