Don't preserve functional dependency when generating UNION logical plan #12979

Merged
merged 1 commit into from
Oct 20, 2024

@Sevenannn Sevenannn commented Oct 17, 2024

Which issue does this PR close?

Closes #12980

Rationale for this change

When the DataFusion logical planner builds the AGGREGATE plan, it adds additional columns to group_expr based on the functional dependencies. However, for queries that aggregate over a table obtained through a UNION operation, the functional dependencies are still preserved in the schema of the UNION plan, even though they no longer hold after the UNION. This causes the wrong columns to be added as group-by columns in the aggregation plan.
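
To illustrate the symptom, here is a minimal std-only sketch that does not use DataFusion at all. It assumes a hypothetical table with columns (id, name, amount) produced by a UNION, plus a stale dependency id -> name carried over from one input. The `Row` type and both functions are made up for this illustration; they only mimic what grouping looks like once `name` is appended to the group keys.

```rust
use std::collections::BTreeMap;

/// A row of a hypothetical table with columns (id, name, amount).
type Row = (i32, &'static str, i64);

/// SUM(amount) GROUP BY id: what the user asked for.
fn sum_by_id(rows: &[Row]) -> BTreeMap<i32, i64> {
    let mut groups = BTreeMap::new();
    for &(id, _, amount) in rows {
        *groups.entry(id).or_insert(0) += amount;
    }
    groups
}

/// SUM(amount) GROUP BY id, name: what is effectively computed once `name`
/// is appended to the group keys because of a stale dependency id -> name.
fn sum_by_id_and_name(rows: &[Row]) -> BTreeMap<(i32, &'static str), i64> {
    let mut groups = BTreeMap::new();
    for &(id, name, amount) in rows {
        *groups.entry((id, name)).or_insert(0) += amount;
    }
    groups
}

fn main() {
    // Rows as they might look after a UNION: id 1 now has two different names,
    // so id -> name no longer holds.
    let rows: Vec<Row> = vec![(1, "a", 10), (2, "b", 20), (1, "c", 5)];

    println!("GROUP BY id:       {:?}", sum_by_id(&rows));
    println!("GROUP BY id, name: {:?}", sum_by_id_and_name(&rows));
}
```

Grouping by id alone yields one group per id, while the widened key splits id 1 into two groups, which is the "duplicated groups" symptom described in this PR.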

What changes are included in this PR?

  • Changes to eliminate functional dependency when building UNION logical plan
  • Unit test to verify the changes

Are these changes tested?

Yes, a unit test, test_aggregate_with_union, is added to verify the change.

Are there any user-facing changes?

Aggregations over UNION results no longer produce wrong results with duplicated groups.

…an (#44)

* Don't preserve functional dependency when generating UNION logical plan

* Remove extra lines
@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Oct 17, 2024
@Sevenannn Sevenannn marked this pull request as ready for review October 17, 2024 03:52
@sgrebnov

LGTM 👍

@berkaysynnada berkaysynnada left a comment

Thanks for the fix @Sevenannn. I have just one comment to consider. Otherwise, LGTM

datafusion/core/src/dataframe/mod.rs
let schema = (**left_plan.schema()).clone();
let schema =
    Arc::new(schema.with_functional_dependencies(FunctionalDependencies::empty())?);

Is clearing out all dependencies the right fix? Could we retain some if they do not harm?

I'll wait to merge this PR until tomorrow to give @Sevenannn a chance to respond

Hi @berkaysynnada, thanks for the review! I don’t think any FD persists when performing UNION on 2 tables.

A simple example: UNION a table t1 with another table t2 that has only one row. There always exists some data for t2 that breaks the FDs of t1 / t2 after the UNION.

In this case, clearing FDs would be the right fix since we don’t want FDs to get wrongly retained and affect later plans, e.g. aggregation.

Please let me know if you have any further questions regarding this PR, thanks!
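
The counter-example above can be sketched in a few lines of std-only Rust. This is an illustration under assumptions, not DataFusion code: the tables hold hypothetical (id, name) rows, `fd_holds` is a made-up helper that checks the dependency id -> name, and the union is modeled as plain concatenation (UNION ALL). Deduplicating as in a distinct UNION would not help here, since the conflicting rows are not duplicates.

```rust
use std::collections::HashMap;

/// Returns true if the functional dependency `key -> value` holds for `rows`,
/// i.e. no key is paired with two different values.
fn fd_holds(rows: &[(i32, &str)]) -> bool {
    let mut seen: HashMap<i32, &str> = HashMap::new();
    for &(key, value) in rows {
        match seen.insert(key, value) {
            // The same key was already seen with a different value.
            Some(previous) if previous != value => return false,
            _ => {}
        }
    }
    true
}

fn main() {
    // Hypothetical data: id -> name holds in each table on its own.
    let t1 = vec![(1, "a"), (2, "b")];
    let t2 = vec![(1, "c")]; // a single row is enough to break it

    // Model the union as concatenation of both inputs.
    let union: Vec<(i32, &str)> = t1.iter().chain(t2.iter()).copied().collect();

    println!("holds on t1:    {}", fd_holds(&t1));
    println!("holds on t2:    {}", fd_holds(&t2));
    println!("holds on union: {}", fd_holds(&union));
}
```

Because the planner cannot know the data in advance, no dependency of either input can be assumed to survive, which is why the schema of the UNION plan is built with empty functional dependencies.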

@Sevenannn Sevenannn requested a review from alamb October 18, 2024 23:58
@alamb alamb merged commit 8d4614d into apache:main Oct 20, 2024
25 checks passed

alamb commented Oct 20, 2024

Thanks again @Sevenannn and @sgrebnov and @berkaysynnada

wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Dec 6, 2024
…an (#44) (apache#12979)

* Don't preserve functional dependency when generating UNION logical plan

* Remove extra lines
alamb pushed a commit to influxdata/arrow-datafusion that referenced this pull request Dec 7, 2024
…an (#44) (apache#12979)

* Don't preserve functional dependency when generating UNION logical plan

* Remove extra lines
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Dec 12, 2024
…an (#44) (apache#12979)

* Don't preserve functional dependency when generating UNION logical plan

* Remove extra lines
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Development

Successfully merging this pull request may close these issues.

Functional dependency shouldn't be preserved in UNION logical plan
4 participants