Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise a plan error on union if column count is not the same between plans #13117

Merged
merged 1 commit into from
Oct 28, 2024

Conversation

Omega359
Copy link
Contributor

Which issue does this PR close?

Closes #13092

Rationale for this change

Raise the error earlier, restore the behaviour of previous versions of DF

What changes are included in this PR?

code, slt test.

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Oct 25, 2024
@Omega359 Omega359 marked this pull request as ready for review October 26, 2024 02:06
if left_plan.schema().fields().len() != right_plan.schema().fields().len() {
return plan_err!(
"UNION queries have different number of columns: \
left has {} columns whereas right has {} columns",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good check to have. 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I agree -- it is an easy check to add that can catch clearly incorrect plans (though as @findepi notes it won't catch all issues)

@findepi
Copy link
Member

findepi commented Oct 26, 2024

This is good check to have. 👍

Obviously having same column count doesn't guarantee compatibility. The types need to be coercible, so normally the column count check should go together with the type check.
Reading rationale for #11961 it seems that we have two+ phases:

  1. "loose" Logical Plans where types may not match
    • this is apparently the phase what plan builder operates at, because it directly supports dataframe frontend, I believe
    • then we do "type coercion optimizer" (which is not really an optimizer)
  2. "strict" Logical Plans where types are expected to match

The consequence of this is that

  • it's unclear what guarantees are provided at which phase, because phases do not stand out in the code
  • we can still "type coercion optimizer" and other coercion related functions (like expr.get_type) on strict Logical Plans which cannot benefit from them

I would want to see these addressed, that's the idea of #12723.

Maybe we should have a form of validator of logical plan well-formedness (union branches match, the types match exactly) and the check would go there.

The downside of such approach (and of current state as well) is that, when you have code taking Logical Plan UNION, you can't really assume much (things may or may not have been checked), so the code needs to be very defensively written.
Wonder whether we can leverage Rust's compiler type system to help us with this.

cc @alamb @ozankabak @jonahgao

@alamb
Copy link
Contributor

alamb commented Oct 26, 2024

Maybe we should have a form of validator of logical plan well-formedness (union branches match, the types match exactly) and the check would go there.

I think this is what the the type coercion analyzer pass already does.

If the goal is to provide users errors sooner (when DataFrame is created) rather than later (when DataFrame is run) some other ideas:

  1. Add a function to DataFrame (DataFrame::optimize, DataFrame::validate, DataFrame::check that runs)
  2. Add some configuration / option that runs coercion logic automatically on each DataFrame call

@alamb
Copy link
Contributor

alamb commented Oct 26, 2024

Reading rationale for #11961 it seems that we have two+ phases:

I agree with this analysis and conclusion

type coercion optimizer" (which is not really an optimizer)

I agree it is not an optimizer -- it an Analyzer datafusion::optimizer::analyzer::type_coercion::TypeCoercion

I think the analyzer docs explain this: https://docs.rs/datafusion/latest/datafusion/optimizer/trait.AnalyzerRule.html

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359 and @findepi

cc @jonahgao

if left_plan.schema().fields().len() != right_plan.schema().fields().len() {
return plan_err!(
"UNION queries have different number of columns: \
left has {} columns whereas right has {} columns",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I agree -- it is an easy check to add that can catch clearly incorrect plans (though as @findepi notes it won't catch all issues)

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359 . Raising error earlier makes sense to me

@jonahgao jonahgao merged commit e22d231 into apache:main Oct 28, 2024
24 checks passed
@Omega359 Omega359 deleted the feature/union_plan_error_13092 branch October 29, 2024 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datafusion 42 does not raise plan error when UNION branches have different number of columns
4 participants