Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions #5570

Merged
merged 3 commits into from
Mar 16, 2023

Conversation

mingmwang
Copy link
Contributor

@mingmwang mingmwang commented Mar 13, 2023

Which issue does this PR close?

Related to #2361, #5463.

Rationale for this change

What changes are included in this PR?

This PR add an Analyzer phase to DataFusion. It also adds some basic validation logic to Subquery plans and expressions.
It does not check all the unsupported cases. This PR just come up with a framework to do the necessary checking:

  1. The allowed outer plans to use SubQuery expressions
  2. The allowed inner plans inside SubQuery plan
  3. The supported cases/positions to refer the correlated(outer) columns inside the inner plans/inner expressions.

In the current DataFusion, there are other gaps to correctly identify the out reference columns in the SqlToRel process.
Those gaps will be addressed in other PRs.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Mar 13, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mingmwang -- I think this PR's structure looks great.

Do you plan to move the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/type_coercion.rs logic into the analyzer as well?

I had some suggestions on how to reduce replication which I think would be nice but not strictly required.

use datafusion_expr::LogicalPlan;

/// a Trait for marking tree node types that are rewritable
pub trait TreeNodeRewritable: Clone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to largely be the same as https://github.com/apache/arrow-datafusion/blob/146a949218ec970784974137277cde3b4e547d0a/datafusion/physical-expr/src/rewrite.rs#L25 but they are both generic,

Would it be possible to move the trait into datafusion_common and only keep the impls in logical and physical plan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past, I had tried to move the Trait to a common place but failed due to Rust's orphan rule limitations.
Either the trait or the type for which we are implementing the trait must be defined in the same crate. The major problem is the Arc struct is from the std crate.

impl TreeNodeRewritable for Arc<dyn ExecutionPlan>

impl TreeNodeRewritable for Arc<dyn PhysicalExpr>

@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

cc @jackwener @ygf11 @liukun4515

@mingmwang
Copy link
Contributor Author

Thanks @mingmwang -- I think this PR's structure looks great.

Do you plan to move the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/type_coercion.rs logic into the analyzer as well?

I had some suggestions on how to reduce replication which I think would be nice but not strictly required.

Sure, I can work on this in future. But recently I plan to address the subquery related gaps first.

#5483

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A great job🚀


/// `AnalyzerRule` transforms the unresolved ['LogicalPlan']s and unresolved ['Expr']s into
/// the resolved form.
pub trait AnalyzerRule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should add a new dir for analyzer.
Because we can pull some analyzer rule into it.

})
}

fn check_subquery_expr(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the future, we also can move type_coercion into analyzer

@Jefffrey
Copy link
Contributor

Jefffrey commented Mar 14, 2023

Will these new analyzer phase rules be visible in output of explain commands? If not, then worth a follow-on ticket to track it

@mingmwang
Copy link
Contributor Author

Will these new analyzer phase rules be visible in output of explain commands? If not, then worth a follow-on ticket to track it

Just log a ticket to track the effort.
#5598

@alamb
Copy link
Contributor

alamb commented Mar 15, 2023

I think this is ready to go -- any concerns about merging @mingmwang ?

@mingmwang
Copy link
Contributor Author

I think this is ready to go -- any concerns about merging @mingmwang ?

Sure, please help to merge this.

@jackwener jackwener merged commit 258af4b into apache:main Mar 16, 2023
@andygrove andygrove added the enhancement New feature or request label Apr 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate enhancement New feature or request optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants