-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for EXPLAIN ANALYZE #858
Conversation
@@ -213,6 +213,16 @@ pub enum LogicalPlan { | |||
/// The output schema of the explain (2 columns of text) | |||
schema: DFSchemaRef, | |||
}, | |||
/// Runs the actual plan, and then prints the physical plan with | |||
/// with execution metrics. | |||
Analyze { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: I chose a new LogicalPlan node because the implementation for ANALYZE is so different than EXPLAIN. However, it would be possible to re-use the same LogicalPlan node if people prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am about to ask when I saw the above code that has analyze as different function. I am worry about future inconsistency and headache of keeping them consistent, as well as redundant work when we change or improve something. I would prefer to keep them in the same LogicalPlan
97b5928
to
c7d1d4a
Compare
c7d1d4a
to
3ff7ffd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the fact you built this infrastructure before adding new coulters. It is easy to understand and review. My only concern is the analyze is implemented in different path of the explain which seems also in different path of the actual plan. I am not sure how tricky to keep them in the same path but if we can, it will help us keep the results consistent, avoid redundancy and headache for future work.
.and_then(|plan| plan.build()) | ||
.map_err(BallistaError::DataFusionError)?; | ||
|
||
roundtrip_test!(plan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we talk I need to learn the the effectiveness of this round trip test thats convert a logical/physical plan into photo and back. The tests look simple and easy to understand this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I simply copy/pasted the test for roundtrip_explain
below -- I agree the pattern is quite nice
@@ -213,6 +213,16 @@ pub enum LogicalPlan { | |||
/// The output schema of the explain (2 columns of text) | |||
schema: DFSchemaRef, | |||
}, | |||
/// Runs the actual plan, and then prints the physical plan with | |||
/// with execution metrics. | |||
Analyze { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am about to ask when I saw the above code that has analyze as different function. I am worry about future inconsistency and headache of keeping them consistent, as well as redundant work when we change or improve something. I would prefer to keep them in the same LogicalPlan
@NGA-TRAN I also went back and forth on this point. The existing And thus even though there is definitely some redundancy, I eventually concluded that a new The physical plan ( |
I agree that a new variant makes sense here, for the reasons @alamb enumerated. Also, pretty awesome PR! 💯 |
3ff7ffd
to
87caa91
Compare
Which issue does this PR close?
Resolves #779
Rationale for this change
EXPLAIN PLAN
is great to understand what DataFusion plans to do, but it is hard today using the SQL or dataframe interface to understand in more depth what actually happened during execution.My real usecase is being able to see how many rows flowed through each operator as well as the "time spent" and "rows produced" by each operator, and this PR is a step in that direction.
What changes are included in this PR?
EXPLAIN ANALYZE
and EXPLAIN ANALYZE VERBOSE` (example below) sqlParquetStrream
intoRecordBatchReceiverStream
for reuseAre there any user-facing changes?
Yes,
EXPLAIN ANALYZE
now does something different thanEXPLAIN
Example of use
Run CLI
Example
EXPLAIN ANALYZE
outputExample
EXPLAIN ANALYZE VERBOSE
outputFuture work:
Note this PR is designed just to hook up / plumb the existing code and metrics we have into SQL (basically what got added in #662). I plan a sequence of follow on PRs to both improve the metrics infrastructure #679 and add/fix the metrics that are actually reported so they are consistent. The specific metrics that are displayed are verbose and somewhat ad hoc at the moment.