
Add support for EXPLAIN ANALYZE #858

Merged · 2 commits · Aug 12, 2021

Conversation

@alamb (Contributor) commented Aug 11, 2021:

Which issue does this PR close?

Resolves #779

Rationale for this change

EXPLAIN is great for understanding what DataFusion plans to do, but today it is hard, using either the SQL or the DataFrame interface, to understand in more depth what actually happened during execution.

My real use case is being able to see how many rows flowed through each operator, as well as the "time spent" and "rows produced" by each operator, and this PR is a step in that direction.

What changes are included in this PR?

  1. Add basic plan nodes for `EXPLAIN ANALYZE` and `EXPLAIN ANALYZE VERBOSE` SQL (examples below)
  2. Refactor the special-case ParquetStream into RecordBatchReceiverStream for reuse

Are there any user-facing changes?

Yes, EXPLAIN ANALYZE now does something different from EXPLAIN

Example of use

echo "1,A" > /tmp/foo.csv
echo "1,B" >> /tmp/foo.csv
echo "2,A" >> /tmp/foo.csv

Run the CLI

cargo run --bin datafusion-cli
CREATE EXTERNAL TABLE foo(x INT, b VARCHAR) STORED AS CSV LOCATION '/tmp/foo.csv';

Example EXPLAIN ANALYZE output

> EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                                        |
|                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                               |
|                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                       |
|                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                             |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[sendTime=839560, fetchTime=122528525, repartitionTime=5327877] |
|                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                          |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[fetchTime=5660489, repartitionTime=0, sendTime=8012]                              |
|                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                            |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 0.012 seconds.
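The timing metrics above (sendTime, fetchTime, repartitionTime) are printed as raw integers; their magnitudes suggest nanoseconds, though as the PR notes below, the reported metrics are still ad hoc. As a throwaway sketch (not a DataFusion API), one could pull the `metrics=[...]` list out of a plan line and convert the presumed-nanosecond timings to milliseconds:

```rust
use std::collections::HashMap;

/// Parse the `metrics=[k=v, ...]` suffix printed by EXPLAIN ANALYZE into a map.
/// Assumes every value is an unsigned integer, as in the output above.
fn parse_metrics(line: &str) -> HashMap<String, u64> {
    let mut out = HashMap::new();
    if let Some(start) = line.find("metrics=[") {
        let inner = &line[start + "metrics=[".len()..];
        if let Some(end) = inner.find(']') {
            for pair in inner[..end].split(',') {
                if let Some((k, v)) = pair.trim().split_once('=') {
                    if let Ok(n) = v.trim().parse::<u64>() {
                        out.insert(k.trim().to_string(), n);
                    }
                }
            }
        }
    }
    out
}

fn main() {
    let line = "RepartitionExec: partitioning=Hash(...), \
                metrics=[sendTime=839560, fetchTime=122528525, repartitionTime=5327877]";
    let m = parse_metrics(line);
    // Interpreting the values as nanoseconds (an assumption): divide by 1e6 for ms.
    println!("fetchTime = {:.1} ms", m["fetchTime"] as f64 / 1e6);
}
```

Note that a per-partition metric summed over 16 partitions can legitimately exceed the wall-clock query time, which is consistent with the numbers shown.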

Example EXPLAIN ANALYZE VERBOSE output

> EXPLAIN ANALYZE VERBOSE SELECT SUM(x) FROM foo GROUP BY b;
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                                        |
|                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                               |
|                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                       |
|                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                             |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[repartitionTime=6584110, fetchTime=132927514, sendTime=904001] |
|                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                          |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[repartitionTime=0, sendTime=8246, fetchTime=6239096]                              |
|                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                            |
| Output Rows       | 2                                                                                                                                                         |
| Duration          | 10.283764ms                                                                                                                                               |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set. Query took 0.014 seconds.

Future work:

Note this PR is designed just to hook up / plumb the existing code and metrics we have into SQL (basically what got added in #662). I plan a sequence of follow-on PRs to both improve the metrics infrastructure (#679) and add/fix the metrics that are actually reported so they are consistent. The specific metrics displayed are verbose and somewhat ad hoc at the moment.

@alamb added the `api change` label (Changes the API exposed to users of the crate) on Aug 11, 2021
@github-actions bot added the `datafusion` (Changes in the datafusion crate) and `sql` (SQL Planner) labels on Aug 11, 2021
@@ -213,6 +213,16 @@ pub enum LogicalPlan {
/// The output schema of the explain (2 columns of text)
schema: DFSchemaRef,
},
/// Runs the actual plan, and then prints the physical plan with
/// execution metrics.
Analyze {
Comment from @alamb (author):
NOTE: I chose a new LogicalPlan node because the implementation for ANALYZE is so different than EXPLAIN. However, it would be possible to re-use the same LogicalPlan node if people prefer
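To make the trade-off concrete, here is a hypothetical, trimmed-down sketch of the two sibling variants (field names and the `DFSchemaRef` stand-in are illustrative, not the exact definitions merged in this PR):

```rust
use std::sync::Arc;

// Illustrative stand-in for DataFusion's DFSchemaRef (which wraps a DFSchema).
type DFSchemaRef = Arc<Vec<String>>;

enum LogicalPlan {
    EmptyRelation,
    // Explain is special-cased during planning so it can capture the
    // stringified output of intermediate optimizer passes.
    Explain {
        verbose: bool,
        stringified_plans: Vec<String>,
        schema: DFSchemaRef,
    },
    // Analyze runs the input plan to completion, then renders the physical
    // plan annotated with the metrics collected during execution. It never
    // needs the intermediate strings, hence the separate variant.
    Analyze {
        verbose: bool,
        input: Box<LogicalPlan>,
        schema: DFSchemaRef,
    },
}

fn describe(plan: &LogicalPlan) -> &'static str {
    match plan {
        LogicalPlan::EmptyRelation => "empty",
        LogicalPlan::Explain { .. } => "explain: plans captured during planning",
        LogicalPlan::Analyze { .. } => "analyze: metrics captured during execution",
    }
}

fn main() {
    let schema: DFSchemaRef = Arc::new(vec!["plan_type".into(), "plan".into()]);
    let plan = LogicalPlan::Analyze {
        verbose: false,
        input: Box::new(LogicalPlan::EmptyRelation),
        schema,
    };
    println!("{}", describe(&plan));
}
```

With separate variants, code paths that special-case Explain never have to check whether they are actually handling an Analyze.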

Reply from @NGA-TRAN (Contributor):

I was about to ask about this when I saw the code above that implements analyze as a different function. I am worried about future inconsistency and the headache of keeping them consistent, as well as redundant work when we change or improve something. I would prefer to keep them in the same LogicalPlan.

@alamb force-pushed the alamn/analyze_explain branch from 97b5928 to c7d1d4a on August 11, 2021 19:28
@alamb force-pushed the alamn/analyze_explain branch from c7d1d4a to 3ff7ffd on August 11, 2021 19:30
@alamb marked this pull request as ready for review on August 11, 2021 19:30
@NGA-TRAN (Contributor) left a review comment:

I like the fact that you built this infrastructure before adding new counters; it is easy to understand and review. My only concern is that analyze is implemented along a different path from explain, which in turn seems to differ from the path of the actual plan. I am not sure how tricky it would be to keep them on the same path, but if we can, it will help us keep the results consistent and avoid redundancy and headaches in future work.

.and_then(|plan| plan.build())
.map_err(BallistaError::DataFusionError)?;

roundtrip_test!(plan);
Comment from @NGA-TRAN (Contributor):

When we talk, I need to learn about the effectiveness of this round-trip test that converts a logical/physical plan into proto and back. The tests look simple and easy to understand this way.

Reply from @alamb (author):

To be honest I simply copy/pasted the test for roundtrip_explain below -- I agree the pattern is quite nice
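The round-trip pattern being discussed can be sketched generically. This is a hypothetical miniature, not Ballista's actual protobuf code: encode a plan-like value to bytes, decode it back, and assert the result equals the input.

```rust
// Hand-rolled trivial codec standing in for Ballista's proto serialization.
#[derive(Debug, PartialEq, Clone)]
struct MiniPlan {
    op: String,
    verbose: bool,
}

fn encode(p: &MiniPlan) -> Vec<u8> {
    // One flag byte followed by the UTF-8 bytes of the operator name.
    let mut buf = vec![p.verbose as u8];
    buf.extend_from_slice(p.op.as_bytes());
    buf
}

fn decode(buf: &[u8]) -> MiniPlan {
    MiniPlan {
        verbose: buf[0] != 0,
        op: String::from_utf8(buf[1..].to_vec()).unwrap(),
    }
}

// The essence of roundtrip_test!: decode(encode(plan)) must equal plan.
fn roundtrip(p: &MiniPlan) -> bool {
    decode(&encode(p)) == *p
}

fn main() {
    let plan = MiniPlan { op: "Analyze".to_string(), verbose: true };
    assert!(roundtrip(&plan));
    println!("roundtrip ok");
}
```

The appeal of the pattern is that each new plan node only needs one line in the test, while equality checking exercises the whole encode/decode path.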


@alamb (author) commented Aug 12, 2021:

> I was about to ask about this when I saw the code above that implements analyze as a different function. I am worried about future inconsistency and the headache of keeping them consistent, as well as redundant work when we change or improve something. I would prefer to keep them in the same LogicalPlan.

@NGA-TRAN I also went back and forth on this point. The existing LogicalPlan::Explain is special-cased several times during planning (so it can capture the results of intermediate passes as strings), and since those intermediate strings aren't used by Analyze, we would then have to do an extra check in each special case.

And thus even though there is definitely some redundancy, I eventually concluded that a new LogicalPlan type made things most clear.

The physical plan (ExecutionPlan) for Analyze is also very different, but it would be feasible to use different physical plans for the same logical plan.

@jorgecarleitao (Member) commented:
I agree that a new variant makes sense here, for the reasons @alamb enumerated.

Also, pretty awesome PR! 💯

Successfully merging this pull request may close these issues.

Implement EXPLAIN ANALYZE