add Statistics in the display of ExecutionPlan / physical_plan format #7254

Closed
liukun4515 opened this issue Aug 10, 2023 · 2 comments · Fixed by #7459
Labels
enhancement New feature or request

Comments

@liukun4515
Contributor

liukun4515 commented Aug 10, 2023

Is your feature request related to a problem or challenge?

Currently, when I run EXPLAIN, I get a physical plan like this:

❯ explain select test1.id,test2.int_col from test1 join test2 on test1.id = test2.bigint_col;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                            |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: test1.id, test2.int_col                                                                                                                             |
|               |   Inner Join: CAST(test1.id AS Int64) = test2.bigint_col                                                                                                        |
|               |     TableScan: test1 projection=[id]                                                                                                                            |
|               |     TableScan: test2 projection=[int_col, bigint_col]                                                                                                           |
| physical_plan | ProjectionExec: expr=[id@0 as id, int_col@1 as int_col]                                                                                                         |
|               |   ProjectionExec: expr=[id@0 as id, int_col@2 as int_col, bigint_col@3 as bigint_col]                                                                           |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                                                                                 |
|               |       HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(CAST(test1.id AS Int64)@1, bigint_col@1)]                                                           |
|               |         ProjectionExec: expr=[id@0 as id, CAST(id@0 AS Int64) as CAST(test1.id AS Int64)]                                                                       |
|               |           ParquetExec: file_groups={1 group: [[Users/kliu3/Documents/arrow-ballista/target/debug/alltypes_plain.parquet]]}, projection=[id]                |
|               |         ParquetExec: file_groups={1 group: [[Users/kliu3/Documents/arrow-ballista/target/debug/alltypes_plain.parquet]]}, projection=[int_col, bigint_col] |
|               |                                                                                                                                                                 |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.136 seconds.

But the physical plan is missing the Statistics data.

Can we add the Statistics in the physical plan format?

The Statistics struct is:

pub struct Statistics {
    /// The number of table rows
    pub num_rows: Option<usize>,
    /// total bytes of the table rows
    pub total_byte_size: Option<usize>,
    /// Statistics on a column level
    pub column_statistics: Option<Vec<ColumnStatistics>>,
    /// If true, any field that is `Some(..)` is the actual value in the data provided by the operator (it is not
    /// an estimate). Any or all other fields might still be None, in which case no information is known.
    /// if false, any field that is `Some(..)` may contain an inexact estimate and may not be the actual value.
    pub is_exact: bool,
}

We can just print num_rows, total_byte_size, and is_exact, and ignore column_statistics.
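As a rough illustration of that idea, here is a minimal sketch. It uses a local stand-in for the Statistics struct (with column_statistics left out), and the `statistics=[...]` text layout and the `fmt_opt` helper are assumptions made for this example, not DataFusion's actual output format or API.

```rust
use std::fmt;

/// Local stand-in for the `Statistics` struct quoted above.
/// `column_statistics` is omitted because the proposal ignores it.
#[derive(Debug, Clone, Default)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub is_exact: bool,
}

/// Hypothetical helper: renders `Some(n)` as the number and `None` as "None".
fn fmt_opt(value: Option<usize>) -> String {
    match value {
        Some(n) => n.to_string(),
        None => "None".to_string(),
    }
}

impl fmt::Display for Statistics {
    /// Writes the three summary fields on one line (layout is an assumption).
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "statistics=[rows={}, bytes={}, exact={}]",
            fmt_opt(self.num_rows),
            fmt_opt(self.total_byte_size),
            self.is_exact
        )
    }
}
```

With this in place, `Statistics { num_rows: Some(8), total_byte_size: None, is_exact: true }.to_string()` yields a short suffix that could be appended to an operator's line in the plan.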

Describe the solution you'd like

Append the Statistics to the output produced through the DisplayAs trait.

cc @alamb @jackwener

Describe alternatives you've considered

No response

Additional context

This will change the expected output of many test cases that check the plan.

@liukun4515 liukun4515 added the enhancement New feature or request label Aug 10, 2023
@liukun4515
Contributor Author

cc @mingmwang @Ted-Jiang

@alamb
Contributor

alamb commented Aug 10, 2023

we can just log the num_rows, total_byte_size, is_exact and ignore the column_statistics

I think showing the physical statistics in the explain output sounds like a good idea.

Perhaps we could show the statistics only in EXPLAIN VERBOSE mode (that is, when DisplayFormatType::Verbose is passed to https://docs.rs/datafusion/latest/datafusion/physical_plan/display/trait.DisplayAs.html)?
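A minimal sketch of what gating on the verbose mode could look like. Everything here is a self-contained stand-in: `DisplayFormatType` and `DisplayAs` are simplified local mirrors of the DataFusion items named above, and `DemoExec`, `Shown`, and the `statistics=[...]` layout are hypothetical names invented for this example.

```rust
use std::fmt;

/// Simplified local mirror of DataFusion's `DisplayFormatType`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DisplayFormatType {
    Default,
    Verbose,
}

/// Simplified local mirror of the `DisplayAs` trait.
pub trait DisplayAs {
    fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter<'_>) -> fmt::Result;
}

/// Hypothetical operator that carries its own summary statistics.
pub struct DemoExec {
    pub target_batch_size: usize,
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub is_exact: bool,
}

impl DisplayAs for DemoExec {
    fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The usual one-line description is always written.
        write!(f, "DemoExec: target_batch_size={}", self.target_batch_size)?;
        // Statistics are appended only when verbose output is requested.
        if t == DisplayFormatType::Verbose {
            write!(
                f,
                ", statistics=[rows={:?}, bytes={:?}, exact={}]",
                self.num_rows, self.total_byte_size, self.is_exact
            )?;
        }
        Ok(())
    }
}

/// Hypothetical adapter so a `DisplayAs` value can be used with `to_string()`.
pub struct Shown<'a, T: DisplayAs>(pub &'a T, pub DisplayFormatType);

impl<T: DisplayAs> fmt::Display for Shown<'_, T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.0.fmt_as(self.1, f)
    }
}
```

In this sketch, `Shown(&exec, DisplayFormatType::Default).to_string()` keeps the existing one-line form, so plan output in the default mode (and the tests that assert on it) would stay unchanged, while the verbose variant carries the extra statistics suffix.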

@alamb alamb changed the title add Statistics in the physical format add Statistics in the display of ExecutionPlan / physical_plan format Aug 10, 2023