Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "baseline" metrics to all built in operators #866

Closed
alamb opened this issue Aug 12, 2021 · 5 comments · Fixed by #1018
Closed

Add "baseline" metrics to all built in operators #866

alamb opened this issue Aug 12, 2021 · 5 comments · Fixed by #1018
Assignees
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Aug 12, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to be able get an overall understanding of where time is being spent during query execution via EXPLAIN ANALYZE (see #858) so that I know where to focuse additional performance optimization activities

Additionally, I would like to be able to graph a stacked flamechart such as the following see more details on https://github.com/influxdata/influxdb_iox/issues/2273) that shows when the different operators ran in relation to each other.

Screen Shot 2021-08-12 at 11 14 33 AM

Describe the solution you'd like
I would like to instrument all operators (impl ExecutionPlan) included in DataFusion so that they produce at least the following metrics:

  1. output_rows: total rows produced at the output of the operator
  2. cpu_nanos: the total time spent (not including any time spent in the input stream or waiting to be scheduled)
  3. start_time: the wall clock time at which execute was run
  4. stop_time: the wall clock time at which the last output record batch was produced

I plan to use the SQLMetric infrastructure for doing so, probably after #679

Describe alternatives you've considered
Open questions:

  1. Handling the output of different partitions (each operator can produce multiple output partitions / streams, and it is not yet clear to me if recording stats on a per partition level is important)
  2. How to handle operators that don't provide metrics such as potentially user defined ones (probably will fill in with their parents)

Additional context

Related work:

@alamb alamb added enhancement New feature or request datafusion Changes in the datafusion crate labels Aug 12, 2021
@NGA-TRAN
Copy link
Contributor

If possible, can we have number of output streams/partitions per operator and their corresponding output rows, too? I am not sure if they are captured in repartition or not. If IIRC, the repartitioning only happens with certain setups.

@alamb
Copy link
Contributor Author

alamb commented Aug 12, 2021

If possible, can we have number of output streams/partitions per operator and their corresponding output rows, too? I am not sure if they are captured in repartition or not. If IIRC, the repartitioning only happens with certain setups.

One thing perhaps we could do is to capture the statistics for each output partition and then add some way to aggregate them together. I think @andygrove suggested something like this on #679 (comment) though in the context of aggregating for distributed queries

@andygrove
Copy link
Member

This looks really cool. It would be nice to have the option to show the individual partitions stacked as smaller lines below each operator. It would make sense to collect stats per partition and then aggregate them for reporting.

@houqp
Copy link
Member

houqp commented Aug 13, 2021

How to handle operators that don't provide metrics such as potentially user defined ones (probably will fill in with their parents)

I think there is also the case where user defined operators are the parents and we need to generate metrics recursively for their children too for end to end tracing.

The list of metrics look like stuff that can be automatically generated across all operators if we can force all operator implementations to follow a certain convention. For example, instead of calling execute method directly, have all operator call a wrapper execute method that handles the builtin metrics collection. That way, it just becomes something all operators will get out of the box in a consistent way.

@alamb
Copy link
Contributor Author

alamb commented Aug 13, 2021

instead of calling execute method directly, have all operator call a wrapper execute method that handles the builtin metrics collection. That way, it just becomes something all operators will out of the box in a consistent way.

Thanks @houqp -- using a wrapper is a good (great!) idea -- I will think about how this might look like and give it a shot, likely after #679

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
4 participants