
Tracking pipeline and task execution time as well as controller time #164

Closed
tanner-bruce opened this issue Oct 17, 2018 · 11 comments

Labels
design: This task is about creating and discussing a design
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tanner-bruce

tanner-bruce commented Oct 17, 2018

Expected Behavior

As a developer of build-pipeline, being able to see an entire trace of someone's pipeline execution, as well as the execution context through the pipeline controller, would help diagnose performance issues and give visibility into what is going on.

It should be possible to query for metrics around:

  • How long Tasks of a particular type are taking
  • How long Steps of a particular Task are taking
  • Number of TaskRuns and PipelineRuns being created (tied to the corresponding Task/Pipeline)

The mechanism we use to collect metrics should not hold state globally, i.e. it should be possible to configure the metrics-collecting mechanism in one section of the code without affecting other sections (the one currently used in knative/pkg is global, e.g. see this code).
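To illustrate the non-global requirement, here is a minimal sketch (purely hypothetical, not Tekton's actual implementation) of a recorder that owns its measures and is constructed and injected explicitly, using OpenCensus stats; the package, type, measure name, and bucket boundaries are assumptions.

```go
// Hypothetical sketch: a recorder that holds its own measures instead of
// relying on package-level globals, so each controller can configure metrics
// independently. (View registration still goes to OpenCensus's default registry.)
package recorder

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Recorder owns its measures; nothing is registered at package init time.
type Recorder struct {
	taskRunDuration *stats.Float64Measure
}

// New creates the measure and registers its view; callers pass the Recorder to
// whichever reconciler should report the metric.
func New() (*Recorder, error) {
	r := &Recorder{
		taskRunDuration: stats.Float64(
			"taskrun_duration_seconds",
			"Time taken by a TaskRun from start to completion",
			"s"),
	}
	err := view.Register(&view.View{
		Measure:     r.taskRunDuration,
		Aggregation: view.Distribution(10, 30, 60, 300, 900, 3600),
	})
	return r, err
}

// RecordTaskRunDuration would be called by the reconciler once a TaskRun completes.
func (r *Recorder) RecordTaskRunDuration(ctx context.Context, d time.Duration) {
	stats.Record(ctx, r.taskRunDuration.M(d.Seconds()))
}
```

A reconciler would then hold a *Recorder field and call RecordTaskRunDuration when it observes a completed TaskRun, rather than reaching for a package-level variable.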

Actual Behavior

Some of this data is available via startedAt and finishedAt in the status fields; however, it is not queryable, and we would like more data points.

Additional Info

Having this information would be extremely valuable as it would quickly highlight points of interest in the pipeline controller and give visibility into the entire system.

@tejal29 tejal29 added the design label Oct 17, 2018
@tanner-bruce
Author

After speaking on Slack with @aaron-prindle, we briefly touched on the possibility of building this in via the same process that will be used for overriding the entrypoint to gather the build logs.

@bobcatfish bobcatfish added the help wanted label Oct 18, 2018
@bobcatfish
Collaborator

@tanner-bruce can you give some examples of what kind of info you'd want to see exactly, and when you'd use it?

@bobcatfish
Collaborator

I'm guessing this is more that we'd want to be monitoring these metrics as the person administering the deployment of the Pipeline CRD, is that right?

@tanner-bruce
Author

tanner-bruce commented Oct 19, 2018

The main one for me, as a user, is tracking build times over days and months, as well as various test times. It would also be kind of slick if, during the TaskRuns themselves, we could be passed the trace context and add spans to it from there.

As a cluster operator, it could also be useful to see timings at a global level, to help determine, for example, whether we need to add another node to the pool.

At the same time, having the controller emit metrics could be useful for things like checking the number of jobs running, for example to spot trends in how often people/teams are running builds. Depending on how the Pipeline CRD evolves, if PipelineRuns were queued at some point, being able to check and alert on the depth of the queue would be very useful.

Part of this request also stems from some frustrations with Concourse, where there isn't much visibility into the system itself. With the Pipeline CRD utilizing Kubernetes for much of the heavy lifting, I think the utility of this isn't quite as great, but it is still useful in my opinion.

Having startedAt and finishedAt is useful, but being able to visualize and query them would be much more useful from an operator's standpoint.
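To make the trace-context idea above concrete, here is a purely illustrative sketch (not an agreed design) of a step binary continuing a trace whose context was handed to it, using OpenCensus tracing; the TRACE_CONTEXT environment variable, its base64-encoded binary format, and the span name are all assumptions.

```go
// Purely illustrative: a step binary that continues a trace handed to it by the
// controller. The TRACE_CONTEXT variable and its encoding are assumptions for
// this sketch, not an agreed-upon interface. A trace exporter (registered via
// trace.RegisterExporter) would also be needed for the spans to go anywhere.
package main

import (
	"context"
	"encoding/base64"
	"log"
	"os"

	"go.opencensus.io/trace"
	"go.opencensus.io/trace/propagation"
)

func main() {
	ctx := context.Background()

	// Recover the parent span context passed in by the controller, if any.
	var parent trace.SpanContext
	hasParent := false
	if enc := os.Getenv("TRACE_CONTEXT"); enc != "" {
		if raw, err := base64.StdEncoding.DecodeString(enc); err == nil {
			parent, hasParent = propagation.FromBinary(raw)
		}
	}

	// Start a span that is either a child of the controller's span or a new root.
	var span *trace.Span
	if hasParent {
		ctx, span = trace.StartSpanWithRemoteParent(ctx, "taskrun-step", parent)
	} else {
		ctx, span = trace.StartSpan(ctx, "taskrun-step")
	}
	defer span.End()

	// ... run the actual step here, creating child spans from ctx as needed ...
	_ = ctx
	log.Println("step finished")
}
```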

@bobcatfish
Collaborator

@tanner-bruce makes sense, thanks for the detailed explanation! I added some requirements to the description, feel free to change these and/or add to them if that's not quite right.

@bobcatfish
Collaborator

Just a note that when we get here, it would be great if we could switch to a metrics collector that isn't global - e.g. see the problem this causes in #211

@ghost

ghost commented Sep 19, 2019

We had our first meeting regarding observability, specifically metrics, today and work is now underway. There are a couple of other issues that overlap in theme with this one. I am linking them together here for us to review later and figure out which to keep and which to close.

Related issues:
#164
#540
#855

Metrics Design Doc

Notes from the initial metrics meeting

hrishin added a commit to hrishin/tekton-pipeline that referenced this issue Oct 7, 2019
Often, as a developer or administrator (ops), I want some insight into
pipeline behavior in terms of the time taken to execute a PipelineRun/TaskRun,
its success or failure ratio, pod latencies, etc.
At present, Tekton Pipelines has very limited ways to surface such information,
and it is hard to get those details by looking at resource YAMLs.

This patch exposes the above-mentioned pipeline metrics on the '/metrics'
endpoint using the knative `pkg/metrics` package. Users can collect these
metrics using Prometheus, Stackdriver, or another supported metrics system.

To some extent this solves
 - tektoncd#540
 - tektoncd#164
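For context on the approach described in this commit message, here is a minimal sketch (an assumption of the shape of the change, not the patch itself) of recording a PipelineRun duration through knative's `pkg/metrics` helper, tagged by pipeline name so it can be queried per pipeline once Prometheus scrapes the '/metrics' endpoint; the measure, tag, and bucket names are hypothetical.

```go
// Sketch only: record a PipelineRun duration via knative's pkg/metrics, tagged
// with the pipeline name. Names and buckets are hypothetical.
package pipelinerunmetrics

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
	"knative.dev/pkg/metrics"
)

var (
	prDuration = stats.Float64(
		"pipelinerun_duration_seconds",
		"Time taken by a PipelineRun from start to completion",
		"s")
	pipelineKey = tag.MustNewKey("pipeline")
)

func init() {
	// The registered view is what the /metrics endpoint ultimately exposes.
	if err := view.Register(&view.View{
		Measure:     prDuration,
		TagKeys:     []tag.Key{pipelineKey},
		Aggregation: view.Distribution(10, 30, 60, 300, 900, 3600),
	}); err != nil {
		panic(err)
	}
}

// ReportPipelineRunDuration would be called by the reconciler when a PipelineRun finishes.
func ReportPipelineRunDuration(ctx context.Context, pipeline string, d time.Duration) {
	ctx, _ = tag.New(ctx, tag.Insert(pipelineKey, pipeline))
	metrics.Record(ctx, prDuration.M(d.Seconds()))
}
```

With Prometheus configured to scrape the controller's '/metrics' port, the resulting histogram can then be queried per pipeline name.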
@hrishin hrishin mentioned this issue Oct 7, 2019
tekton-robot pushed a commit that referenced this issue Oct 17, 2019
Often, as a developer or administrator (ops), I want some insight into
pipeline behavior in terms of the time taken to execute a PipelineRun/TaskRun,
its success or failure ratio, pod latencies, etc.
At present, Tekton Pipelines has very limited ways to surface such information,
and it is hard to get those details by looking at resource YAMLs.

This patch exposes the above-mentioned pipeline metrics on the '/metrics'
endpoint using the knative `pkg/metrics` package. Users can collect these
metrics using Prometheus, Stackdriver, or another supported metrics system.

To some extent this solves
 - #540
 - #164
@afrittoli
Member

afrittoli commented May 29, 2020

There is functionality already in knative/pkg that would help track reconciler stats.

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten label Aug 14, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pradeepitm12 pushed a commit to pradeepitm12/pipeline that referenced this issue Jan 28, 2021
The EventListener was referring to the Binding via `name` instead of `ref`.
Also, run the getting-started examples as part of the e2e YAML tests. While
this won't catch all issues with the examples, it should catch obvious syntax
issues like this one.

Fixes tektoncd#639
Fixes tektoncd#164

Signed-off-by: Dibyo Mukherjee <[email protected]>