
Tracking pipeline and task execution time as well as controller time #164

Closed
tanner-bruce opened this issue Oct 17, 2018 · 11 comments

Labels
design: This task is about creating and discussing a design
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tanner-bruce

tanner-bruce commented Oct 17, 2018

Expected Behavior

As a developer of build-pipeline, being able to see an entire trace of someone's pipeline execution, as well as the execution context through the pipeline controller, would help diagnose performance issues and give visibility into what is going on.

It should be possible to query for metrics around:

  • How long Tasks of a particular type are taking
  • How long Steps of a particular Task are taking
  • Number of TaskRuns and PipelineRuns being created (tied to the corresponding Task/Pipeline)

The mechanism we use to collect metrics should not hold state globally, i.e. it should be possible to configure the metrics-collecting mechanism in one section of the code without affecting other sections (the one currently used in knative/pkg is global, e.g. see this code).
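To illustrate the non-global requirement, here is a minimal sketch (purely hypothetical, not Tekton's actual implementation) of a recorder that owns its measures and is constructed and injected explicitly, using OpenCensus stats; the package, type, measure name, and bucket boundaries are assumptions.

```go
// Hypothetical sketch: a recorder that holds its own measures instead of
// relying on package-level globals, so each controller can configure metrics
// independently. (View registration still goes to OpenCensus's default registry.)
package recorder

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Recorder owns its measures; nothing is registered at package init time.
type Recorder struct {
	taskRunDuration *stats.Float64Measure
}

// New creates the measure and registers its view; callers pass the Recorder to
// whichever reconciler should report the metric.
func New() (*Recorder, error) {
	r := &Recorder{
		taskRunDuration: stats.Float64(
			"taskrun_duration_seconds",
			"Time taken by a TaskRun from start to completion",
			"s"),
	}
	err := view.Register(&view.View{
		Measure:     r.taskRunDuration,
		Aggregation: view.Distribution(10, 30, 60, 300, 900, 3600),
	})
	return r, err
}

// RecordTaskRunDuration would be called by the reconciler once a TaskRun completes.
func (r *Recorder) RecordTaskRunDuration(ctx context.Context, d time.Duration) {
	stats.Record(ctx, r.taskRunDuration.M(d.Seconds()))
}
```

A reconciler would then hold a *Recorder field and call RecordTaskRunDuration when it observes a completed TaskRun, rather than reaching for a package-level variable.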

Actual Behavior

Some of this data is available via startedAt and finishedAt in the status fields; however, it is not queryable, and we would like more data points.

Additional Info

Having this information would be extremely valuable as it would quickly highlight points of interest in the pipeline controller and give visibility into the entire system.

@tejal29 tejal29 added the design label Oct 17, 2018
@tanner-bruce
Author

After speaking on Slack with @aaron-prindle, we briefly touched on the possibility of building this in via the same process that will be used for overriding the entrypoint to gather the build logs.

@bobcatfish bobcatfish added the help wanted label Oct 18, 2018
@bobcatfish
Collaborator

@tanner-bruce can you give some examples of what kind of info you'd want to see exactly, and when you'd use it?

@bobcatfish
Collaborator

I'm guessing this is more that we'd want to be monitoring these metrics as the person administering the deployment of the Pipeline CRD, is that right?

@tanner-bruce
Author

tanner-bruce commented Oct 19, 2018

The main one for me, as a user, is tracking build times over days and months, as well as various test times. It would also be kind of slick if, during the TaskRuns themselves, we could be passed the trace context and add spans to it from there.

As a cluster operator, it could also be useful to see timings at a global level, to help determine, for example, whether we need to add another node to the pool.

At the same time, having the controller emit metrics could be useful for things like checking the number of jobs running, for example to spot trends in how often people/teams are running builds. Depending on how the Pipeline CRD evolves, if PipelineRuns were queued at some point, being able to check and alert on the depth of the queue would be very useful.

Part of this request also stems from some frustrations with Concourse, where there isn't much visibility into the system itself. With the Pipeline CRD utilizing Kubernetes for much of the heavy lifting, I think the utility of this isn't quite as great, but it is still useful in my opinion.

Having startedAt and finishedAt is useful, but being able to visualize and query them would be much more useful from an operator's standpoint.
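To make the trace-context idea above concrete, here is a purely illustrative sketch (not an agreed design) of a step binary continuing a trace whose context was handed to it, using OpenCensus tracing; the TRACE_CONTEXT environment variable, its base64-encoded binary format, and the span name are all assumptions.

```go
// Purely illustrative: a step binary that continues a trace handed to it by the
// controller. The TRACE_CONTEXT variable and its encoding are assumptions for
// this sketch, not an agreed-upon interface. A trace exporter (registered via
// trace.RegisterExporter) would also be needed for the spans to go anywhere.
package main

import (
	"context"
	"encoding/base64"
	"log"
	"os"

	"go.opencensus.io/trace"
	"go.opencensus.io/trace/propagation"
)

func main() {
	ctx := context.Background()

	// Recover the parent span context passed in by the controller, if any.
	var parent trace.SpanContext
	hasParent := false
	if enc := os.Getenv("TRACE_CONTEXT"); enc != "" {
		if raw, err := base64.StdEncoding.DecodeString(enc); err == nil {
			parent, hasParent = propagation.FromBinary(raw)
		}
	}

	// Start a span that is either a child of the controller's span or a new root.
	var span *trace.Span
	if hasParent {
		ctx, span = trace.StartSpanWithRemoteParent(ctx, "taskrun-step", parent)
	} else {
		ctx, span = trace.StartSpan(ctx, "taskrun-step")
	}
	defer span.End()

	// ... run the actual step here, creating child spans from ctx as needed ...
	_ = ctx
	log.Println("step finished")
}
```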

@bobcatfish
Collaborator

@tanner-bruce makes sense, thanks for the detailed explanation! I added some requirements to the description, feel free to change these and/or add to them if that's not quite right.

@bobcatfish
Collaborator

Just a note that when we get here, it would be great if we could switch to a metrics collector that isn't global - e.g. see the problem this causes in #211

@ghost

ghost commented Sep 19, 2019

We had our first meeting regarding observability, specifically metrics, today and work is now underway. There are a couple of other issues that overlap in theme with this one. I am linking them together here for us to review later and figure out which to keep and which to close.

Related issues:
#164
#540
#855

Metrics Design Doc

Notes from the initial metrics meeting

hrishin added a commit to hrishin/tekton-pipeline that referenced this issue Oct 7, 2019
Often, as a developer or administrator (ops), I want some insight into
pipeline behavior in terms of the time taken to execute a PipelineRun/TaskRun,
its success or failure ratio, pod latencies, etc.
At present, Tekton Pipelines has very limited ways to surface such information,
and it is hard to get those details by looking at resource YAMLs.

This patch exposes the above-mentioned pipeline metrics on the '/metrics'
endpoint using the knative `pkg/metrics` package. Users can collect these
metrics using Prometheus, Stackdriver, or another supported metrics system.

To some extent this solves
 - tektoncd#540
 - tektoncd#164
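For context on the approach described in this commit message, here is a minimal sketch (an assumption of the shape of the change, not the patch itself) of recording a PipelineRun duration through knative's `pkg/metrics` helper, tagged by pipeline name so it can be queried per pipeline once Prometheus scrapes the '/metrics' endpoint; the measure, tag, and bucket names are hypothetical.

```go
// Sketch only: record a PipelineRun duration via knative's pkg/metrics, tagged
// with the pipeline name. Names and buckets are hypothetical.
package pipelinerunmetrics

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
	"knative.dev/pkg/metrics"
)

var (
	prDuration = stats.Float64(
		"pipelinerun_duration_seconds",
		"Time taken by a PipelineRun from start to completion",
		"s")
	pipelineKey = tag.MustNewKey("pipeline")
)

func init() {
	// The registered view is what the /metrics endpoint ultimately exposes.
	if err := view.Register(&view.View{
		Measure:     prDuration,
		TagKeys:     []tag.Key{pipelineKey},
		Aggregation: view.Distribution(10, 30, 60, 300, 900, 3600),
	}); err != nil {
		panic(err)
	}
}

// ReportPipelineRunDuration would be called by the reconciler when a PipelineRun finishes.
func ReportPipelineRunDuration(ctx context.Context, pipeline string, d time.Duration) {
	ctx, _ = tag.New(ctx, tag.Insert(pipelineKey, pipeline))
	metrics.Record(ctx, prDuration.M(d.Seconds()))
}
```

With Prometheus configured to scrape the controller's '/metrics' port, the resulting histogram can then be queried per pipeline name.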
@hrishin hrishin mentioned this issue Oct 7, 2019
tekton-robot pushed a commit that referenced this issue Oct 17, 2019
Often, as a developer or administrator (ops), I want some insight into
pipeline behavior in terms of the time taken to execute a PipelineRun/TaskRun,
its success or failure ratio, pod latencies, etc.
At present, Tekton Pipelines has very limited ways to surface such information,
and it is hard to get those details by looking at resource YAMLs.

This patch exposes the above-mentioned pipeline metrics on the '/metrics'
endpoint using the knative `pkg/metrics` package. Users can collect these
metrics using Prometheus, Stackdriver, or another supported metrics system.

To some extent this solves
 - #540
 - #164
@afrittoli
Member

afrittoli commented May 29, 2020

There is functionality already in knative/pkg that would help track reconciler stats.

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten label Aug 14, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pradeepitm12 pushed a commit to pradeepitm12/pipeline that referenced this issue Jan 28, 2021
The EventListener was referring to the Binding via `name` instead of `ref`.
Also, run the getting-started examples as part of the e2e YAML tests. While
this won't catch all issues with the examples, it should catch obvious syntax
issues like this one.

Fixes tektoncd#639
Fixes tektoncd#164

Signed-off-by: Dibyo Mukherjee <[email protected]>