Tracking pipeline and task execution time as well as controller time #164
Comments
After speaking in Slack with @aaron-prindle, we briefly touched on the possibility of building this in via the same process that will be used for overriding the entrypoint for gathering the build logs.
@tanner-bruce can you give some examples of what kind of info you'd want to see exactly, and when you'd use it?
I'm guessing this is something more like we want to be monitoring these metrics as the person administering the deployment of the Pipeline CRD, is that right?
The main one for me, as a user, is tracking build times over days and months, as well as various test times. It would also be kind of slick if, during the TaskRuns themselves, we could be passed the trace context and add spans to it from there. As a cluster operator, it could also be useful to see the timings at a global level, to help determine things like whether we need to add another node to the pool. Having the controller emit metrics could also be useful for checking the number of jobs running, for example to see trends in how often people/teams are running builds. Depending on how the Pipeline CRD evolves, if PipelineRuns were queued at some point, being able to check and alert on the depth of the queue would be very useful. Part of this request also stems from some frustrations with Concourse, where there isn't much visibility into the system itself; with the Pipeline CRD using Kubernetes for much of the heavy lifting, the utility isn't quite as great, but it is still useful in my opinion. Having startedAt and finishedAt is useful, but being able to visualize and query them would be much more useful from an operator's standpoint.
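To make the trace-context idea above concrete, here is a minimal sketch of what controller-side spans and context propagation could look like, assuming OpenCensus tracing; the function name, span names, and the propagation mechanism are illustrative assumptions, not existing Tekton code:

```go
package sample

import (
	"context"
	"encoding/base64"

	"go.opencensus.io/trace"
	"go.opencensus.io/trace/propagation"
)

// reconcilePipelineRun is a hypothetical reconcile step that wraps its work
// in a span and serializes the span context so that a TaskRun pod could
// attach child spans to the same trace.
func reconcilePipelineRun(ctx context.Context, name string) error {
	ctx, span := trace.StartSpan(ctx, "pipelinerun/reconcile")
	defer span.End()
	span.AddAttributes(trace.StringAttribute("pipelinerun", name))

	// Serialize the span context; it could then be handed to the TaskRun pod,
	// e.g. via an annotation or environment variable (illustrative only).
	wireContext := base64.StdEncoding.EncodeToString(propagation.Binary(span.SpanContext()))
	_ = wireContext

	// ... create TaskRuns, update status, etc., passing ctx to child calls.
	return nil
}
```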
@tanner-bruce makes sense, thanks for the detailed explanation! I added some requirements to the description, feel free to change these and/or add to them if that's not quite right.
Just a note that when we get here, it would be great if we could switch to a metrics collector that isn't global - e.g. see the problem this causes in #211
We had our first meeting regarding observability, specifically metrics, today and work is now underway. There are a couple of other issues that overlap in theme with this one. I am linking them together here for us to review later and figure out which to keep and which to close.
Often, as a developer or administrator (ops), I want some insight into pipeline behavior: the time taken to execute a PipelineRun/TaskRun, its success or failure ratio, pod latencies, etc. At present Tekton Pipelines has very limited ways to surface such information, and it is hard to get those details by looking at resource YAMLs. This patch exposes the above-mentioned pipeline metrics on a '/metrics' endpoint using the knative `pkg/metrics` package. Users can collect such metrics using Prometheus, Stackdriver, or another supported metrics system. To some extent it solves - tektoncd#540 - tektoncd#164
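As a rough illustration of the approach in the commit message above, a sketch like the following records a TaskRun duration and exposes it on a `/metrics` endpoint. It uses plain OpenCensus (which knative's `pkg/metrics` wraps) together with the Prometheus exporter; the metric name, tag, port, and bucket boundaries are assumptions for the example, not the names Tekton actually shipped:

```go
package sample

import (
	"context"
	"log"
	"net/http"
	"time"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// Measure for TaskRun duration; the name and unit are illustrative.
	taskRunDuration = stats.Float64("taskrun_duration_seconds",
		"Time taken by a TaskRun from start to completion", "s")
	statusKey = tag.MustNewKey("status")
)

// setupMetrics registers a histogram view and serves it to Prometheus.
func setupMetrics() {
	if err := view.Register(&view.View{
		Name:        "taskrun_duration_seconds",
		Description: "Distribution of TaskRun durations",
		Measure:     taskRunDuration,
		Aggregation: view.Distribution(10, 30, 60, 300, 900, 3600),
		TagKeys:     []tag.Key{statusKey},
	}); err != nil {
		log.Fatalf("failed to register view: %v", err)
	}

	exporter, err := prometheus.NewExporter(prometheus.Options{Namespace: "tekton"})
	if err != nil {
		log.Fatalf("failed to create Prometheus exporter: %v", err)
	}
	view.RegisterExporter(exporter)

	// Expose everything collected above on a '/metrics' endpoint.
	http.Handle("/metrics", exporter)
	go func() { log.Fatal(http.ListenAndServe(":9090", nil)) }()
}

// recordTaskRunDuration would be called by the reconciler once a TaskRun
// reaches a terminal state.
func recordTaskRunDuration(d time.Duration, status string) {
	ctx, err := tag.New(context.Background(), tag.Upsert(statusKey, status))
	if err != nil {
		return
	}
	stats.Record(ctx, taskRunDuration.M(d.Seconds()))
}
```

With something like this in place, Prometheus can scrape the controller, and operators can graph or alert on build durations over days and months, which is the use case described earlier in the thread.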
There is functionality already in knative/pkg that would help to track reconciler stats.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue in response to the /close command above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The EventListener was referring to the Binding via `name` instead of `ref`. Also, run the getting-started examples as part of the e2e YAML tests. While this won't catch all issues with the examples, it should catch obvious syntax issues like this one. Fixes tektoncd#639 Fixes tektoncd#164 Signed-off-by: Dibyo Mukherjee <[email protected]>
Expected Behavior
As a developer of build-pipeline, being able to see an entire trace of someone's pipeline execution, as well as the execution context through the pipeline controller, would help diagnose performance issues and give visibility into what is going on.
It should be possible to query for metrics around:
The mechanism that we use to collect metrics should not hold state globally, i.e. it should be possible to configure the metrics collecting mechanism in one section of the code without affecting other sections (the one currently used in knative/pkg is global, e.g. see this code).
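To illustrate the non-global requirement, a minimal sketch of a reconciler that has its metrics reporter injected, rather than reading package-level state, might look like this (the interface and method names are hypothetical, not an existing Tekton or knative API):

```go
package sample

import "time"

// StatsReporter abstracts the metrics backend so the reconciler never
// touches a global collector; each controller constructs its own instance
// and tests can inject a fake. All names here are hypothetical.
type StatsReporter interface {
	ReportTaskRunDuration(d time.Duration, status string)
	ReportRunningTaskRuns(count int)
}

// Reconciler receives its reporter via constructor injection, so metrics
// configuration stays local to one section of the code.
type Reconciler struct {
	metrics StatsReporter
}

func NewReconciler(metrics StatsReporter) *Reconciler {
	return &Reconciler{metrics: metrics}
}

// onTaskRunFinished is where the reconciler would report timing once a
// TaskRun reaches a terminal state.
func (r *Reconciler) onTaskRunFinished(start, end time.Time, status string) {
	r.metrics.ReportTaskRunDuration(end.Sub(start), status)
}
```

A test could then pass a fake StatsReporter and assert on the reported durations, avoiding the cross-test interference described in #211.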
Actual Behavior
Some of this data is available via `startedAt` and `finishedAt` in the status fields, however it is not queryable, and we would like more data points.
Additional Info
Having this information would be extremely valuable as it would quickly highlight points of interest in the pipeline controller and give visibility into the entire system.