add taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints #6744
Conversation
pkg/pipelinerunmetrics/metrics.go
Outdated
case pod.ReasonExceededResourceQuota:
	prsThrottledByQuota++
	// only count once per pipelinerun
	exitForLoop = true
Can we use `break` instead of `exitForLoop`?
My recollection was that `break` only exited one for loop, and not both.
The intent was to avoid bumping the counter for a pipelinerun multiple times because multiple taskruns are throttled; hence I want to leave the `for _, pr := range prs` for loop as well.
Given that context, are you OK with leaving this as is, @khrm? Or do you think I'm missing something?
Wait, if what I'm saying is the case, I'm not breaking out of the last loop.
Yeah, let me revisit @khrm .... either I break out of the last loop as well, or I just break here.
I'll update after I have fully revisited - thanks.
OK @khrm - while I did not add a break exactly here as you mentioned, your flagging this made me realize what I had previously pushed was not quite what I intended, which is:
- if a pipelinerun is throttled on both resource quota and node constraints across its underlying taskruns, bump the count once each for resource quota and node constraints
- if a pipelinerun has multiple taskruns throttled by either resource quota or node constraints, don't bump the counts for each taskrun, but just once for the pipelinerun
Aside from the code changes (essentially two booleans instead of one for tracking this), I augmented the unit tests to create more than one taskrun per pipelinerun.
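(For readers following along, the counting rule described above boils down to something like the sketch below. The `pod.Reason*` constants are the ones used in the diff; the helper name and the idea of pre-collecting each PipelineRun's TaskRun waiting reasons are illustrative assumptions, not the PR's actual code.)

```go
// countThrottledOnce reports whether a single PipelineRun should be counted as
// throttled by quota and/or by node constraints, given the waiting reasons of
// its TaskRuns. Each reason contributes at most once, no matter how many
// TaskRuns share it.
func countThrottledOnce(taskRunReasons []string) (byQuota, byNode bool) {
	for _, reason := range taskRunReasons {
		switch reason {
		case pod.ReasonExceededResourceQuota:
			byQuota = true
		case pod.ReasonExceededNodeResources:
			byNode = true
		}
		if byQuota && byNode {
			break // both reasons already seen; no need to look at more TaskRuns
		}
	}
	return byQuota, byNode
}
```

The caller would then bump `prsThrottledByQuota` / `prsThrottledByNode` at most once per PipelineRun based on the two booleans.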
PTAL
Yes, the current logic looks good to me.
pkg/pipelinerunmetrics/metrics.go
Outdated
case pod.ReasonExceededNodeResources:
	prsThrottledByNode++
	// only count once per pipelinerun
	exitForLoop = true
Same here: can we use `break` instead of `exitForLoop`?
/approve
/assign @vdemeester
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: khrm, vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Need to rebase this with the latest? @gabemontero
Yeah, I saw tide citing that earlier today @khrm, but now it just says it needs the lgtm label.
docs/metrics.md
Outdated
| `tekton_pipelines_controller_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `status`=<status> <br> `*task`=<task_name> <br> `*taskrun`=<taskrun_name><br> `namespace`=<pipelineruns-taskruns-namespace> | experimental |
| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=<status> | experimental |
| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | | experimental |
should this be `quota_count`?
yep will update in next push - thanks
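(Presumably the corrected row will read: | `tekton_pipelines_controller_running_taskruns_throttled_by_quota_count` | Gauge | | experimental |)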
pkg/pipelinerunmetrics/metrics.go
Outdated
if err != nil {
	return fmt.Errorf("failed to list pipelineruns while generating metrics : %w", err)
}

var runningPRs int
var prsThrottledByQuota int
var prsThrottledByNode int
for _, pr := range prs {
	if !pr.IsDone() {
can this if statement be removed?
If, in conjunction, I move the `runningPRs++` to after my

if pr.IsDone() {
	continue
}

block, then yes. In hindsight, I was leaning toward preserving the existing code path where possible.
I'll update in the next push, thanks.
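(Concretely, the suggested restructuring would look roughly like the sketch below; the function wrapper is only there to keep the fragment self-contained and is not the PR's actual code.)

```go
// countRunning skips completed PipelineRuns with an early continue, so the loop
// body no longer needs the `if !pr.IsDone()` wrapper.
func countRunning(prs []*v1beta1.PipelineRun) int {
	runningPRs := 0
	for _, pr := range prs {
		if pr.IsDone() {
			continue
		}
		runningPRs++
		// per-PipelineRun throttling checks would follow here
	}
	return runningPRs
}
```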
assuming we keep the pipelinerun version of the metric re: our discussion starting at https://github.com/tektoncd/pipeline/pull/6744/files#r1213456583
pkg/taskrunmetrics/metrics_test.go
Outdated
ctx, _ := ttesting.SetupFakeContext(t)
informer := faketaskruninformer.Get(ctx)
// Add N randomly-named TaskRuns with differently-succeeded statuses.
for _, tr := range []*v1beta1.TaskRun{
would it be possible to update this test to use table-driven testing instead?
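(For context, the table-driven shape being suggested would look something like the sketch below, assuming the usual `testing` and `pod` package imports; the test name, fields, and the elided setup are illustrative and would mirror the existing TestRecordRunningTaskRunsCount.)

```go
func TestRunningTaskRunsThrottledGauges(t *testing.T) {
	for _, tc := range []struct {
		name      string
		reason    string // waiting reason surfaced for the TaskRun's pod
		wantQuota float64
		wantNode  float64
	}{
		{name: "not throttled", reason: ""},
		{name: "throttled by quota", reason: pod.ReasonExceededResourceQuota, wantQuota: 1},
		{name: "throttled by node", reason: pod.ReasonExceededNodeResources, wantNode: 1},
	} {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			// Build a running TaskRun carrying tc.reason, seed the fake informer,
			// call RunningTaskRuns, and assert the two gauges equal tc.wantQuota
			// and tc.wantNode (setup elided).
		})
	}
}
```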
pkg/taskrunmetrics/metrics.go
Outdated
@@ -344,17 +369,33 @@ func (r *Recorder) RunningTaskRuns(ctx context.Context, lister listers.TaskRunLi
}

var runningTrs int
var trsThrottledByQuota int
var trsThrottledByNode int
for _, pr := range trs {
	if !pr.IsDone() {
can this if statement be removed?
yep similar to the other pipelinerun discussion thread
for _, tr := range trs {
	// NOTE: while GetIndexer().Add(obj) works well in these unit tests for List(), for
	// trLister.TaskRuns(trNamespace).Get(trName) to work correctly we need to use GetStore() instead.
	// see AddToInformer in test/controller.go for an analogous solution
why not use SeedTestData or AddToInformer from test/controller.go instead? (here and in the taskrun tests)
Yeah, I was trying to stay as close as possible to the patterns previously established in this test, like what I saw in TestRecordRunningPipelineRunsCount. But sure, I can tinker with, say, using SeedTestData in my new test here and see where that lands.
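(As a rough sketch of what the GetStore() route in the NOTE above amounts to, assuming the usual testing and client-go cache imports; the helper name is made up, and `cache.Store` is the standard client-go store interface returned by `informer.Informer().GetStore()`.)

```go
// seedTaskRuns adds TaskRuns directly to the informer's store so that both
// List() and Lister().TaskRuns(ns).Get(name) can see them in the test.
func seedTaskRuns(t *testing.T, store cache.Store, trs []*v1beta1.TaskRun) {
	t.Helper()
	for _, tr := range trs {
		if err := store.Add(tr); err != nil {
			t.Fatal(err)
		}
	}
}

// usage: seedTaskRuns(t, informer.Informer().GetStore(), trs)
```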
reasons := []string{"", pod.ReasonExceededResourceQuota, pod.ReasonExceededNodeResources}

// Add N randomly-named PipelineRuns with differently-succeeded statuses.
for _, tr := range []*v1beta1.PipelineRun{
could this use table-driven testing instead?
similar rationale to https://github.com/tektoncd/pipeline/pull/6744/files#r1214733673 but sure I'll see about factoring that in as well.
pkg/pipelinerunmetrics/metrics.go
Outdated
@@ -59,6 +61,16 @@ var (
	"Number of pipelineruns executing currently",
	stats.UnitDimensionless)
runningPRsCountView *view.View

runningPRsThrottledByQuotaCount = stats.Float64("running_pipelineruns_throttled_by_quota_count",
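(For anyone unfamiliar with how these measures become gauges: registration follows the usual OpenCensus/knative pattern, roughly as below, assuming the go.opencensus.io/stats, go.opencensus.io/stats/view, and knative.dev/pkg/metrics imports already used in this file. The helper name is illustrative and the PR's actual view wiring may differ.)

```go
// registerThrottleGauge registers a LastValue view for the measure, giving it
// gauge semantics: the exporter reports the most recently recorded value.
func registerThrottleGauge(m *stats.Float64Measure) error {
	return view.Register(&view.View{
		Description: m.Description(),
		Measure:     m,
		Aggregation: view.LastValue(),
	})
}

// At report time, roughly:
//   metrics.Record(ctx, runningPRsThrottledByQuotaCount.M(float64(prsThrottledByQuota)))
```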
I'm just curious, how useful is it to have separate pipelinerun metrics for this when they are derived from the taskruns?
Fair question. Yeah, in theory, any of the existing total count/running count/duration metrics could be derived from their taskrun children as well, even if their current implementations do not need to venture into pr.Status.ChildrenReferences.
Argument for: it is an aggregation convenience, especially with dynamically generated pipelinerun names
Argument against: perhaps you can sort out aggregation, regardless of dynamically generated pipelinerun names ?? ... not sure I am 100% confident of the answer there in the context of how knative exports metrics to things like prometheus
If you feel strongly that it is redundant @lbernick I'm fine with pruning this PR to just the taskrun based metric, at least for the initial drop, and then we can get some real world feedback.
WDYT?
My intuition is to stay consistent w/ existing pod related metrics (we provide tekton_pipelines_controller_taskruns_pod_latency but not tekton_pipelines_controller_pipelineruns_pod_latency) and in favor of a smaller more focused PR; however if you feel it's helpful/convenient to have this at the pipelineRun level that's fine too! especially considering it's an experimental metric, we could always decide to remove it later if it's not helpful.
no that comparison with pod latency resonates @lbernick
I'll remove the pipelinerun related metric
thanks
Thanks for your PR @gabemontero! Would you mind including a commit message/PR description with a quick summary of the changes and why you're making them?
…rce quotas or k8s live node constraints

This commit adds new experimental gauge metrics that count the number of TaskRuns whose underlying Pods are currently not scheduled to run by Kubernetes:
- one metric counts when Kubernetes ResourceQuota policies within a Namespace prevent scheduling
- a second metric counts when underlying Node level CPU or Memory utilization is such that the underlying Pod cannot be scheduled
OK @lbernick I believe I've pushed changes that address all the in-line code discussion, and I've attempted to address your commit message / PR description point ^^, though I'm not 100% sure I covered everything you might have expected with the description / summary, so please advise if I missed a nuance there. PTAL and thanks :-)
/lgtm
hey @vdemeester - WDYT about cherry picking back to 0.47.x?
I think it does make sense, yes. @lbernick @afrittoli @jerop SGTY?
I'm totally fine with backporting this commit; I just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes. /cherry-pick release-v0.47.x
@lbernick: new pull request created: #6774
Yep, agreed @lbernick - the relatively small size of the change in the end was the only rationale that led me to even ask. Thanks for agreeing.
I think metrics on
Back when implementing #6744 for #6631, we failed to realize that, with k8s quota policies being namespace scoped, knowing which namespace the throttled items were in could have some diagnostic value. Now that we have been using the metric for a bit, this realization is very apparent. This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and a new one with a shorter name was added; this change only updates the non-deprecated metric with the new label. Lastly, the default behavior is preserved, and the new label is only used when explicitly enabled in the observability config map.
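(Mechanically, attaching such a namespace label with OpenCensus looks roughly like the sketch below; the tag key name, the measure variable, and the function name are illustrative assumptions, and the gating on the observability config map is omitted.)

```go
var namespaceTag = tag.MustNewKey("namespace") // illustrative key name

// recordThrottledByQuota records the throttled-TaskRuns gauge with a namespace
// label attached, so per-namespace quota throttling becomes visible.
func recordThrottledByQuota(ctx context.Context, ns string, count int) error {
	ctx, err := tag.New(ctx, tag.Insert(namespaceTag, ns))
	if err != nil {
		return err
	}
	metrics.Record(ctx, trsThrottledByQuotaMeasure.M(float64(count))) // measure variable assumed
	return nil
}
```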
Changes
Fixes #6631
/kind feature
This commit adds new experimental gauge metrics that count the number of TaskRuns whose underlying Pods are currently not scheduled to run by Kubernetes:
- one metric counts when Kubernetes ResourceQuota policies within a Namespace prevent scheduling
- a second metric counts when underlying Node level CPU or Memory utilization is such that the underlying Pod cannot be scheduled
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep

Release Notes
@vdemeester @khrm ptal