-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics for endpoint and endpoint slice state #1919
Conversation
Hi @sawsa307. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @swetharepakula |
dd692aa
to
11f5710
Compare
/ok-to-test |
11f5710
to
767e13b
Compare
767e13b
to
a74b700
Compare
0ac3c6b
to
f769999
Compare
f769999
to
e206616
Compare
9f133fd
to
e36d9fe
Compare
for state, count := range epsStateCount { | ||
label := string(state) | ||
if state != negtypes.Total { | ||
label = fmt.Sprintf("Contains%sEndpoint", string(label)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do this. We can use the labels as they are currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
pkg/neg/metrics/endpoint_metrics.go
Outdated
|
||
syncerEndpointSliceStateLabels = []string{ | ||
"endpoint_slice_state", // state of endpoint slice | ||
"calculation_type", // type of endpoint calculation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
pkg/neg/syncers/utils.go
Outdated
// pod is used for label propagation | ||
_, getPodErr := getEndpointPod(endpointAddress, podLister) | ||
klog.Errorf("Detected unexpected error when getting zone: %v", getZoneErr) | ||
if returnErr == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't a good pattern to have of storing the error we want to return. I also don't think we gain much from trying to save this error because we will immediately have another sync right after with degraded mode that will give us these metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Now we return the error immediately.
pkg/neg/syncers/utils.go
Outdated
localEPCount[negtypes.PodLabelMismatch] += validatePodStat.podLabelMismatch | ||
} | ||
if getPodErr != nil || validateErr != nil || getZoneErr != nil || checkIPErr != nil || otherError != nil { | ||
localEPCount[negtypes.InvalidField] += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is otherError being used for? Initially I intended for InvalidField when the number of error classifications were less, but we are having more detail, I don't think we need to track by InvalidField
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used as a catch-all. Since we no longer need to track invalidField, it is removed.
pkg/neg/syncers/utils.go
Outdated
klog.Errorf("Endpoint %q in Endpoints %s/%s correponds to an invalid pod: %v, skipping", endpointAddress.Addresses, ed.Meta.Namespace, ed.Meta.Name, validateErr) | ||
localEPCount[negtypes.PodTerminal] += validatePodStat.podTerminal | ||
localEPCount[negtypes.NodeNotFound] += validatePodStat.nodeNotFound | ||
localEPCount[negtypes.NodeTypeAssertionFailed] += validatePodStat.nodeTypeAssertionFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets count this as an other error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks!
pkg/neg/syncers/utils.go
Outdated
klog.Errorf("Endpoint %q in Endpoints %s/%s receives error when getting pod: %v, skipping", endpointAddress.Addresses, ed.Meta.Namespace, ed.Meta.Name, getPodErr) | ||
localEPCount[negtypes.PodMissing] += getPodStat.podMissing | ||
localEPCount[negtypes.PodNotFound] += getPodStat.podNotFound | ||
localEPCount[negtypes.PodTypeAssertionFailed] += getPodStat.podTypeAssertionFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count this as other error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks!
pkg/neg/syncers/utils.go
Outdated
localEPCount[negtypes.PodMissing] += getPodStat.podMissing | ||
localEPCount[negtypes.PodNotFound] += getPodStat.podNotFound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two are similar. I would consolidate these into a single classification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks!
e36d9fe
to
8c2b996
Compare
pkg/neg/syncers/utils.go
Outdated
type getPodStat struct { | ||
podMissing int | ||
podNotFound int | ||
podTypeAssertionFailed int | ||
podInvalid int | ||
otherError int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using a stat struct, would it be easier to return a map with the counts of the errors? Then you can use the same merge function to merge it into the local count. You can do something similar with all the other function. This will reduce some code as you won't have to enumerate through all the errors after calling a helper function. It will also reduce the need in knowing what errors are returned by a helper function.
5bc8d83
to
65abd66
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/retest |
1 similar comment
/retest |
Added metrics to collect the state of each endpoint and endpoint slice. These metrics are only for L7 endpoints.
65abd66
to
de9e231
Compare
@@ -145,7 +155,7 @@ func (sm *SyncerMetrics) DeleteSyncer(key negtypes.NegSyncerKey) { | |||
defer sm.mu.Unlock() | |||
delete(sm.syncerStatusMap, key) | |||
delete(sm.syncerEndpointStateMap, key) | |||
delete(sm.syncerEPSStateMap, key) | |||
delete(sm.syncerEndpointSliceStateMap, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sawsa307, swetharepakula The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Added metrics to collect the state of each endpoint and endpoint slice. This metrics is only for L7 endpoints.