Add metrics for endpoint and endpoint slice state #1919

Merged
merged 4 commits into from
May 3, 2023

Conversation

sawsa307
Contributor

@sawsa307 sawsa307 commented Jan 26, 2023

Added metrics to collect the state of each endpoint and endpoint slice. These metrics are only for L7 endpoints.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 26, 2023
@k8s-ci-robot
Contributor

Hi @sawsa307. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sawsa307
Contributor Author

/assign @swetharepakula

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 26, 2023
@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch 3 times, most recently from dd692aa to 11f5710 Compare February 2, 2023 19:35
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2023
@sawsa307 sawsa307 marked this pull request as draft February 4, 2023 05:51
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2023
@bowei
Member

bowei commented Feb 9, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 9, 2023
@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch from 11f5710 to 767e13b Compare March 1, 2023 00:40
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 1, 2023
@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch from 767e13b to a74b700 Compare March 1, 2023 22:07
@sawsa307 sawsa307 marked this pull request as ready for review March 1, 2023 22:07
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 1, 2023
@k8s-ci-robot k8s-ci-robot requested a review from freehan March 1, 2023 22:07
@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch 2 times, most recently from 0ac3c6b to f769999 Compare March 5, 2023 11:07
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2023
@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch from f769999 to e206616 Compare March 21, 2023 22:07
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2023
@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch 4 times, most recently from 9f133fd to e36d9fe Compare May 1, 2023 19:12
for state, count := range epsStateCount {
label := string(state)
if state != negtypes.Total {
label = fmt.Sprintf("Contains%sEndpoint", string(label))
Member

I don't think we need to do this. We can use the labels as they are currently.

Contributor Author

Updated!


syncerEndpointSliceStateLabels = []string{
"endpoint_slice_state", // state of endpoint slice
"calculation_type", // type of endpoint calculation
Member

we don't need this.

Contributor Author

Updated!

// pod is used for label propagation
_, getPodErr := getEndpointPod(endpointAddress, podLister)
klog.Errorf("Detected unexpected error when getting zone: %v", getZoneErr)
if returnErr == nil {
Member

Storing the error we want to return isn't a good pattern. I also don't think we gain much from trying to save this error, because we will immediately have another sync in degraded mode that will give us these metrics.

Contributor Author

Updated. Now we return the error immediately.

localEPCount[negtypes.PodLabelMismatch] += validatePodStat.podLabelMismatch
}
if getPodErr != nil || validateErr != nil || getZoneErr != nil || checkIPErr != nil || otherError != nil {
localEPCount[negtypes.InvalidField] += 1
Member

What is otherError being used for? I originally intended InvalidField for when there were fewer error classifications, but now that we track more detail, I don't think we need to track InvalidField.

Contributor Author

It is used as a catch-all. Since we no longer need to track invalidField, it is removed.

klog.Errorf("Endpoint %q in Endpoints %s/%s corresponds to an invalid pod: %v, skipping", endpointAddress.Addresses, ed.Meta.Namespace, ed.Meta.Name, validateErr)
localEPCount[negtypes.PodTerminal] += validatePodStat.podTerminal
localEPCount[negtypes.NodeNotFound] += validatePodStat.nodeNotFound
localEPCount[negtypes.NodeTypeAssertionFailed] += validatePodStat.nodeTypeAssertionFailed
Member

Let's count this as an other error.

Contributor Author

Updated. Thanks!

klog.Errorf("Endpoint %q in Endpoints %s/%s receives error when getting pod: %v, skipping", endpointAddress.Addresses, ed.Meta.Namespace, ed.Meta.Name, getPodErr)
localEPCount[negtypes.PodMissing] += getPodStat.podMissing
localEPCount[negtypes.PodNotFound] += getPodStat.podNotFound
localEPCount[negtypes.PodTypeAssertionFailed] += getPodStat.podTypeAssertionFailed
Member

count this as other error

Contributor Author

Updated. Thanks!

Comment on lines 447 to 448
localEPCount[negtypes.PodMissing] += getPodStat.podMissing
localEPCount[negtypes.PodNotFound] += getPodStat.podNotFound
Member

these two are similar. I would consolidate these into a single classification

Contributor Author

Updated. Thanks!

@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch from e36d9fe to 8c2b996 Compare May 1, 2023 23:35
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 1, 2023
Comment on lines 369 to 371
type getPodStat struct {
podMissing int
podNotFound int
podTypeAssertionFailed int
podInvalid int
otherError int
Member

Instead of using a stat struct, would it be easier to return a map with the counts of the errors? Then you can use the same merge function to merge it into the local count. You can do something similar with all the other functions. This will reduce some code, as you won't have to enumerate all the errors after calling a helper function. It will also reduce the need to know which errors a helper function returns.
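
The suggestion above could look roughly like the following Go sketch. It assumes a map keyed by state and a single merge helper; `State`, `StateCount`, `merge`, `getPodCounts`, and `validatePodCounts` are hypothetical names for illustration and are not taken from the merged code.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the endpoint state type.
type State string

const (
	PodNotFound  State = "PodNotFound"
	PodTerminal  State = "PodTerminal"
	NodeNotFound State = "NodeNotFound"
	Other        State = "Other"
)

// StateCount maps each state to the number of endpoints seen in it.
type StateCount map[State]int

// merge adds every count in src into dst. Callers do not need to know
// which states a helper can report.
func merge(dst, src StateCount) {
	for state, n := range src {
		dst[state] += n
	}
}

// getPodCounts is a hypothetical helper that reports its outcome as a
// map of counts rather than a dedicated stat struct.
func getPodCounts(found bool) StateCount {
	if !found {
		return StateCount{PodNotFound: 1}
	}
	return StateCount{}
}

// validatePodCounts is a second hypothetical helper using the same shape.
func validatePodCounts(terminal bool) StateCount {
	if terminal {
		return StateCount{PodTerminal: 1}
	}
	return StateCount{}
}

func main() {
	local := StateCount{}
	merge(local, getPodCounts(false))
	merge(local, getPodCounts(false))
	merge(local, validatePodCounts(true))
	fmt.Println(local[PodNotFound], local[PodTerminal])
}
```

With this shape, adding a new classification to a helper requires no change at the call site, since the same `merge` call picks it up.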

@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch 3 times, most recently from 5bc8d83 to 65abd66 Compare May 2, 2023 22:30
@swetharepakula
Member

/retest

Member

@swetharepakula swetharepakula left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 2, 2023
@sawsa307
Contributor Author

sawsa307 commented May 3, 2023

/retest

1 similar comment
@sawsa307
Contributor Author

sawsa307 commented May 3, 2023

/retest

@sawsa307 sawsa307 force-pushed the neg-ep-state-metric branch from 65abd66 to de9e231 Compare May 3, 2023 02:29
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2023
@@ -145,7 +155,7 @@ func (sm *SyncerMetrics) DeleteSyncer(key negtypes.NegSyncerKey) {
defer sm.mu.Unlock()
delete(sm.syncerStatusMap, key)
delete(sm.syncerEndpointStateMap, key)
delete(sm.syncerEPSStateMap, key)
delete(sm.syncerEndpointSliceStateMap, key)
Contributor Author

@sawsa307 sawsa307 May 3, 2023

Resolve conflict.

Member

@swetharepakula swetharepakula left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sawsa307, swetharepakula

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0a8854d into kubernetes:master May 3, 2023
@sawsa307 sawsa307 deleted the neg-ep-state-metric branch September 2, 2023 20:42
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants