Surface "stale" GroupVersions from AggregatedDiscovery #116145
Conversation
/sig api-machinery
/assign @Jefftree
/retest
gv := schema.GroupVersion{Group: g.Name, Version: v.Version}
if v.Freshness == apidiscovery.DiscoveryFreshnessStale {
	klog.V(5).Infof("stale group/version omitted from discovery: %v", gv)
	continue
In the legacy scenario, we return the failures in the error &ErrGroupDiscoveryFailed{Groups: failedGroups} (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L428) for any GVs that were unreachable. Can we propagate this information to ServerGroupsAndResources() in the aggregated discovery as well?
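A minimal sketch of the propagation being asked for, using local stand-in types (the real `schema.GroupVersion`, `DiscoveryFreshness`, and `ErrGroupDiscoveryFailed` live in k8s.io/apimachinery, k8s.io/api/apidiscovery, and k8s.io/client-go; the `splitFresh` helper is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// GroupVersion is a simplified stand-in for schema.GroupVersion.
type GroupVersion struct{ Group, Version string }

func (gv GroupVersion) String() string { return gv.Group + "/" + gv.Version }

// FreshnessStale stands in for apidiscovery.DiscoveryFreshnessStale.
const FreshnessStale = "Stale"

// ErrGroupDiscoveryFailed mirrors the client-go error that aggregates
// per-GroupVersion discovery failures into a single map.
type ErrGroupDiscoveryFailed struct {
	Groups map[GroupVersion]error
}

func (e *ErrGroupDiscoveryFailed) Error() string {
	var msgs []string
	for gv, err := range e.Groups {
		msgs = append(msgs, fmt.Sprintf("%s: %v", gv, err))
	}
	sort.Strings(msgs)
	return fmt.Sprintf("unable to retrieve the complete list of server APIs: %v", msgs)
}

// splitFresh skips stale GroupVersions but records them, so the caller
// can surface the failures alongside successfully discovered GVs.
func splitFresh(versions map[GroupVersion]string) (fresh []GroupVersion, ferr error) {
	failed := map[GroupVersion]error{}
	for gv, freshness := range versions {
		if freshness == FreshnessStale {
			failed[gv] = fmt.Errorf("stale GroupVersion discovery: %v", gv)
			continue
		}
		fresh = append(fresh, gv)
	}
	if len(failed) > 0 {
		ferr = &ErrGroupDiscoveryFailed{Groups: failed}
	}
	return fresh, ferr
}

func main() {
	fresh, err := splitFresh(map[GroupVersion]string{
		{"apps", "v1"}:       "Current",
		{"example.io", "v1"}: FreshnessStale,
	})
	fmt.Println(len(fresh), err != nil)
}
```

The key point of the suggestion: instead of silently `continue`-ing past stale GVs, they are collected and wrapped so callers of ServerGroupsAndResources() see the same error shape as legacy discovery.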
Addressed. Lots of plumbing complexity. Let me know what you think.
Does it make sense to return the resources even if the GV is stale? I noticed that's what we do currently: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L531, though I'm not really sure about the use case for it.
This question still stands but I don't know what the answer should be. Mostly pertains to aggregated apiservers being unavailable since local apiservers almost always will not encounter this. cc @deads2k
I don't think unaggregated discovery is returning resources from failed GVs. The resources are only added if they are non-nil here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L529. But if the GV failed, then the resource list is always nil here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L345. We can discount the 404 core/v1 toleration here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L342. The comment there, I believe, is just wrong now:
// even in case of error, some fallback might have been returned
groupVersionResources[groupVersion] = apiResourceList
Ah, the resource list will always be nil in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L529, thanks for confirming! Yeah, I agree the comment is wrong; we shouldn't ever hit that case.
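The behavior being agreed on above can be sketched with local stand-ins (the `fetchResources` helper and the `broken.example.io/v1` group are hypothetical; the real logic lives in client-go's discovery_client.go):

```go
package main

import (
	"errors"
	"fmt"
)

// fetchResources stands in for ServerResourcesForGroupVersion: on a
// failed GroupVersion it returns a nil list plus an error, which is why
// the "some fallback might have been returned" comment no longer applies.
func fetchResources(gv string) ([]string, error) {
	if gv == "broken.example.io/v1" {
		return nil, errors.New("discovery failed")
	}
	return []string{"widgets"}, nil
}

func main() {
	result := map[string][]string{}
	failed := map[string]error{}
	for _, gv := range []string{"apps/v1", "broken.example.io/v1"} {
		list, err := fetchResources(gv)
		if err != nil {
			failed[gv] = err
		}
		// Mirrors the unaggregated client: only non-nil lists are
		// recorded, so a failed GV never contributes resources.
		if list != nil {
			result[gv] = list
		}
	}
	fmt.Println(len(result), len(failed))
}
```

Since the error path always yields a nil list, the non-nil guard already excludes failed GVs from the resource map, and the fallback branch is dead code.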
Not having resources for stale discovery information seems valid since otherwise we will not converge with kube-apiservers that have restarted. If we eventually have a persisted set of resources so stale resources would be consistent among all kube-apiservers, I'd be in favor of doing so.
/retest
/retest
/retest
/retest
/retest
@@ -24,44 +24,72 @@ import (
 	"k8s.io/apimachinery/pkg/runtime/schema"
 )

+// StaleGroupVersionError encapsulates a failed GroupVersion marked "stale"
+// in the returned AggregatedDiscovery format.
+type StaleGroupVersionError struct {
Why not reuse the existing struct? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L358
I'd imagine we may have clients who already cast to that type, and introducing a new type might be a bit confusing? Logically, stale is equivalent to discovery failed.
The ErrGroupDiscoveryFailed error is supposed to encapsulate all the Group/Versions which failed, which is why it stores a map[schema.GroupVersion]error. So it would be awkward (especially when printing the error) if there were a hierarchy of these errors. BTW, the ErrGroupDiscoveryFailed error is what gets returned to the discovery interface user, storing the individual errors in the map (as an example, see ServerGroupsAndResources).
Thanks!
// must be surfaced to the caller as failed Group/Versions.
var ferr error
if len(failedGVs) > 0 {
	ferr = &ErrGroupDiscoveryFailed{Groups: failedGVs}
@Jefftree Example: ErrGroupDiscoveryFailed encapsulates the stale GVs when it is returned here.
Ah got it, it's still encapsulated. Thank you!
version.GroupVersion = gv.String()
version.Version = v.Version
group.Versions = append(group.Versions, version)
if i == 0 {
	// PreferredVersion is first non-stale Version
Is this true? I'd imagine the preferred gv wouldn't change based on stale and should be based on what was registered with the server?
This is a good question; I'd like to hear from others. I'm not sure it would be possible to have the PreferredVersion be a failed GroupVersion. The ServerPreferredResources and ServerPreferredNamespacedResources would be returning resources from failed GVs. In fact, those resource lists would probably be empty.
I believe the current aggregated functionality of not allowing a failed GroupVersion to be the PreferredVersion is correct. It appears to be the same as the unaggregated functionality. The following code shows that failed GroupVersion discovery requests return nil for the resource list: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L345. So a failed GroupVersion is not a PreferredVersion in the unaggregated case.
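The "first non-stale version is preferred" rule under discussion can be sketched with a local stand-in (the `Version` struct and `preferredVersion` helper are hypothetical; the real version list comes from the AggregatedDiscovery response):

```go
package main

import "fmt"

// Version is a simplified stand-in for an entry in an aggregated
// discovery group's version list.
type Version struct {
	Name      string
	Freshness string // "Current" or "Stale"
}

// preferredVersion returns the first non-stale version, matching the
// PR's behavior: a failed (stale) GroupVersion is never preferred.
// The server's version ordering is preserved, so freshness only skips
// entries; it does not reorder them.
func preferredVersion(versions []Version) (string, bool) {
	for _, v := range versions {
		if v.Freshness == "Stale" {
			continue
		}
		return v.Name, true
	}
	return "", false // every version is stale
}

func main() {
	pv, ok := preferredVersion([]Version{
		{Name: "v2", Freshness: "Stale"},
		{Name: "v1", Freshness: "Current"},
	})
	fmt.Println(pv, ok) // v1 true
}
```

Note the edge case the reviewers are circling: if every version of a group is stale, there is no preferred version at all, which is consistent with the unaggregated client returning nil resource lists for failed GVs.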
/cc @deads2k
Overall it looks like it fits nicely, but I think there may be a problem with aggregated apiservers that don't include /api. Commented down below.
version.GroupVersion = gv.String()
version.Version = v.Version
group.Versions = append(group.Versions, version)
if i == 0 {
	// PreferredVersion is first non-stale Version
	if pvSet == false {
Is pvSet different than len(group.PreferredVersion) > 0?
I removed the pvSet and am now comparing against the empty struct; group.PreferredVersion does not support len.
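The zero-value comparison that replaced the pvSet flag can be sketched as follows (the `GroupVersionForDiscovery` stand-in mirrors the metav1 struct of the same name, which is comparable because it contains only strings):

```go
package main

import "fmt"

// GroupVersionForDiscovery is a stand-in for the metav1 struct of the
// same name; it is comparable, so it can be checked against its zero
// value instead of tracking a separate pvSet boolean.
type GroupVersionForDiscovery struct {
	GroupVersion string
	Version      string
}

func main() {
	var preferred GroupVersionForDiscovery
	candidates := []GroupVersionForDiscovery{
		{GroupVersion: "apps/v1", Version: "v1"},
		{GroupVersion: "apps/v1beta1", Version: "v1beta1"},
	}
	for _, c := range candidates {
		// Only set the preferred version once: the zero-value check
		// replaces the pvSet flag from the earlier revision.
		if preferred == (GroupVersionForDiscovery{}) {
			preferred = c
		}
	}
	fmt.Println(preferred.GroupVersion) // apps/v1
}
```

This works because struct equality is well-defined for comparable field types; len() is indeed only defined for arrays, slices, maps, strings, and channels, not structs.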
@@ -233,21 +247,22 @@ func (d *DiscoveryClient) downloadLegacy() (*metav1.APIGroupList, map[schema.Gro
 	if err != nil {
 		// Tolerate 404, since aggregated api servers can return it.
 		if errors.IsNotFound(err) {
-			return &metav1.APIGroupList{}, nil, nil
+			return &metav1.APIGroupList{}, nil, nil, nil
If err == nil, shouldn't the failedGVs be non-nil, so you can add to it a few lines up in the diff? If so, I think this would indicate a missing unit test.
You are correct...and digging into this I found that aggregated discovery was not correctly tolerating 404 for core/v1. I have rectified this, and added a new unit test. Please have a look.
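The 404 toleration being fixed here can be sketched with local stand-ins (the `downloadLegacy` signature is simplified to strings, and `errNotFound` stands in for a real apimachinery StatusError checked via errors.IsNotFound):

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a 404 from the apiserver; client-go checks
// this with errors.IsNotFound on a real StatusError.
var errNotFound = errors.New("the server could not find the requested resource")

// downloadLegacy sketches the toleration described above: a 404 on the
// legacy /api endpoint (possible for aggregated apiservers that don't
// serve it) yields an empty group list and no error, instead of
// failing discovery outright. Any other error is surfaced as-is.
func downloadLegacy(fetch func() ([]string, error)) ([]string, map[string]error, error) {
	groups, err := fetch()
	if err != nil {
		if errors.Is(err, errNotFound) {
			// Tolerate 404: return an empty list, no failed GVs, no error.
			return []string{}, nil, nil
		}
		return nil, nil, err
	}
	return groups, map[string]error{}, nil
}

func main() {
	groups, failed, err := downloadLegacy(func() ([]string, error) {
		return nil, errNotFound
	})
	fmt.Println(len(groups), failed == nil, err == nil)
}
```

The missing unit test the reviewer asked for would exercise exactly this branch: a 404 fetch must produce an empty, non-error result rather than propagating the failure.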
/triage accepted
Thanks, this lgtm. I see @Jefftree is still in here so I'll approve and leave lgtm with him. /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, seans3
The full list of commands accepted by this bot can be found here. The pull request process is described here.
All my comments were addressed, thanks Sean! /lgtm
LGTM label has been added. Git tree hash: ffd951a9cee8713042f85172c6e5b6ebee691a66
Aggregated discovery marks failed GroupVersions as stale in the Freshness field. This information is now surfaced to Discovery interface callers within existing ErrGroupDiscoveryFailed errors. Legacy discovery surfaces a failed GroupVersion as an ErrGroupDiscoveryFailed; now AggregatedDiscovery also surfaces failed GroupVersions.

Test coverage:
client-go/discovery: 84.6% -> 85.9%
client-go/discovery/cached/memory: 88.2% -> 88.9%

/kind cleanup