Surface "stale" GroupVersions from AggregatedDiscovery #116145
Conversation
/sig api-machinery
/assign @Jefftree
/retest
gv := schema.GroupVersion{Group: g.Name, Version: v.Version}
if v.Freshness == apidiscovery.DiscoveryFreshnessStale {
	klog.V(5).Infof("stale group/version omitted from discovery: %v", gv)
	continue
In the legacy scenario, we return the failures in the error &ErrGroupDiscoveryFailed{Groups: failedGroups} (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L428) for any GVs that were unreachable. Can we propagate this information to ServerGroupsAndResources() in the aggregated discovery as well?
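A minimal sketch of the propagation being asked for, using local stand-in types (the real `schema.GroupVersion`, `DiscoveryFreshness`, and `ErrGroupDiscoveryFailed` live in k8s.io/apimachinery, k8s.io/api/apidiscovery, and k8s.io/client-go; the `splitFresh` helper is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// GroupVersion is a simplified stand-in for schema.GroupVersion.
type GroupVersion struct{ Group, Version string }

func (gv GroupVersion) String() string { return gv.Group + "/" + gv.Version }

// FreshnessStale stands in for apidiscovery.DiscoveryFreshnessStale.
const FreshnessStale = "Stale"

// ErrGroupDiscoveryFailed mirrors the client-go error that aggregates
// per-GroupVersion discovery failures into a single map.
type ErrGroupDiscoveryFailed struct {
	Groups map[GroupVersion]error
}

func (e *ErrGroupDiscoveryFailed) Error() string {
	var msgs []string
	for gv, err := range e.Groups {
		msgs = append(msgs, fmt.Sprintf("%s: %v", gv, err))
	}
	sort.Strings(msgs)
	return fmt.Sprintf("unable to retrieve the complete list of server APIs: %v", msgs)
}

// splitFresh skips stale GroupVersions but records them, so the caller
// can surface the failures alongside successfully discovered GVs.
func splitFresh(versions map[GroupVersion]string) (fresh []GroupVersion, ferr error) {
	failed := map[GroupVersion]error{}
	for gv, freshness := range versions {
		if freshness == FreshnessStale {
			failed[gv] = fmt.Errorf("stale GroupVersion discovery: %v", gv)
			continue
		}
		fresh = append(fresh, gv)
	}
	if len(failed) > 0 {
		ferr = &ErrGroupDiscoveryFailed{Groups: failed}
	}
	return fresh, ferr
}

func main() {
	fresh, err := splitFresh(map[GroupVersion]string{
		{"apps", "v1"}:       "Current",
		{"example.io", "v1"}: FreshnessStale,
	})
	fmt.Println(len(fresh), err != nil)
}
```

The key point of the suggestion: instead of silently `continue`-ing past stale GVs, they are collected and wrapped so callers of ServerGroupsAndResources() see the same error shape as legacy discovery.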
Addressed. Lots of plumbing complexity. Let me know what you think.
Does it make sense to return the resources even if the GV is stale? I noticed that's what we do currently: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L531, though I'm not really sure about the use case for it.
This question still stands but I don't know what the answer should be. Mostly pertains to aggregated apiservers being unavailable since local apiservers almost always will not encounter this. cc @deads2k
I don't think unaggregated discovery is returning resources from failed GVs. The resources are only added if they are non-nil here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L529. But if the GV failed, then the resource list is always nil here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L345. We can discount the 404 core/v1 toleration here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L342. The comment there, I believe, is just wrong now:
// even in case of error, some fallback might have been returned
groupVersionResources[groupVersion] = apiResourceList
Ah, the resource list will always be nil in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L529, thanks for confirming! Yeah, I agree the comment is wrong; we shouldn't ever hit that case.
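The behavior being agreed on above can be sketched with local stand-ins (the `fetchResources` helper and the `broken.example.io/v1` group are hypothetical; the real logic lives in client-go's discovery_client.go):

```go
package main

import (
	"errors"
	"fmt"
)

// fetchResources stands in for ServerResourcesForGroupVersion: on a
// failed GroupVersion it returns a nil list plus an error, which is why
// the "some fallback might have been returned" comment no longer applies.
func fetchResources(gv string) ([]string, error) {
	if gv == "broken.example.io/v1" {
		return nil, errors.New("discovery failed")
	}
	return []string{"widgets"}, nil
}

func main() {
	result := map[string][]string{}
	failed := map[string]error{}
	for _, gv := range []string{"apps/v1", "broken.example.io/v1"} {
		list, err := fetchResources(gv)
		if err != nil {
			failed[gv] = err
		}
		// Mirrors the unaggregated client: only non-nil lists are
		// recorded, so a failed GV never contributes resources.
		if list != nil {
			result[gv] = list
		}
	}
	fmt.Println(len(result), len(failed))
}
```

Since the error path always yields a nil list, the non-nil guard already excludes failed GVs from the resource map, and the fallback branch is dead code.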
Not having resources for stale discovery information seems valid since otherwise we will not converge with kube-apiservers that have restarted. If we eventually have a persisted set of resources so stale resources would be consistent among all kube-apiservers, I'd be in favor of doing so.
/retest
/retest
/retest
/retest
/retest
@@ -24,44 +24,72 @@ import (
 	"k8s.io/apimachinery/pkg/runtime/schema"
 )

+// StaleGroupVersionError encapsulates a failed GroupVersion marked "stale"
+// in the returned AggregatedDiscovery format.
+type StaleGroupVersionError struct {
Why not reuse the existing struct? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L358
I'd imagine we may have clients who already cast to that type, and introducing a new type might be a bit confusing? Logically, stale is equivalent to discovery failed.
The ErrGroupDiscoveryFailed error is supposed to encapsulate all the Group/Versions which failed, which is why it stores a map[schema.GroupVersion]error. So it would be awkward (especially when printing the error) if there were a hierarchy of these errors. BTW, the ErrGroupDiscoveryFailed error is what gets returned to the discovery interface user, storing the individual errors in the map (as an example, see ServerGroupsAndResources).
Thanks!
// must be surfaced to the caller as failed Group/Versions.
var ferr error
if len(failedGVs) > 0 {
	ferr = &ErrGroupDiscoveryFailed{Groups: failedGVs}
@Jefftree Example: ErrGroupDiscoveryFailed encapsulates the stale GVs when it is returned here.
Ah got it, it's still encapsulated. Thank you!
version.GroupVersion = gv.String()
version.Version = v.Version
group.Versions = append(group.Versions, version)
if i == 0 {
	// PreferredVersion is first non-stale Version
Is this true? I'd imagine the preferred gv wouldn't change based on stale and should be based on what was registered with the server?
This is a good question; I'd like to hear from others. I'm not sure it would be possible to have the PreferredVersion be a failed GroupVersion. The ServerPreferredResources and ServerPreferredNamespacedResources would be returning resources from failed GVs. In fact, those resource lists would probably be empty.
I believe the current aggregated functionality of not allowing a failed GroupVersion to be the PreferredVersion is correct. It appears to be the same as the unaggregated functionality. The following code shows that failed GroupVersion discovery requests return nil for the resource list: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L345. So a failed GroupVersion is not a PreferredVersion in the unaggregated case.
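The "first non-stale version is preferred" rule under discussion can be sketched with a local stand-in (the `Version` struct and `preferredVersion` helper are hypothetical; the real version list comes from the AggregatedDiscovery response):

```go
package main

import "fmt"

// Version is a simplified stand-in for an entry in an aggregated
// discovery group's version list.
type Version struct {
	Name      string
	Freshness string // "Current" or "Stale"
}

// preferredVersion returns the first non-stale version, matching the
// PR's behavior: a failed (stale) GroupVersion is never preferred.
// The server's version ordering is preserved, so freshness only skips
// entries; it does not reorder them.
func preferredVersion(versions []Version) (string, bool) {
	for _, v := range versions {
		if v.Freshness == "Stale" {
			continue
		}
		return v.Name, true
	}
	return "", false // every version is stale
}

func main() {
	pv, ok := preferredVersion([]Version{
		{Name: "v2", Freshness: "Stale"},
		{Name: "v1", Freshness: "Current"},
	})
	fmt.Println(pv, ok) // v1 true
}
```

Note the edge case the reviewers are circling: if every version of a group is stale, there is no preferred version at all, which is consistent with the unaggregated client returning nil resource lists for failed GVs.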
/cc @deads2k
Overall it looks like it fits nicely, but I think there may be a problem with aggregated apiservers that don't include /api. Commented down below.
version.GroupVersion = gv.String()
version.Version = v.Version
group.Versions = append(group.Versions, version)
if i == 0 {
	// PreferredVersion is first non-stale Version
	if pvSet == false {
Is pvSet different than len(group.PreferredVersion) > 0?
I removed the pvSet and am now comparing against the empty struct; group.PreferredVersion does not support len.
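The zero-value comparison that replaced the pvSet flag can be sketched as follows (the `GroupVersionForDiscovery` stand-in mirrors the metav1 struct of the same name, which is comparable because it contains only strings):

```go
package main

import "fmt"

// GroupVersionForDiscovery is a stand-in for the metav1 struct of the
// same name; it is comparable, so it can be checked against its zero
// value instead of tracking a separate pvSet boolean.
type GroupVersionForDiscovery struct {
	GroupVersion string
	Version      string
}

func main() {
	var preferred GroupVersionForDiscovery
	candidates := []GroupVersionForDiscovery{
		{GroupVersion: "apps/v1", Version: "v1"},
		{GroupVersion: "apps/v1beta1", Version: "v1beta1"},
	}
	for _, c := range candidates {
		// Only set the preferred version once: the zero-value check
		// replaces the pvSet flag from the earlier revision.
		if preferred == (GroupVersionForDiscovery{}) {
			preferred = c
		}
	}
	fmt.Println(preferred.GroupVersion) // apps/v1
}
```

This works because struct equality is well-defined for comparable field types; len() is indeed only defined for arrays, slices, maps, strings, and channels, not structs.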
@@ -233,21 +247,22 @@ func (d *DiscoveryClient) downloadLegacy() (*metav1.APIGroupList, map[schema.Gro
 	if err != nil {
 		// Tolerate 404, since aggregated api servers can return it.
 		if errors.IsNotFound(err) {
-			return &metav1.APIGroupList{}, nil, nil
+			return &metav1.APIGroupList{}, nil, nil, nil
If err == nil, shouldn't the failedGVs be non-nil, so you can add to it a few lines up in the diff? If so, I think this would indicate a missing unit test.
You are correct...and digging into this I found that aggregated discovery was not correctly tolerating 404 for core/v1. I have rectified this, and added a new unit test. Please have a look.
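The 404 toleration being fixed here can be sketched with local stand-ins (the `downloadLegacy` signature is simplified to strings, and `errNotFound` stands in for a real apimachinery StatusError checked via errors.IsNotFound):

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a 404 from the apiserver; client-go checks
// this with errors.IsNotFound on a real StatusError.
var errNotFound = errors.New("the server could not find the requested resource")

// downloadLegacy sketches the toleration described above: a 404 on the
// legacy /api endpoint (possible for aggregated apiservers that don't
// serve it) yields an empty group list and no error, instead of
// failing discovery outright. Any other error is surfaced as-is.
func downloadLegacy(fetch func() ([]string, error)) ([]string, map[string]error, error) {
	groups, err := fetch()
	if err != nil {
		if errors.Is(err, errNotFound) {
			// Tolerate 404: return an empty list, no failed GVs, no error.
			return []string{}, nil, nil
		}
		return nil, nil, err
	}
	return groups, map[string]error{}, nil
}

func main() {
	groups, failed, err := downloadLegacy(func() ([]string, error) {
		return nil, errNotFound
	})
	fmt.Println(len(groups), failed == nil, err == nil)
}
```

The missing unit test the reviewer asked for would exercise exactly this branch: a 404 fetch must produce an empty, non-error result rather than propagating the failure.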
/triage accepted
Thanks, this lgtm. I see @Jefftree is still in here so I'll approve and leave lgtm with him. /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, seans3
The full list of commands accepted by this bot can be found here. The pull request process is described here.
All my comments were addressed, thanks Sean! /lgtm
LGTM label has been added. Git tree hash: ffd951a9cee8713042f85172c6e5b6ebee691a66
Aggregated discovery marks failed GroupVersions as stale in the Freshness field. This information is now surfaced to Discovery interface callers within existing ErrGroupDiscoveryFailed errors. Legacy discovery surfaces a failed GroupVersion as an ErrGroupDiscoveryFailed; now AggregatedDiscovery also surfaces failed GroupVersions.

Test coverage:
client-go/discovery: 84.6% -> 85.9%
client-go/discovery/cached/memory: 88.2% -> 88.9%

/kind cleanup