
UVIP: Add beta criteria #4441

Closed
wants to merge 2 commits into from

Conversation

richabanker
Contributor

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 25, 2024
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 25, 2024
@richabanker richabanker marked this pull request as draft January 25, 2024 21:29
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 25, 2024
@k8s-ci-robot
Contributor

@richabanker: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-test | cfb02c5 | link | true | /test pull-enhancements-test |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

necessary that discovery exactly matches the capability of the system. So,
we will use the storage version objects to reconstruct a merged discovery
document and serve that in all apiservers.
- We will proxy the discovery request to the newest available apiserver in the cluster. In doing so, we will ensure that all discovery requests are handled with the future state of the cluster in mind.
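
For illustration, a minimal sketch of what routing discovery to the newest apiserver could look like, assuming the peer apiservers and their versions are already known (the `peer` type and helper names below are illustrative, not from the KEP):

```go
package uvipsketch

import (
	"net/http"
	"net/http/httputil"
	"net/url"

	"golang.org/x/mod/semver"
)

// peer describes an apiserver known to this server, e.g. learned from
// StorageVersion objects or peer endpoint information (illustrative only).
type peer struct {
	version string // e.g. "v1.31.0"
	url     *url.URL
}

// newestPeer picks the apiserver advertising the highest version.
// Assumes at least one peer is known.
func newestPeer(peers []peer) peer {
	newest := peers[0]
	for _, p := range peers[1:] {
		if semver.Compare(p.version, newest.version) > 0 {
			newest = p
		}
	}
	return newest
}

// discoveryProxy forwards discovery requests (/api, /apis) to the newest
// available apiserver instead of serving the local, possibly older, document.
func discoveryProxy(peers []peer) http.Handler {
	return httputil.NewSingleHostReverseProxy(newestPeer(peers).url)
}
```
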
Contributor
@jpbetz jpbetz Jan 25, 2024

This seems like a good simplification. The complexity of merging across versions struck me as something that would complicate the openapi code more than I want, and make caching optimizations even harder.

Note also that the compatibility version KEP requires introducing the ability to produce a discovery document for a version different than the binary version. The implementation will require conditionally showing APIs in discovery based on compatibility version. This feels like it could be extended to achieve what we want to achieve by merging. I'd rather not have two approaches to do basically the same thing in our codebase, so I'm in favor of letting the compatibility version KEP progress and keeping this feature simple (and since this is targeting a 1.31 implementation, we have some time). If we decide later that we need to do more around discovery, we can leverage the compatibility version work.

Thanks for working through the below list of garbage collection and namespace controller use cases.

Member
@liggitt liggitt Jan 26, 2024

I can't tell if this is proposing always redirecting discovery to the newest server, or only when servers disagree about the contents of discovery. I commented on both the GC and namespace lifecycle sections, but pointing both of those controllers at the new server in an upgrade that is stopping serving an API lets both of them orphan things. I thought we were trying to fix that gap.

Contributor Author

I was thinking the former, i.e. always redirecting discovery to the latest apiserver.
Thanks for clarifying my understanding of the GC and NLC behaviors. Yeah, I need to think more about how to handle the scenario where an API is not served by the latest apiserver.

Contributor Author

So, thinking more on how/whether to handle discovery requests as part of UVIP, I see the following options:
a) implement a merged discovery doc as was originally proposed
b) route discovery to the newest apiserver
c) do nothing (drop discovery handling from UVIP)

Going by #4441 (comment), the compatibility version feature will make it possible to implement merging far more simply. With compatibility version we would be able to track historical API information, and if we use that to serve discovery, we also avoid the problem of GC and NLC becoming unaware of previously-present-now-absent APIs (which happens if we route discovery to the newest apiserver as part of UVIP).

Maybe we just leave discovery as it is now and don't address it for UVIP? That way we avoid introducing significant changes that may not all make sense given the changes compatibility versions will make to discovery, and we avoid the code complexity and developer time required to implement merging if we can get it mostly for free once compatibility version finishes.

TL;DR: how about going with c)?

Contributor
@alexzielenski alexzielenski Jan 31, 2024

> Note also that the compatibility version KEP requires introducing the ability to produce a discovery document for a version different than the binary version

My initial assumption is this wouldn't require a code change. If the compatibility version KEP is changing the storage version and enabling/disabling certain versions before API registration, then I'd expect the discovery document to only reflect the installed/enabled APIs; but I'd need to review the logic again to say for sure.

> c) do nothing (drop discovery handling from UVIP)
> TL;DR: how about going with c)?

I've seen controllers use the discovery information as the source of truth for which GVRs are available. IMO it would be surprising to be able to send a request to an apiserver and have it proxied/served, but have that resource not be listed in discovery when I query the same service. Would controllers that rely on complete discovery be required to be aware of UVIP?
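
For context, this is the pattern being described, sketched with client-go (the kubeconfig path is a placeholder): a controller enumerating available GVRs from the discovery document and treating that list as the source of truth.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; a real controller would typically use
	// in-cluster configuration instead.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Enumerate the GVRs the cluster claims to serve, straight from discovery.
	lists, err := dc.ServerPreferredResources()
	if err != nil {
		// Partial discovery failures still return usable results.
		fmt.Println("partial discovery failure:", err)
	}
	for _, list := range lists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			continue
		}
		for _, r := range list.APIResources {
			fmt.Println(gv.WithResource(r.Name))
		}
	}
}
```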

Contributor Author

So, once we make UVIP compatibility version aware, we will route client requests only to those apiservers that are compatible with the client. That is also one of the beta criteria added here. And the compatibility version KEP will provide discovery for a compatibility range. Agreed that there will be some gap until compatibility-version-aware discovery happens (UVIP should remain beta till then), but eventually we should be in a state where UVIP proxying and discovery are aligned.


* Version Aware Proxying:

- With [compatibility versions](https://github.com/kubernetes/enhancements/pull/4395) information available for Kubernetes components, we can leverage it to perform version-aware proxying between API servers, ensuring that a request is proxied to an apiserver that is compatible with the client that generated it. Version-aware proxying lets us ensure that requests from clients (possibly) expecting to use a feature introduced in a later Kubernetes version are not incorrectly routed to apiservers at older versions.
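
For illustration, a rough sketch of the eligibility check that version-aware proxying implies, assuming the client's declared compatibility version and each apiserver's version are already known (the type and helper below are illustrative, not the KEP's design):

```go
package uvipsketch

import "golang.org/x/mod/semver"

// apiServer describes a peer apiserver and its advertised version
// (illustrative only, not the KEP's data model).
type apiServer struct {
	name    string
	version string // e.g. "v1.31"
}

// eligibleServers returns the apiservers whose version is at least the
// client's declared compatibility version, i.e. servers that can be expected
// to understand everything such a client may request.
func eligibleServers(clientVersion string, servers []apiServer) []apiServer {
	var eligible []apiServer
	for _, s := range servers {
		if semver.Compare(s.version, clientVersion) >= 0 {
			eligible = append(eligible, s)
		}
	}
	return eligible
}
```
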
Contributor

Do we need to describe how clients would declare their version to servers here?

Contributor Author

Added a suggestion for specifying this information in client request headers. Also added a TODO section in Design Details to flesh out those details.
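
For illustration, one way a client could attach such a header, sketched as a standard `http.RoundTripper` wrapper; the header name below is hypothetical and would be pinned down in Design Details:

```go
package uvipsketch

import "net/http"

// versionHeader is a hypothetical header name for illustration only; the
// actual header (if any) would be defined in the KEP's Design Details.
const versionHeader = "X-Kubernetes-Client-Compatibility-Version"

// compatVersionRoundTripper stamps each outgoing request with the client's
// compatibility version so a proxying apiserver could route it appropriately.
type compatVersionRoundTripper struct {
	version string
	next    http.RoundTripper
}

func (rt compatVersionRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone the request before mutating headers, per the RoundTripper contract.
	r := req.Clone(req.Context())
	r.Header.Set(versionHeader, rt.version)
	return rt.next.RoundTrip(r)
}
```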

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Richabanker
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This proposal will cause the GC to behave in the following way for the scenarios described:

- Case 1: if the resource is being deleted in the newer version
- Any actions taken on this resource during the upgrade will be rendered useless after the upgrade, when these objects are orphaned, so we can act fast and proxy this request to the newer apiserver. Garbage collection will be triggered for a deleted resource, which is correct in this case
Member

GC does not trigger for an absent resource type ... discovery routing to the new server would mean GC would think the type did not exist any more and would hang cleaning up children of those objects. I'm not sure that's good.

Contributor Author

Oh, my initial (incorrect) understanding was that the newest apiserver would respond with the list of resources available in the new state, and GC would assume that resources not found in this list were safe to delete.

But if the GC will only act on resources from the list returned by the newest apiserver, then yes, the objects of the deleted API will remain hanging, waiting for something to claim them. :/
I need to find more details on how garbage collection works for K8s objects. Will update here once I have more clarity. Thanks for the review though!
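
For reference, a simplified client-go sketch of how the garbage collector's working set of resource types can be derived from discovery (the real logic lives in the in-tree garbagecollector package); the key point is that a type absent from discovery is invisible to GC:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path for the sketch.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// GC only tracks types it can list, watch, and delete, and everything it
	// learns about comes from the discovery document.
	all, err := dc.ServerPreferredResources()
	if err != nil {
		// The real controller tolerates partial discovery failures.
		fmt.Println("partial discovery failure:", err)
	}
	deletable := discovery.FilteredBy(
		discovery.SupportsAllVerbs{Verbs: []string{"delete", "list", "watch"}},
		all,
	)
	for _, list := range deletable {
		for _, r := range list.APIResources {
			fmt.Printf("%s %s\n", list.GroupVersion, r.Name)
		}
	}
}
```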

- Case 1: if the resource is being deleted in the newer version
- Any actions taken on this resource during the upgrade will be rendered useless after the upgrade, when these objects are orphaned, so we can act fast and proxy this request to the newer apiserver. Garbage collection will be triggered for a deleted resource, which is correct in this case
- Case 2: if the resource is being introduced in the newer version
- If the resource is being introduced in the newer version, proxying the discovery request to the newer apiserver is the right thing to do anyway. Garbage collection will be blocked, which is also correct
Member

I'm not sure what "will be blocked" means... can you clarify that?

Contributor Author

Ah, I just meant that GC would see that objects of the newly introduced API are available (as the discovery doc returned by the newest apiserver would reflect that), and if such objects are being referenced, it would not attempt to delete them.

This proposal will cause the NLC to behave in the following way for the scenarios described:

- Case 1: if the resource is being deleted in the newer version
- If the resource is actually being deleted in the newer version, namespace deletion will be triggered, which makes sense for the eventual state of the cluster.
Member

Routing discovery to the new server means the namespace lifecycle controller will not be aware the resource type exists via discovery, and will orphan the resources rather than cleaning them up when handling a namespace deletion. That doesn't seem good.

Contributor Author

Updated my understanding of the NLC's behavior. Though it's unclear now how serving the latest discovery doc will solve this problem.

- Case 1: if the resource is being deleted in the newer version
- If the resource is actually being deleted in the newer version, namespace deletion will be triggered, which makes sense for the eventual state of the cluster.
- Case 2: If the resource is being introduced in the newer version
- namespace deletion is not triggered, as it should not be, since these resources exist in the newer version of the cluster
Member

> namespace deletion is not triggered, as it should not be

I don't understand this... routing discovery to the new server would make the namespace controller aware of the new resource type, so when it is cleaning up any namespace, it will check for instances of that new type and delete them, like we want it to.
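
For reference, a simplified client-go sketch of the discovery call that namespace cleanup depends on; if a resource type is missing from this list, its instances in a terminating namespace are never enumerated or deleted:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path for the sketch.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// During namespace deletion, the controller enumerates namespaced types
	// from discovery and deletes their instances in the namespace; a type
	// missing from this list is simply skipped, i.e. its objects are orphaned.
	namespaced, err := dc.ServerPreferredNamespacedResources()
	if err != nil {
		fmt.Println("partial discovery failure:", err)
	}
	for _, list := range namespaced {
		for _, r := range list.APIResources {
			fmt.Printf("%s %s\n", list.GroupVersion, r.Name)
		}
	}
}
```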

Contributor Author

Updated. Thanks.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
