
RFC: Staging component changes in Knative #2639

Closed
mattmoor opened this issue Dec 5, 2018 · 9 comments
Labels
area/test-and-release, kind/doc, kind/feature, kind/process, P1

@mattmoor (Member) commented Dec 5, 2018

We've now had multiple instances where we wanted to make some sort of change to Knative but couldn't simply "rip the bandaid off", because doing so would make upgrades disruptive. We need to become more disciplined about how we handle such changes.

The goal of this proposal is to keep the install/upgrade process as close to `kubectl apply -f foo.yaml` as possible, by requiring that we stage the rollout of changes in a manner that doesn't break deployed services during the upgrade.

The general proposal in the abstract is:

  1. In 0.a, add a new component or feature.
  2. In 0.b, enable that feature by default [1].
  3. In 0.c, require that the feature be enabled, deprecating the things it replaces.
  4. In 0.d, remove the deprecated old things (maybe just code).

Where: `a <= b <= c < d`

You might be thinking: "That's a lot of releases!", but I expect that for most rollouts `a = b = c = d - 1`.

[1] - If we want rollback safety, then in some cases we may want this to be `a < b`.

Some real examples

Milestones here are purely illustrative

Example: kbuffer

  1. In 0.3, introduce the kbuffer as a replacement for the activator.
  2. In 0.3, have the Route controller rewrite ClusterIngress resources from the activator service to the kbuffer service.
  3. In 0.4, exclude the activator components from the release (and either `--prune` should clean up, or a Job should; a sketch of such a Job follows).
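For the Job route in step 3, here is a minimal sketch of what a cleanup Job might look like; the Job name, image, and ServiceAccount are illustrative assumptions, and the ServiceAccount would need RBAC permission to delete the activator's resources:

```sh
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-activator          # hypothetical name
  namespace: knative-serving
spec:
  template:
    spec:
      serviceAccountName: controller   # assumed to be allowed to delete these
      restartPolicy: Never
      containers:
      - name: cleanup
        image: bitnami/kubectl         # any image with kubectl on its PATH works
        command:
        - kubectl
        - delete
        - deployment,service
        - activator
        - --namespace=knative-serving
        - --ignore-not-found
EOF
```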

Example: Reversing the flow of metrics

  1. In 0.3, introduce the metrics endpoint in queue-proxy.
  2. In 0.3, have the Revision controller rewrite Deployment resources to use the new sidecar.
  3. In 0.4, start to take advantage of the fact that all Revisions have metrics endpoints.

Example: CRD versioning

  1. In 0.10, introduce the v1beta1 format (served, but not the storage version, for rollback safety).
  2. In 0.11, make v1beta1 the storage format (existing objects need a forced update to migrate).
  3. In 0.12, stop serving v1alpha1.
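A minimal sketch of the 0.11 step using the multi-version support in `apiextensions.k8s.io/v1beta1` (the real Knative CRDs carry much more schema; only the `served`/`storage` toggles matter here):

```sh
cat <<EOF | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: services.serving.knative.dev
spec:
  group: serving.knative.dev
  names:
    kind: Service
    plural: services
  scope: Namespaced
  versions:
  - name: v1alpha1
    served: true     # 0.12 flips this to false
    storage: false   # 0.11: no longer the storage version
  - name: v1beta1
    served: true
    storage: true    # 0.11: new writes are persisted as v1beta1
EOF
```

Flipping `storage` only changes how new writes are persisted; objects already stored as v1alpha1 stay that way until rewritten, which is the forced update noted in step 2.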

The `knative-ingressgateway` is another example where we should employ this staged migration.

@knative-prow-robot added the area/test-and-release, kind/doc, kind/feature, and kind/process labels on Dec 5, 2018
@mattmoor (Member, Author) commented Dec 5, 2018

cc @vaikas-google @evankanderson

@mattmoor (Member, Author) commented Dec 5, 2018

cc @dprotaso

Perhaps this influences your thinking around how we'd roll out CRD sub-resources?

@mattmoor (Member, Author) commented Dec 5, 2018

cc @tcnghia

Can you think through how we rework the ingress gateway change so that it's safe?

@evankanderson (Member) commented

Can we change the upgrade tool to be `kubectl apply --prune --filename foo.yaml` rather than `kubectl apply --filename ...`? (Where `--prune` will clean up other resources in the namespace with the given label.)

@jonjohnsonjr (Contributor) commented

If we add a Job for releases that do part of an upgrade, we'll need to write the inverse of that Job to do the downgrade. These seem mutually exclusive, unless we write the Job in a way that knows if it's an upgrade or a downgrade (which seems hard).

With this approach, it seems like we may need different yamls for fresh install vs upgrade vs downgrade.

@mattmoor (Member, Author) commented Dec 9, 2018

@evankanderson Yeah, I had this in one of the examples:

> In 0.4, exclude the activator components from the release (and either `--prune` should clean up, or a Job should)

@jonjohnsonjr Ack. Rollback safety in general requires us to introduce a component the release before we enable it by default, which is why I included this caveat:

> [1] - If we want rollback safety, then in some cases we may want this to be `a < b`.

Looking at the duality of upgrade/downgrade, this makes sense: you want everyone off the new component before you remove it. If an upgrade adds it, then its downgrade would remove it. If the same upgrade also moves folks onto it, then the downgrade will race to move things off before it is deleted. This is the same reason we must stage component removals.

@mattmoor (Member, Author) commented Dec 9, 2018

Note that for us to start using `--prune`, we will have to label everything in a release before we start removing anything. From the `kubectl apply` help text:

> `--prune=false`: Automatically delete resource objects, including the uninitialized ones, that do not appear in the configs and are created by either apply or create --save-config. Should be used with either -l or --all.
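Concretely, assuming a hypothetical `serving.knative.dev/release` label stamped on every object in the manifest, the upgrade command might look like:

```sh
# Every object in serving.yaml would carry the (hypothetical) label:
#   metadata:
#     labels:
#       serving.knative.dev/release: "0.4"
#
# Selecting on label *presence* means the command doesn't change per release:
kubectl apply --prune -l serving.knative.dev/release -f serving.yaml
```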

@evankanderson (Member) commented

`--prune` should also work with the `--all` flag to completely pave over the namespace. This only works if we have a single YAML (or set of YAMLs) that completely defines the namespace contents, with no objects in other namespaces.
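In that mode no label is needed, at the cost of pruning anything in scope that the YAML omits (again assuming a hypothetical serving.yaml that fully defines the install):

```sh
kubectl apply --prune --all -f serving.yaml
```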

@mattmoor (Member, Author) commented

So this process is touched on here, and will be further codified in a template that @eallred-google is working on.

Beyond that, I think the serving operator will help roll out changes with non-trivial update mechanics without the multi-cycle staging, so we will evolve this over time.
